The setup:
I have an SSIS package that runs a relatively basic ETL setup. We take data from TableA and move it to TableB. TableB is a heap; we index it in our ETL and drop it afterwards. Now we join TableB to TableC. The reason we don't go straight from A to C is the data relationship: it can cause a cascading join that creates BILLIONS of reads when we plug in some WHERE predicate values.
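For reference, a minimal T-SQL sketch of that staging pattern (table, column, and index names are placeholders, and I'm reading "drop it afterwards" as dropping the index):

    -- Load the staging heap from the source table
    INSERT INTO dbo.TableB (KeyCol, Payload)
    SELECT KeyCol, Payload
    FROM dbo.TableA;

    -- Index the heap so the join to TableC behaves, then drop the index once the ETL is done
    CREATE NONCLUSTERED INDEX IX_TableB_KeyCol
        ON dbo.TableB (KeyCol);

    -- ... extract query joining TableB to TableC runs here ...

    DROP INDEX IX_TableB_KeyCol ON dbo.TableB;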
Once TableB is joined to TableC, we create our file out of that data and take it away to import-land, where we do important import stuff. This is the match that lit the fire.
The error:
Snapshot isolation transaction failed in database ‘%.*ls’ because the object accessed by the statement has been modified by a DDL statement in another concurrent transaction since the start of this transaction. It is disallowed because the metadata is not versioned. A concurrent update to metadata can lead to inconsistency if mixed with snapshot isolation.
MSDN on the error.
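Worth knowing: read-only queries on a readable secondary replica are automatically run under snapshot isolation, regardless of the session's isolation level, which is why this error can show up even if the package never asks for snapshot isolation. You can check the database-level settings with a query like this (database name is a placeholder):

    SELECT name,
           snapshot_isolation_state_desc,
           is_read_committed_snapshot_on
    FROM sys.databases
    WHERE name = N'MyETLDatabase';  -- placeholder database name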
The job ran three times. The first two runs failed with the error above; on the third run, the job succeeded but no rows were imported.
The where:
I looked at the following timestamps:
– Job failure and SSIS package failure timestamps
– Availability Group log timestamps
– Staging table (Table B) timestamps
– Destination table (Table C) timestamps
– Historical query data timestamps
SSIS failed only milliseconds after an error appeared in the Availability Group logs. I also had a mismatch of timestamps between the staging table and the destination table: new data existed in staging but not in the destination table, nor in the destination file.
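Most of those timestamps can be pulled with a couple of queries; something along these lines (the job name and the LoadDate columns are placeholders for whatever your tables actually carry):

    -- Job outcome timestamps from SQL Agent history
    SELECT j.name,
           msdb.dbo.agent_datetime(h.run_date, h.run_time) AS run_start,
           h.run_status                         -- 0 = failed, 1 = succeeded
    FROM msdb.dbo.sysjobhistory AS h
    JOIN msdb.dbo.sysjobs AS j
        ON j.job_id = h.job_id
    WHERE j.name = N'My ETL Job'                -- placeholder job name
      AND h.step_id = 0                         -- job outcome rows (step 0)
    ORDER BY run_start DESC;

    -- Latest row timestamps in staging vs. destination (placeholder column)
    SELECT MAX(LoadDate) AS staging_max      FROM dbo.TableB;
    SELECT MAX(LoadDate) AS destination_max  FROM dbo.TableC;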
I found a link, however, that gave me an idea!
What if the query taking data from TableB and TableC to a file is what was causing the issue, and not something running in parallel blocking me? The job has a flow that should not cause issues like this; there is no parallelism in this package. It is acting, though, as if I were using two connections, one read-only and the other R/W, and I'm holding metadata locks on the table.
I looked at the Availability Group logs and voila! Lock_redo_blocked: the redo thread waiting on a SCH_M lock against the metadata of the table in question.
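That event comes from the built-in AlwaysOn_health extended events session on the secondary. A rough way to pull those events back out of the session's files (the file pattern resolves to the instance's log directory; adjust if yours differs):

    -- Read lock_redo_blocked events captured by the AlwaysOn_health session
    SELECT object_name,
           CAST(event_data AS xml) AS event_xml,
           file_name
    FROM sys.fn_xe_file_target_read_file('AlwaysOn_health*.xel', NULL, NULL, NULL)
    WHERE object_name = N'lock_redo_blocked';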
The why:
Let’s go back to the setup, this time I’m going to make comments between the steps to help illustrate what’s going on.
I have a SSIS package and it runs a relatively basic ETL setup. We take data from TableA, move it to TableB.
This is on the primary node of a synchronous availability group; we have R/W privileges here.
TableB is a heap, we index it in our ETL and drop it afterwards.
Same deal, we are on the R/W connection in the AG.
Now we join TableB to TableC.
Rubber… meet road, road… meet rubber.
The changes from node A are not being redone on node B fast enough, and the redo thread on the secondary has to take out a SCH_M lock on the table's metadata to replay our DDL; that is what causes our error. The extract source always reads from the secondary node, which is in synchronous-commit mode.
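You can see this lag directly on the secondary; the redo queue and redo rate are exposed per database in the AG DMVs:

    -- How far behind is redo for each database in the availability group?
    SELECT DB_NAME(drs.database_id)         AS database_name,
           ar.replica_server_name,
           drs.synchronization_state_desc,
           drs.log_send_queue_size,          -- KB of log not yet sent to the secondary
           drs.redo_queue_size,              -- KB of log hardened but not yet redone
           drs.redo_rate                     -- KB/sec currently being redone
    FROM sys.dm_hadr_database_replica_states AS drs
    JOIN sys.availability_replicas AS ar
        ON ar.replica_id = drs.replica_id;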
Woah, wait! Your package succeeded the third time… but there were no rows? Your staging table must have had no records then.
Not quite; we had rows there, and the table we were joining to had matches as well. But we were querying the staging table on the secondary, and the rows were not present there.
Ok… that doesn’t make sense. You are in synchronous-commit mode; node A doesn’t report the commit as complete until node B has hardened it.
True, but per Microsoft there is a catch with this synchronization, and it is not error-proof.
MSDN on why Availability Groups don’t love you anymore.
The client application completes an update on the primary replica successfully, but querying the secondary replica shows that the change is not reflected. This case assumes that your availability group has a healthy synchronization state. In most cases, this behavior resolves itself after a few minutes.
If changes are still not reflected on the secondary replica after a few minutes, there may be a bottleneck in the synchronization work flow. The location of the bottleneck depends on whether the secondary replica is set to synchronous commit or asynchronous commit.
Those are the top paragraphs on that page. Ok… so that means it is possible to read no rows from the secondary even though they already existed on the primary, because the AG had not yet replicated the change or was still locking the schema to do so. Why?
Each successful update on the primary replica has already been synchronized to the secondary replica, or the log records have already been flushed for hardening on the secondary replica. Therefore, the bottleneck should be in the redo process that happens after the log is flushed on the secondary replica.
Bottlenecks. Alright, what are the causes of bottlenecks?
Well, MSDN suggests long-running active transactions, high network latency or low throughput, another workload blocking the redo thread, or the redo thread falling behind because of resource contention.
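The "another workload blocks the redo thread" case lines up with the lock_redo_blocked events above, and it is easy to confirm on the secondary: the redo thread shows up in sys.dm_exec_requests with a command of DB STARTUP, so a blocked redo looks something like this:

    -- On the secondary: is anything blocking the redo thread?
    SELECT r.session_id,
           r.command,                 -- the redo thread reports 'DB STARTUP'
           r.blocking_session_id,     -- non-zero means redo is being blocked
           r.wait_type,
           DB_NAME(r.database_id)     AS database_name
    FROM sys.dm_exec_requests AS r
    WHERE r.command = N'DB STARTUP';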
The solution:
Ultimately, my solution was to change the data source from the read-only secondary to the primary node assigned by the listener, on an R/W connection. Now I don’t care if the secondary is lagging behind in this particular instance, but it looks like I have some query tuning to do!
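If the extract connection was being routed to the secondary by read intent (an assumption on my part; yours might point at the secondary's server name directly), the change is roughly this in the connection string, with the server and database names as placeholders.

Before (ApplicationIntent=ReadOnly lets the listener route the read to the secondary):

    Server=AGListener;Database=MyETLDatabase;ApplicationIntent=ReadOnly;Integrated Security=SSPI;

After (default ReadWrite intent keeps the extract on the primary):

    Server=AGListener;Database=MyETLDatabase;Integrated Security=SSPI;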
I have added a form for feedback; it would be very helpful if you would take a few minutes to fill it out. My goal is to adjust the content to best help others resolve their problems.