SSIS Archives - Jon Shaulis - Data Professional

T-SQL Tuesday #110 Automate your Toggl dataset with SSIS

Something I’ve automated* recently was my import of Toggl data! So T-SQL Tuesday #110 seems like a great time to post this solution. Thank you Garry Bargsley for the great idea and being this months host! I put an asterisks next to automated, as obtaining the file from Toggl and placing its location is not … Continue reading “T-SQL Tuesday #110 Automate your Toggl dataset with SSIS”

I put an asterisks next to automated, as obtaining the file from Toggl and placing its location is not automatic. They provide a premium feature to allow this or you could work on creating your own web scraper. (Which is probably against their rules of use on their site, since they are offering the file drop as a premium feature.)

So while that piece is not automatic, the ingestion, cleanup, and archive of data from the Toggl file is. Generally when I do automate a task, I will use a combination of Powershell, SSIS, BIML, Python, T-SQL, Bash, and whatever else may be necessary to get the job done.

Steve Jones got me hooked on Toggl back in July and I found it pretty easy to use. Since then I’ve begun creating analytics on tasks at my job and also in my life. I log my study time, blog work, and other extra curricular so I can begin to cultivate a data set about how I use my time.

Another thought, is that you can use Toggl with analytics to give your stakeholders insight into how your time is spent. Maybe you need to be left out of meetings because 25% of your days is spent in them? Maybe you have been doing too many projects for Todd from accounting? This gives your stakeholders a true insight into your day using tools to easily capture and identify your pain points which can be optimized.

Depending on your Toggl setup, you may not want to be very detailed with the information you provide. I personally normalize my data which is input into Toggl so it’s meaningless numbers without my master map which resides on my local computer. I want to keep my clients data secure and this information is hosted on Toggl’s servers. So that’s some food for thought, be mindful of what you are inputting into this online tool and what you’ve signed as far as Non-disclosure agreements.

So onto the SSIS package!

DISCLAIMER!!!

This should be done in a personal or safe environment, never production!

Created with Visual Studios 2017 Version 15.8.5, .NET Framework V 4.7.03056.

DBMS is Microsoft SQL Server 2017 (RTM) – 14.0.1000.169 (X64)

What does this SSIS package do? Well very simply, you export your detailed year view of Toggl (which is typically a pretty small data set, at least from my perspective), place it in a location, set your SSIS parameter for the folder location, create a new database or use an existing one and set it in the project connections, and let it rip.

Here is how you get your data from Toggl:

The home page for Toggl, click the Go to timer button at the top right.

*Once you arrive at Timer, you can click on Reports to the top left corner.*

Once at Reports, click Detailed at the top of the page towards the left, mosey over to the date drop down option on the right hand side and select "This year" or "Last year" depending on the data set you need to import, then hover over Export just below the year setting and select "Download CSV" — Once at Reports, click Detailed at the top of the page towards the left, mosey over to the date drop down option on the right hand side and select “This year” or “Last year” depending on the data set you need to import, then hover over Export just below the year setting and select “Download CSV”

Here is the final destination of the file I downloaded. You can place the file pretty much anywhere, just notate the location of the file for later.

Here is a link to my Github Project: https://github.com/Jonathanshaulis/Toggl_Import , so you can either fork or download the SSIS files. Place them somewhere on your computer where you want to store SSIS packages.

Once the files are downloaded and stored, you can open up the Solution file by double-clicking it. (Note: You will need Visual Studios installed along with SSDT including SSIS.)

Explanation of the SSIS package

The package starts by checking for the existence of the archive and staging schemas, it will create them if they don’t exist. It will then do the same for the tables, we have the primary table we use, the staging table, and the archive table for when you have years of data stockpiled over time.

We’ll truncate staging and then load the staging table immediately after. Using the base data set from Toggl, I found it difficult to analyze the summed durations, so I create a new column on import that takes the duration and turns it into a decimal. This made it more convenient for me to use Power BI to report off of.

Also in the Data Flow (DFT – File to Staging) we have two file counts occurring, one counts errors and the other counts records that flowed to the destination.

Errors are logged to DFTStagingCountErr and the successful imports are logged to DFTStagingCount.

Afterward, we gather a count of the rows in the staging table and assign the count to the variable ESTStagingCount.

There is a condition check after the count that will stop the process if the two variables DFTStagingCountErr and DFTStagingCount added together do not match the count of the variable ESTStagingCount. This is important because it ensures a successful import has occurred before it continues onward and truncates our primary table to perform a full load.

Once we are sure all records made it to our staging table, we truncate our destination and move our file to our “production” table. This gives a complete refresh of the data for the year, but this also serves a conundrum.

What happens when we have two years of data?

The final step archives the data set that you’ve imported.
The query performs a “merge upsert” with a not matched deletion at the end.
If a row matches exactly between the archive table and production table, it will update the Toggl_ID to keep the latest version of that row.
If a row is unmatched on all columns, it will interpret that as a new row and insert it.
If a row is unmatched on all columns and the date of the row matches the max date found in the production table, it will delete out the row. This logic is to allow for data changes to flow to the archive table making the hosted Toggl data source the golden data source. This means rows in 2018 would not be deleted from the archive source once you import a 2019 data set, but it would delete out a 2019 row if you changed the data and it would import the newly corrected row.

So how do I run it when I have it open?

The SSIS package is open here with two markers, #1 and #2. Down below I give details about how to adjust these two pieces in the SSIS package. After they have been adjusted, you can run the package assuming all other steps were accomplished.

#1 This is the properties of the Connection Manager (project)Local.

You’ll want to edit the Local connection manager to point to your instance. I would recommend to create a new database, something like “Career”. (I will be uploading more blog posts regarding Brent Ozar’s Salary Survey and creating a database portfolio in SSIS using the Career DB later.)

#2 This is the parameters screen to adjust the file location.

The last modification you need to make is to adjust the value for the Parameters Property File_Location. It doesn’t matter if you include the final backslash or not at the end of the string, I have error handling in an expression to make sure it is able to take either variation. Just make sure your file is in that location!

Voila! That’s it!

Download SSIS package / Toggl File.
Place SSIS package and Toggl File in ideal locations.
Open SQL Server, create a new Database called Career or anything else you choose, or just use an already existing database.
Open the SSIS package, adjust the Local connection to your instance and database.
Edit the Parameter for the File_Location and point it to your file.

F.A.Q. and Troubleshooting

If you have questions or run into issues, please let me know and I’ll see what I can do to help troubleshoot. I’ll modify this section as time goes on and as changes are needed.

Where did I put that thing in my SSIS project?

You may inherit a large SSIS project or create / use a large SSIS project someday, or perhaps there’s just a lot of packages. Either way, there’s going to come a day and time where you have to find something in your SSIS packages.

Say you need to remove a column in a table but you’re not sure where that column is referenced in your SSIS packages? Perhaps you are having metadata issues and you need to find every reference to that column? Or maybe you want to see what packages reference what tables, etc.

There are numerous ways to slice and dice this, I created essentially a powershell grep function to cursor through my dtsx packages and find strings referenced.

Here it is!

# Variables you should alter to your environment
$DirectoryToTarget="C:\Git\SSIS\PackageSets1\"
$WordToFind="TableSourceOne"
$PSGrepLog="C:\LogOutputs\PSGrepLog.txt"

# You will probably leave this variable alone. Set to search SSIS. 
$FilterType="*.dtsx"

Clear-Content $PSGrepLog

Get-ChildItem -Path $DirectoryToTarget -Filter $FilterType -Recurse | where { !$_.PSIsContainer } | % { 
    $file = Get-Content $_.FullName
    $containsWord = $file | %{$_ -match $WordToFind}
    If($containsWord -contains $true)
    {
        Add-Content $PSGrepLog $_.FullName
    }
}

Few notes here. It is not perfect, it is not case sensitive, and it will cursor child directories. It can take a few minutes to run depending on your system and the size of the directory. I would not recommend running this on a production server. You should copy the files you need to a nice location away from production and run this.

Once you have a list of packages with references, you can edit each package with a notepad editor or visual studio will let you examine the XML form. From here you can do a CTRL+F function on your keyboard and type in your searching keyword. It will take you to each instance / reference in your package, making it substantially quicker to find those rogue references.

Happy developing!

Edit on 5/4/18.

I have found out that visual studios actually offers the ability to perform the same functionality at a project level. I think my code is still helpful for particular situations, but you should be able to use the menu and find feature to search through all other packages too for your reference. I’m definitely saving my code as it should work across file types with a few minor tweaks.

Snapshot isolation transaction failed in database (Availability Group and SSIS woes)

The Setup:

I have a SSIS package and it runs a relatively basic ETL setup. We take data from TableA, move it to TableB. TableB is a heap, we index it in our ETL and drop it afterwards. Now we join TableB to TableC. The reason we don’t go from A to C is because of the data relationship, it can cause a cascade join that creates BILLIONS of reads when we plug in some where predicate values.

Once TableB is joined to TableC, we create our file out of that data and take it away to import-land, where we do important import stuff. This is the match that lit the fire.

The error:

Snapshot isolation transaction failed in database ‘%.*ls’ because the object accessed by the statement has been modified by a DDL statement in another concurrent transaction since the start of this transaction. It is disallowed because the metadata is not versioned. A concurrent update to metadata can lead to inconsistency if mixed with snapshot isolation.

MSDN on the error.

The job ran three times. On the third time, the job succeeded but no rows were imported. The previous two runs elicited the same error.

The where:

I looked at the following timestamps:

– Job failure and SSIS package failure timestamps
– Availability Group log timestamps
– Staging table (Table B) timestamps
– Destination table (Table C) timestamps
– Historical query data timestamps

SSIS failed only milliseconds after an error in the Availability Group logs appeared. Likewise, I had a mismatch of timestamps between the Staging table and Destination table. New data exists in staging but not in the destination table nor the destination file.

I found a link however that gave me an idea!

What if the query taking data from Table B and Table C to a file is what was causing the issues and there is not something running in parallel stopping me? The job had a flow that should not cause issues like this, there is no parallelism in this package. It is acting though as if I am using two connections and one connection is a readonly and the second connection is R/W and I’m holding metadata locks on the table.

I looked to the Availability Group logs and voila! Lock_redo_blocked, creating a SCH_M lock on the Metadata of the table in question.

The why:

Let’s go back to the setup, this time I’m going to make comments between the steps to help illustrate what’s going on.

I have a SSIS package and it runs a relatively basic ETL setup. We take data from TableA, move it to TableB.

This is on the primary node of a synchronous availability group, we have R/W privileges here.

TableB is a heap, we index it in our ETL and drop it afterwards.

Same deal, we are on the R/W connection in the AG.

Now we join TableB to TableC.

Rubber… meet road, road… meet rubber.

The primary AG node is having issues replicating the changes from node A to node B fast enough and is taking out a SCH_M lock on the table which is causing our error. The extract source is based off of the secondary node always, which is in synchronous mode.

Woah, wait! Your package succeeded the third time… but there were no rows? Your staging table must have had no records then.

Not quite, we had rows there. The table we were joining to had matches as well. We are querying the staging table on the secondary and the rows are not present.

Ok… that doesn’t make sense. You are in synchronous commit mode, node A persists as soon as node B does.

True, but there is a problem with this replication as per Microsoft and it is not error proof.

MSDN on why Availability Groups don’t love you anymore.

The client application completes an update on the primary replica successfully, but querying the secondary replica shows that the change is not reflected. This case assumes that your availability has a healthy synchronization state. In most cases, this behavior resolves itself after a few minutes.

If changes are still not reflected on the secondary replica after a few minutes, there may be a bottleneck in the synchronization work flow. The location of the bottleneck depends on whether the secondary replica is set to synchronous commit or asynchronous commit.

Those are the top paragraphs on that page. Ok… so that means it is possible that we read no rows from the secondary when they existed on the primary before the AG was able to replicate or lock the schema. Why?

Each successful update on the primary replica has already been synchronized to the secondary replica, or that the log records have already been flushed for hardening on the secondary replica. Therefore, the bottleneck should be in the redo process that happens after the log is flushed on the secondary replica.

Bottlenecks. Alright, what are the causes of bottle necks?

Well, the MSDN suggests that long running active transactions, high network latency or low throughput, another workload blocks the redo thread, or redo thread falls behind because of resource contention.

The solution:

Ultimately my solution was to change the data source from the read only secondary to the primary node assigned by the listener on a R/W connection. Now I don’t care if the secondary is lagging behind in this particular instance, but it looks like I have some query tuning to do!

I have added in a form for feedback, it would be very helpful if you would take a few minutes to fill this out. My goal is to adjust the content to best help others resolve their problems.

Share this:

Share this:

The Setup:

The error:

The where:

The why:

The solution:

Share this: