When monitoring SSIS package execution it is common to record the execution times of SSIS tasks, but execution time alone only tells you how long a task took to complete, not how much work it did. Consider reviewing the execution times of a data flow task only to find that the current execution took an hour longer than all previous runs. Is that a problem? Without a way to quantify the work done during that time span it is difficult, if not impossible, to compare executions based solely on execution time.
Recording the number of rows that move through a data flow task can be done with custom logging, please refer to this post, but in SSIS 2012 it can also be captured by configuring the logging level of the SSIS Catalog. One of the differences between deploying a project to the SSIS Catalog and deploying a package to the file system or the MSDB database is that project deployment logs events inherently, while package deployment requires logging to be configured during development of the package. The SSIS Catalog supports four logging levels: None, Basic, Performance, and Verbose; details of the logging levels can be found on Technet here.
To capture row counts within a data flow task you must change the default logging level from Basic to Verbose. Basic logging records all events except custom and diagnostic events, and it does not capture row counts within the data flow. You can change the level by right-clicking the SSIS Catalog in SQL Server Management Studio, or by using T-SQL:
EXEC [SSISDB].[catalog].[configure_catalog]
    @property_name = N'SERVER_LOGGING_LEVEL',
    @property_value = 3;
GO
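If you would rather not change the server-wide default, the logging level can also be set for a single execution through the LOGGING_LEVEL execution parameter when you start the package from T-SQL. A minimal sketch; N'MyFolder' and N'MyProject' are placeholder names for your own catalog folder and project:

```sql
-- Create an execution, set its logging level to Verbose (3), and start it.
-- N'MyFolder' and N'MyProject' are placeholders for your catalog folder
-- and project names.
DECLARE @execution_id BIGINT;

EXEC [SSISDB].[catalog].[create_execution]
    @folder_name  = N'MyFolder',
    @project_name = N'MyProject',
    @package_name = N'SimpleDataFlow.dtsx',
    @execution_id = @execution_id OUTPUT;

EXEC [SSISDB].[catalog].[set_execution_parameter_value]
    @execution_id,
    @object_type     = 50,               -- 50 = execution-level parameter
    @parameter_name  = N'LOGGING_LEVEL',
    @parameter_value = 3;                -- Verbose

EXEC [SSISDB].[catalog].[start_execution] @execution_id;
```

This keeps the Verbose overhead scoped to the one run you are investigating rather than every execution on the server.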
I deployed a project to my SSIS Catalog that has a package called SimpleDataFlow. This package has a single data flow task that takes the FirstName, MiddleName, and LastName columns from the Person.Person table in the AdventureWorks2012 database and transfers the rows to a flat file.
Having worked with the Person.Person table, I know for a fact there are 19,972 rows, as does anyone who has attended any of my classes. What can I say? I LOVE THAT TABLE! After executing this package from the SSIS Catalog, I can query the internal.execution_data_statistics table and find the rows that passed from the source to the destination.
SELECT execution_id, package_name, source_component_name,
       destination_component_name, rows_sent, created_time
FROM [internal].[execution_data_statistics]
Reviewing the results may be a bit confusing, and the first question I am most often asked is “I only ran the package once. Why does it show 4 executions?” If you look at the execution_id it is evident that there was only one execution; the rows are counted as they move through the memory buffers from source to transformation to destination. You can get the sum of the rows that passed through the data flow by rewriting the query to use the SUM aggregate function.
SELECT SUM(rows_sent) rows_sent, execution_id
FROM [internal].[execution_data_statistics]
GROUP BY execution_id
Seems pretty easy, right? Let’s mix it up a bit and throw a derived column transformation into the data flow between the source and destination. The transformation will concatenate the first name and last name and send the result to our flat file.
Re-deploy the project, re-execute the package and query the internal.execution_data_statistics table once again and things get a bit more confusing.
We now have 8 rows that, when aggregated, show a total of 39,944 rows_sent?? Modify the query to focus on the most recent execution, in this case execution_id 10, and add an ORDER BY clause on created_time, and this will begin to make a bit more sense.
SELECT execution_id, package_name, source_component_name,
       destination_component_name, rows_sent, created_time
FROM [internal].[execution_data_statistics]
WHERE execution_id = 10
ORDER BY created_time
Now we can see that row one shows 9,947 rows going from the OLE DB Source to the Derived Column transformation, and row two shows 9,947 rows going from the Derived Column transformation to the Names destination. You can follow the flow from buffer to buffer, beginning at the source and continuing through to the destination. Aggregating rows_sent is now a bit tricky, as the row count has doubled because a transformation was added.
You might think this can easily be overcome by simply dividing the result by two. That works in this specific instance, but keep in mind that this IS NOT always the case. Some transformations output more rows than they receive, such as an Unpivot transformation and possibly a Script Component, and some output fewer rows than they receive, such as a Pivot transformation and possibly a Script Component. Using simple division in those cases does not give an accurate count of the rows that travelled from component to component.
To ensure that you report the rows that travelled from one component to the next, modify the query to group by the source and destination components.
SELECT source_component_name source,
       destination_component_name destination,
       SUM(rows_sent) AS rows_sent
FROM [internal].[execution_data_statistics]
WHERE execution_id = 10
GROUP BY source_component_name, destination_component_name
The results now accurately reflect the sum of the rows as they travelled from source to transformation to destination, giving you a quantitative measure for analyzing data flow task execution times based on workload.
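To tie that workload measure back to execution time, the row statistics can be joined to the catalog.executions view, which records the start and end time of every catalog execution. A sketch; as discussed above, summing rows_sent counts every buffer hop, so this rough rows-per-second figure is best used to compare runs of the same package against each other:

```sql
-- Rows moved per second for each execution: a rough workload metric
-- for comparing runs of the same package over time.
SELECT e.execution_id,
       e.package_name,
       SUM(s.rows_sent)                           AS total_rows_sent,
       DATEDIFF(SECOND, e.start_time, e.end_time) AS duration_seconds,
       SUM(s.rows_sent)
           / NULLIF(DATEDIFF(SECOND, e.start_time, e.end_time), 0)
                                                  AS rows_per_second
FROM [SSISDB].[catalog].[executions] AS e
JOIN [SSISDB].[internal].[execution_data_statistics] AS s
    ON s.execution_id = e.execution_id
GROUP BY e.execution_id, e.package_name, e.start_time, e.end_time;
```

Now an execution that runs an hour longer than its predecessors can be judged fairly: if it also moved proportionally more rows, the extra time may be nothing to worry about.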
After getting you fired up about using the SSIS Catalog and its built-in Verbose logging, I have to let a little wind out of your sails. Verbose logging comes with overhead and should be limited to debugging and troubleshooting. Matt Masson wrote a blog about logging levels here, and he literally wrote the book.
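Once you are done troubleshooting, you can drop the server default back to Basic with the same procedure used earlier; Basic corresponds to property value 1:

```sql
-- Return the catalog to the default Basic logging level (1)
EXEC [SSISDB].[catalog].[configure_catalog]
    @property_name = N'SERVER_LOGGING_LEVEL',
    @property_value = 1;
GO
```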
The sample package can be downloaded here.