Your Report 2.0: A Data Engineer’s Guide to Migrating BI Reports from Old to New Systems: Part 2, Building the Extract

Requirements in hand, you are ready to start the fun of actually building the report extract to match the data mapping contract document you produced earlier!

If you haven’t yet, I highly recommend reading part one of this series here: Part 1: Requirements Gathering

We left off completing the gap analysis for the report.  More than likely, this activity resulted in needing to make several changes in your new system to support the report.  For example, if you are sourcing from a newly built Data Warehouse, it is possible that the Data Warehouse does not have all the data that you need.  Thus, there might be enhancements required to the Warehouse before you can source the data.  Or, if you are sourcing from a Datamart that feeds off a Data Warehouse, you might need to modify the ETL that moves data to the Datamart.  However you source your data, the first step before building the extract is to get the data you need loaded to location you are going to source the report off of.

Sample Data Flow

Some other considerations at this point:

  1. Are your ETL’s running frequently enough to support reporting refresh requirements? For example, if your report requires to be refreshed hourly, but your ETLS that load the Datamart only run overnight, you are likely going to need to update the schedule.
  2. Are there any aggregations that should be performed in the ETL rather than in the report itself? If you have a blazing fast Data Warehouse, you should use it!
  3. Do you have all the historic data you need? Does the historic data change? If not, perhaps consider hosting a historic view of the data somewhere for speedier consumption by the reporting tool.  Or, just keeping that historic data in the tools data store itself.  We will discuss incrementals further in the next sections.
  4. Your goal is to get the data into the easiest and most simple form as possible for consumption into the report. Try to avoid using the reporting tool as another layer of ETL, as it wasn’t built for that purpose.

Once you have all the data in the final location you are going to source the report from, you are ready to write the actual report extract. If you have completed all the previous steps, this step should be quick and easy! Your report extract will consist of writing some sort of a view or stored procedure that exposes the data in the exact format the reporting tool requires.  The more you can expose the data to exactly what the report needs, the faster the report will run, and the snappier it will be on the front end.  I always suggest decoupling the extract from the source tables in the Data Warehouse or Data Mart, such that if you have to make changes to the model, it doesn’t affect the report extracts. You can do this easily in the form of a view or a stored procedure.

If you are going to run the extract for just certain period of time, or if the report you are building takes a historic look at the data for all time, you are likely going to want to implement the extract as an incremental procedure.  Elsewise, it will take an extremely long time for the extract to run, and even more time for the data to cross the internet and land in the report.  The easiest way (but not the only way, and there are always exceptions) to set up incrementals is the following:

  1. Make sure you have metadata on the tables in your Data Warehouse or Datamart.  At the least, this should include a ‘created date’ and a ‘last updated date time’.  This is a best practice for data management in general, not just for writing report incrementals.
  2. Pick your ‘main’ table / unique identifier.  What is the main table you want to drive the incrementals? For example, if you have a report that reports on Emergency Room visits, it will likely be your main ‘Visit’ table.
  3. Have the reporting tool pass your extract procedure the MAX last updated date that it has been sent by previous runs.
  4. Your procedure should take the MAX last updated date as a parameter, and only return records that have a ‘last updated date’ > the MAX of the last update.
  5. The report will have to do a merge into whatever data store it uses one the back-end. By merge, I mean if the record exists, update it, if not, insert it. There is SQL syntax for this that can be found here.

If you are getting merge errors in your incrementals, you might want to read my blog post on troubleshooting this issue:

https://datastrides.wordpress.com/2018/09/19/first-post-identifying-duplicates-and-fixing-incremental-merge-errors/

The benefits of incrementals are immense – it will result in much fast processing and snappier report refreshes. Note that if you don’t have the ability to do the merging on the reporting tool end, and you are not dealing with millions of records, you can do the merging into a permanent table on your Warehouse or Mart that will house this data in the format that the report requires.  Then you can have the report pull all of the data. This should only be done with reports that do not use large volumes of data.

As you are running through all of these considerations, be mindful of any changes you are making and how they will affect the reporting tool.  Sometimes the requirement is to not change the report code as much as possible, in which case you should strive to match the old process / extract as closely as possible.  This usually happens when you have a short time frame.  However, if there are performance improvements / bugs you can fix by using the above points, and you have the time, definitely consider them!

Also, ALWAYS include header comments / comment your code! This gives full visibility to what the code does, what changes were made, when, and why they were made.  I use something simple like this, but feel free to whatever fits your aesthetic:

---------------------------------------------------------------
--AUTHOR:  Ryan Kennedy
--CREATED: 02/16/2019
--DESC:    Exposes the data extract for the Very Important Business Report
--UPDT:
--[Ryan K - 2/17/19] Descriptive comment of changes made and why
---------------------------------------------------------------

Once the extract is ready to go, you can move on the Phase 3: Unit testing and QA. (Coming soon)

First Post – Identifying Duplicates and Fixing Incremental Merge Errors

Figured I would start small with my first post, and share a simple TSQL query that I often find myself using to identify duplicates. If you have ever written an incremental merge script, you have likely gotten an error like the following:

merge_error

What this essentially means is that either in your target dataset or in your source dataset / query, you have a duplicate record on the key field you are merging on.  Thus, the code is trying to update the same record multiple times, which causes an error.  Take this simple table:

base_table

As you can see, there are duplicate OrderIDs in the above table. The code snippet below is one very quick way to identify duplicates, and will return any OrderIDs that are duplicated. You can run it on either the source or the target, just make sure whatever you use in the group by is the key you are merging on, or the field that you are trying to identify as duplicated.

SELECT OrderID
FROM #Orders
GROUP BY OrderID
HAVING COUNT(OrderID) > 1

identify_duplicates

You now know that OrderID 1 is duplicated in the table. If you want to find duplicates on more than one column, just add the additional columns to the select and Group By statements.

SELECT OrderID, ItemID
FROM #Orders
GROUP BY OrderID, ItemID
HAVING COUNT(OrderID) > 1

identify_duplicates_multiple

Now we know there are multiple records with OrderID of 1 and ItemID of 123.  Then, you can take one of the returned OrderIDs, and try to analyze why this row was duplicated. Run a simple select (below) on the location of the duplicates using one of the keys you identified above to check out the data.  This will return the duplicated data.

SELECT *
FROM #Orders
where OrderID = 1 --This is an identified duplicate. 

select_duplicates

You can tell if the rows are exact duplicates by adding DISTINCT to the above query.  If only one row is returned, you know the records are exactly the same, not just two rows with the same key.

SELECT DISTINCT *
FROM #Orders
WHERE OrderID = 1

select_distinct

Since one row disappeared out of the 3, it means two of the rows were exact duplicates. Your next step is to figure out why you are getting them in your result set, and whether it is valid to have duplicates on this key.  If you are doing a simple select from a table, it means your raw table has duplicates.  This becomes a question of how the table was populated, and whether or not duplicates should be allowed in this table. The above example could either be that one person bought two helmets in a single order, or there could be some sort of bug in whatever code populates the table.  The resolution will all depend on the granularity, or ‘level’ of the data you are trying to get, which I will discuss more later.

If the rows are not exact duplicates, then it means that some field is changing over the result set.  Again, the first step is to try to figure out if these duplicates are valid.  If your ETL source is a complex query, it could be an N:1 join which is causing duplication. In order to find the culprit, your goal should be to identify which column is changing through the rows of your SELECT * query.  Once you find the field, it is likely the join to get that field, or just the nature of the raw table which is causing the duplication. Above, we can identify the ‘ItemID’ as the culprit.  It is likely that you have an N:1 relationship, where many items can be in one order. However, given the above data, we also see that ItemID and OrderID can also be duplicated, so simply adding the ItemID to the MERGE key wouldn’t solve the problem.  You would have to find some other field.

Long story short – if you are doing incremental ETL, Always make sure your MERGE Key is unique!  While you are doing your analysis, you should not just add fields to the merge key to make it unique.  Everything should be done purposefully, and you should always be mindful of what ‘Level’ or granularity your data is at. Lets take a real world healthcare scenario:

Say you are sourcing your data from an EMR system (Electronic Medical Record).  You might find a many to one relationship between procedures undertaken in a surgery and the surgery itself.  That is, there can be multiple procedures done in a single surgery.  When designing your ETL and your data model, you need to decide what ‘level’ you want the surgeries to exist at. Should your target table be at the ‘Surgery’ level, or the ‘Procedure’ level? Take this data set:

surgeries_select

  • Do you want to roll them up into one record, and expand your columns to have additional fields for each procedure? This would be the surgery level, but likely won’t work unless you can only have a set X number of procedures associated to a surgery.  This is generally not a good idea.

surgeries_pivot

 

  • Do you want to just insert the ‘primary’ procedure? Is there some bit flag on the source table so you can filter on just these procedures? This is again the surgery level.

surgeries_isPrimary

  • Do you want to add the ProcedureID to the merge key / primary key so you can keep the records at the procedure level and avoid issues on the merge key? This would keep the data at the procedure level. Your code might look something like this:
MERGE [datawarehouse].[Surgeries] AS TARGET
USING (
	SELECT [SurgeryID]
		,[ProcedureID]
		,[SurgeryDate]
		,[IsPrimary]
	FROM [emr].[SurgeryActuals]
	) AS Source
	ON Target.[SurgeryID] = Source.[SurgeryID]
		AND Target.[ProcedureID] = Source.[ProcedureID]
WHEN MATCHED
	THEN
		UPDATE
		SET Target.[SurgeryID] = Source.[SurgeryID]
			,Target.[ProcedureID] = Source.[ProcedureID]
			,Target.[SurgeryDate] = Source.[SurgeryDate]
			,Target.[IsPrimary] = Source.[IsPrimary]
WHEN NOT MATCHED
	THEN
		INSERT
		VALUES (
			Source.[SurgeryID]
			,Source.[ProcedureID]
			,Source.[SurgeryDate]
			,Source.[IsPrimary]
			);

As always, it will likely be up to the data consumers for what information they want displayed.  But when designing a Data Warehouse or Data Mart, I always err on the side of caution and include more data rather than less.  It can always be filtered down later.

Feel free to comment with any other tips / tricks / experience you have had with the above!