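Figured I would start small with my first post and share a simple T-SQL query that I often find myself using to identify duplicates. If you have ever written an incremental merge script, you have likely gotten an error like the following:
The MERGE statement attempted to UPDATE or DELETE the same row more than once. This happens when a target row matches more than one source row.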
What this essentially means is that you have a duplicate record on the key field you are merging on, most often in your source dataset / query (though the target is worth checking too). Because more than one source row matches the same target row, the code is trying to update the same record multiple times, which causes an error. Take this simple table:
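A minimal stand-in for such a table, so the queries in this post can be run as written. Only OrderID and ItemID come from the post; the ItemName column and every value here are assumptions for illustration.

```sql
-- Hypothetical sample data: ItemName and all values are illustrative assumptions.
DROP TABLE IF EXISTS #Orders;  -- DROP ... IF EXISTS needs SQL Server 2016 or later

CREATE TABLE #Orders
(
    OrderID  INT         NOT NULL,
    ItemID   INT         NOT NULL,
    ItemName VARCHAR(50) NOT NULL
);

INSERT INTO #Orders (OrderID, ItemID, ItemName)
VALUES (1, 123, 'Helmet'),   -- exact duplicate of the next row
       (1, 123, 'Helmet'),
       (1, 456, 'Gloves'),   -- same OrderID, different ItemID
       (2, 789, 'Jersey');   -- not duplicated
```

With this data, OrderID 1 appears three times, and two of those rows are identical in every column.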
As you can see, there are duplicate OrderIDs in the above table. The code snippet below is one very quick way to identify duplicates, and will return any OrderIDs that are duplicated. You can run it on either the source or the target. Just make sure whatever you use in the GROUP BY is the key you are merging on, or the field that you are trying to identify as duplicated.
SELECT OrderID FROM #Orders GROUP BY OrderID HAVING COUNT(OrderID) > 1
You now know that OrderID 1 is duplicated in the table. If you want to find duplicates on more than one column, just add the additional columns to both the SELECT list and the GROUP BY clause.
SELECT OrderID, ItemID FROM #Orders GROUP BY OrderID, ItemID HAVING COUNT(OrderID) > 1
Now we know there are multiple records with OrderID of 1 and ItemID of 123. Next, take one of the returned OrderIDs and try to work out why that row was duplicated. Run a simple select (below) against the table holding the duplicates, filtering on one of the keys you identified above, to look at the duplicated data.
SELECT * FROM #Orders WHERE OrderID = 1 --This is an identified duplicate.
You can tell if the rows are exact duplicates by adding DISTINCT to the above query. If only one row is returned, you know the records are exactly the same, not just two rows with the same key.
SELECT DISTINCT * FROM #Orders WHERE OrderID = 1
Since one row disappeared out of the 3, it means two of the rows were exact duplicates. Your next step is to figure out why you are getting them in your result set, and whether it is valid to have duplicates on this key. If you are doing a simple select from a table, it means your raw table has duplicates. This becomes a question of how the table was populated, and whether or not duplicates should be allowed in this table. In the above example, it could be that one person bought two helmets in a single order, or there could be a bug in whatever code populates the table. The resolution will depend on the granularity, or ‘level’, of the data you are trying to get, which I will discuss more later.
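The DISTINCT check can also be turned around to list exact duplicates directly, by grouping on every column instead of just the key. This is only a sketch: the inline rows and the ItemName column are hypothetical stand-ins for the table, and CopyCount is an illustrative alias.

```sql
-- Hypothetical inline sample; in practice, replace the CTE with your own table.
WITH Orders AS
(
    SELECT *
    FROM (VALUES (1, 123, 'Helmet'),
                 (1, 123, 'Helmet'),
                 (1, 456, 'Gloves'),
                 (2, 789, 'Jersey')) AS v (OrderID, ItemID, ItemName)
)
SELECT OrderID, ItemID, ItemName,
       COUNT(*) AS CopyCount          -- how many identical copies of the row exist
FROM Orders
GROUP BY OrderID, ItemID, ItemName    -- group on ALL columns, not just the key
HAVING COUNT(*) > 1;
```

Any row returned here is duplicated in every column. Keys that showed up in the earlier GROUP BY check but not in this one have rows that differ in at least one field.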
If the rows are not exact duplicates, then it means that some field is changing over the result set. Again, the first step is to try to figure out if these duplicates are valid. If your ETL source is a complex query, it could be an N:1 join which is causing duplication. To find the culprit, your goal should be to identify which column changes across the rows of your SELECT * query. Once you find the field, the cause is likely either the join that brings in that field or the nature of the raw table itself. Above, we can identify the ‘ItemID’ as the culprit. It is likely that you have an N:1 relationship, where many items can be in one order. However, given the above data, we see that the combination of OrderID and ItemID can also be duplicated, so simply adding the ItemID to the MERGE key wouldn’t solve the problem. You would have to find some other field.
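When the table is wide, eyeballing SELECT * for the changing column gets tedious. One possible shortcut is to count the distinct values of each non-key column per key. The sketch below again uses hypothetical inline rows and a hypothetical ItemName column; you would list your own columns.

```sql
-- Hypothetical inline sample; in practice, replace the CTE with your own table or query.
WITH Orders AS
(
    SELECT *
    FROM (VALUES (1, 123, 'Helmet'),
                 (1, 123, 'Helmet'),
                 (1, 456, 'Gloves'),
                 (2, 789, 'Jersey')) AS v (OrderID, ItemID, ItemName)
)
SELECT OrderID,
       COUNT(*)                 AS RowsForKey,
       COUNT(DISTINCT ItemID)   AS DistinctItemIDs,    -- > 1 means ItemID varies within the key
       COUNT(DISTINCT ItemName) AS DistinctItemNames   -- > 1 means ItemName varies within the key
FROM Orders
GROUP BY OrderID
HAVING COUNT(*) > 1;                                   -- only keys that are duplicated
```

A column whose distinct count is greater than 1 is changing within the key. If the distinct counts are still lower than RowsForKey, as they are for OrderID 1 here, there are exact duplicate rows as well. Note that COUNT(DISTINCT ...) ignores NULLs, so a column that flips between NULL and a single value will not show up this way.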
Long story short: if you are doing incremental ETL, always make sure your MERGE key is unique! While you are doing your analysis, you should not just add fields to the merge key to make it unique. Everything should be done purposefully, and you should always be mindful of what ‘level’ or granularity your data is at. Let's take a real world healthcare scenario:
Say you are sourcing your data from an Electronic Medical Record (EMR) system. You might find a many to one relationship between procedures undertaken in a surgery and the surgery itself. That is, there can be multiple procedures done in a single surgery. When designing your ETL and your data model, you need to decide what ‘level’ you want the surgeries to exist at. Should your target table be at the ‘Surgery’ level, or the ‘Procedure’ level? Take this data set:
- Do you want to roll them up into one record, and expand your columns to have additional fields for each procedure? This would be the surgery level, but it likely won’t work unless a surgery can only ever have a fixed number of procedures associated with it. This is generally not a good idea.
- Do you want to just insert the ‘primary’ procedure? Is there some bit flag on the source table so you can filter on just these procedures? This is again the surgery level.
- Do you want to add the ProcedureID to the merge key / primary key so you can keep the records at the procedure level and avoid issues on the merge key? This would keep the data at the procedure level. Your code might look something like this:
MERGE [datawarehouse].[Surgeries] AS Target
USING
(
    SELECT [SurgeryID]
          ,[ProcedureID]
          ,[SurgeryDate]
          ,[IsPrimary]
    FROM [emr].[SurgeryActuals]
) AS Source
    -- Merge key: one target row per surgery / procedure pair
    ON  Target.[SurgeryID]   = Source.[SurgeryID]
    AND Target.[ProcedureID] = Source.[ProcedureID]
WHEN MATCHED THEN
    -- The key columns already match, so only the non-key columns need updating
    UPDATE SET Target.[SurgeryDate] = Source.[SurgeryDate]
              ,Target.[IsPrimary]   = Source.[IsPrimary]
WHEN NOT MATCHED THEN
    -- Explicit column list so the insert does not depend on column order
    INSERT ([SurgeryID], [ProcedureID], [SurgeryDate], [IsPrimary])
    VALUES (Source.[SurgeryID], Source.[ProcedureID], Source.[SurgeryDate], Source.[IsPrimary]);
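Since the whole point is that the merge key must be unique, one option is to run the duplicate check as a guard immediately before the MERGE, so the job fails with a clear message of your own. This is a sketch that assumes the [emr].[SurgeryActuals] source used above; the error number and message text are arbitrary choices, not anything the source system defines.

```sql
-- Guard: stop before the MERGE if the source is not unique on the merge key.
-- Assumes [emr].[SurgeryActuals] exists; 50001 and the message are illustrative.
IF EXISTS
(
    SELECT 1
    FROM [emr].[SurgeryActuals]
    GROUP BY [SurgeryID], [ProcedureID]   -- same columns as the MERGE ... ON clause
    HAVING COUNT(*) > 1
)
BEGIN
    -- THROW needs SQL Server 2012 or later; user error numbers must be 50000 or higher
    THROW 50001, 'Duplicate SurgeryID / ProcedureID pairs found in source; MERGE not run.', 1;
END;
```

The GROUP BY list has to be kept in step with the ON clause of the MERGE, otherwise the guard is checking a different key from the one being merged on.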
As always, the data consumers will likely decide what information they want displayed. But when designing a Data Warehouse or Data Mart, I always err on the side of caution and include more data rather than less. It can always be filtered down later.
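As one illustration of filtering down later, a table kept at the procedure level can still be exposed at the surgery level through a view. This sketch assumes [IsPrimary] is a bit flag with exactly one primary procedure per surgery, which you would need to verify against your own data. The view name is hypothetical.

```sql
-- Hypothetical view name. Assumes one row per surgery has IsPrimary = 1.
CREATE VIEW [datawarehouse].[vw_SurgeriesPrimary]
AS
SELECT [SurgeryID]
      ,[ProcedureID]      -- the primary procedure for the surgery
      ,[SurgeryDate]
FROM [datawarehouse].[Surgeries]
WHERE [IsPrimary] = 1;    -- keep only the primary procedure, giving one row per surgery
```

If the one-primary-per-surgery assumption does not hold, the view will itself contain duplicate SurgeryIDs, which the GROUP BY / HAVING check from earlier in the post would reveal.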
Feel free to comment with any other tips / tricks / experience you have had with the above!