Your Report 2.0: A Data Engineer’s Guide to Migrating BI Reports from Old to New Systems: Part 2, Building the Extract

Requirements in hand, you are ready to start the fun of actually building the report extract to match the data mapping contract document you produced earlier!

If you haven’t yet, I highly recommend reading part one of this series here: Part 1: Requirements Gathering

We left off completing the gap analysis for the report.  More than likely, this activity resulted in needing to make several changes in your new system to support the report.  For example, if you are sourcing from a newly built Data Warehouse, it is possible that the Data Warehouse does not have all the data that you need.  Thus, there might be enhancements required to the Warehouse before you can source the data.  Or, if you are sourcing from a Datamart that feeds off a Data Warehouse, you might need to modify the ETL that moves data to the Datamart.  However you source your data, the first step before building the extract is to get the data you need loaded to location you are going to source the report off of.

Sample Data Flow

Some other considerations at this point:

  1. Are your ETL’s running frequently enough to support reporting refresh requirements? For example, if your report requires to be refreshed hourly, but your ETLS that load the Datamart only run overnight, you are likely going to need to update the schedule.
  2. Are there any aggregations that should be performed in the ETL rather than in the report itself? If you have a blazing fast Data Warehouse, you should use it!
  3. Do you have all the historic data you need? Does the historic data change? If not, perhaps consider hosting a historic view of the data somewhere for speedier consumption by the reporting tool.  Or, just keeping that historic data in the tools data store itself.  We will discuss incrementals further in the next sections.
  4. Your goal is to get the data into the easiest and most simple form as possible for consumption into the report. Try to avoid using the reporting tool as another layer of ETL, as it wasn’t built for that purpose.

Once you have all the data in the final location you are going to source the report from, you are ready to write the actual report extract. If you have completed all the previous steps, this step should be quick and easy! Your report extract will consist of writing some sort of a view or stored procedure that exposes the data in the exact format the reporting tool requires.  The more you can expose the data to exactly what the report needs, the faster the report will run, and the snappier it will be on the front end.  I always suggest decoupling the extract from the source tables in the Data Warehouse or Data Mart, such that if you have to make changes to the model, it doesn’t affect the report extracts. You can do this easily in the form of a view or a stored procedure.

If you are going to run the extract for just certain period of time, or if the report you are building takes a historic look at the data for all time, you are likely going to want to implement the extract as an incremental procedure.  Elsewise, it will take an extremely long time for the extract to run, and even more time for the data to cross the internet and land in the report.  The easiest way (but not the only way, and there are always exceptions) to set up incrementals is the following:

  1. Make sure you have metadata on the tables in your Data Warehouse or Datamart.  At the least, this should include a ‘created date’ and a ‘last updated date time’.  This is a best practice for data management in general, not just for writing report incrementals.
  2. Pick your ‘main’ table / unique identifier.  What is the main table you want to drive the incrementals? For example, if you have a report that reports on Emergency Room visits, it will likely be your main ‘Visit’ table.
  3. Have the reporting tool pass your extract procedure the MAX last updated date that it has been sent by previous runs.
  4. Your procedure should take the MAX last updated date as a parameter, and only return records that have a ‘last updated date’ > the MAX of the last update.
  5. The report will have to do a merge into whatever data store it uses one the back-end. By merge, I mean if the record exists, update it, if not, insert it. There is SQL syntax for this that can be found here.

If you are getting merge errors in your incrementals, you might want to read my blog post on troubleshooting this issue:

https://datastrides.wordpress.com/2018/09/19/first-post-identifying-duplicates-and-fixing-incremental-merge-errors/

The benefits of incrementals are immense – it will result in much fast processing and snappier report refreshes. Note that if you don’t have the ability to do the merging on the reporting tool end, and you are not dealing with millions of records, you can do the merging into a permanent table on your Warehouse or Mart that will house this data in the format that the report requires.  Then you can have the report pull all of the data. This should only be done with reports that do not use large volumes of data.

As you are running through all of these considerations, be mindful of any changes you are making and how they will affect the reporting tool.  Sometimes the requirement is to not change the report code as much as possible, in which case you should strive to match the old process / extract as closely as possible.  This usually happens when you have a short time frame.  However, if there are performance improvements / bugs you can fix by using the above points, and you have the time, definitely consider them!

Also, ALWAYS include header comments / comment your code! This gives full visibility to what the code does, what changes were made, when, and why they were made.  I use something simple like this, but feel free to whatever fits your aesthetic:

---------------------------------------------------------------
--AUTHOR:  Ryan Kennedy
--CREATED: 02/16/2019
--DESC:    Exposes the data extract for the Very Important Business Report
--UPDT:
--[Ryan K - 2/17/19] Descriptive comment of changes made and why
---------------------------------------------------------------

Once the extract is ready to go, you can move on the Phase 3: Unit testing and QA. (Coming soon)

Your Report 2.0: A Data Engineer’s Guide to Migrating BI Reports from Old to New Systems: Part 1

As more and more companies go to the cloud, build data warehouse solutions, and become more data-centric, we as BI Developers and Data Engineers will find ourselves migrating reports to pull data from new systems, rather than old, deprecated ones.  Take a healthcare example: Say the healthcare company you work for runs their electronic medical records on Epic software, but they recently purchased several other hospitals, some of which use different software.   To report at an enterprise level, they build a data warehouse solution to bring all the data from these systems to one place.  Now, instead of sourcing reports out of all the systems separately, we want to pull all of the data from the data warehouse.

In this series of blog posts, I will walk through the process / some tips and tricks that I use when migrating a report from an old system to a new system. The main phases of this process are:

4_Phases

The first phase of this process should always be requirements gathering.  The better your requirements, the smoother the rest of the process will go.  I always like to start this process with a kickoff session, where I invite several different people who are all stakeholders in the success of this report migration:

  1. Business Users – Who are the business users who are going to be using this report regularly?
  2. Product / Area owners – is there a product owner who specializes in this subject matter? This person is usually an intermediary between the business owner and the BI team.
  3. BI Developer – If you aren’t the BI Developer yourself, you should definitely include whoever is going to be doing the front end of the report.
  4. Source System Data Specialist – This is rare to have, but if you do have someone like this to invite, it will definitely help!

In the kickoff session, I always try to discuss and accomplish the following:

  1. Timeline – roughly what is the requirement from a timeline perspective? How long will it take? Are there any hard deadlines?
  2. Functional explanation – get the business to explain what the report is, why they use it, and what decisions they make based off of it.  The better you can understand the business function behind the report, the easier it will be to understand the data.
  3. What are the different views in the report? Are there multiple tabs? One Tab? Are there aggregations? Have the business walk through the report and explain it.
  4. What is the time period the report shows? Is it just today’s data? Is it a rolling year? Is it monthly? Yearly?
  5. What level is the data at? Is it aggregated at a month level? What is the lowest granularity of the report? For example: In a report about financial transactions, is the report at the item level (eg – each line will represent one item that rolls up to a single ‘order’)? Or is it at the order level?
  6. Parameters and filters: Are the parameters passed into the report? What filters are used in the report? These are key data points you need to make sure are accurate.
  7. Schedule: How often does the report need to refresh? Daily, hourly, weekly? This will determine the schedule of the ETL’s, and if there are dependencies to your ETL’s, you will also have to take these into consideration.

After the kickoff, you will have a pretty solid understanding of the report.  At this point, there are some other considerations I like to explore and bring up if they seem relevant:

  • Is the report definition still valid? For example, if the report was built 15 years ago, it is very likely that some things have majorly changed in the interim.  Business users are often hesitant about changing reports, but it can’t hurt to ask the question.
  • Is all the data in the old report still used? We don’t want to be sending / including anything extraneous. Use this as an opportunity to trim things down.

Next comes documenting all of your findings. By the end of the requirements gathering phase, I always like to have two documents in hand.

The first is a design document.  Essentially, this document should consist of all the main points you gathered in the kickoff session.  Keep in mind that this should be a working document – You can have the business ‘sign it off’ in the beginning, but I always keep a ‘key decisions’ section that I update if we make any changes along the way, with a description of why the changes were made. This will allow you to keep a clear history of how the report changed over time and can always be referred to if you get any questions.

Next is a data contract / mapping document. The data contract should be given to you by the business / product manager.  The contract should contain a list of all the required fields, their data types, the level of the data, and the definitions of the fields. If they have mappings to old systems (tables / columns / etc), these should be provided as well.  If you cannot get a data contract, you may have to revert to reverse engineering the legacy report.  If this is the case, make sure to tack on extra time to your estimates!

Here is an example of what your report mapping could look like:

Mapping_Screenshot

Note: You can always switch the source system to be on the left, and the target system on the right, it is all a matter of preference.  The above is better if you might have multiple source systems going to one target system / report.  But at the end of the day, feel free to pick your poison!

With your data contract in hand, you can begin the crucial exercise of gap analysis, which is essentially mapping the fields in the data contract to the new source system. Do you have all the required fields at the required granularity? Are you missing anything? If so, you will need to work with the business to either find out how to source the data, or move forward without that data. Will you need to write / modify any of the ETL to support this new report?  The gap analysis is important, because if there is a lot of work involved in procuring the required data for the report, the whole process could take longer, and it is vital to set expectations with the business.

By the end of this phase, you should have your documents completed and agreed to by the business / product owner.  Of course things are likely to change, but it is always a good idea to have signed off documentation, so that if things do change, you can explain why.  Once you are set with your signed off documentation, you are ready for phase 2: Building the report!

I have included below some document templates to get started.  Enjoy!

Design Document Template

Mapping_Document_Template