Databricks and ADLS Gen 2: Securing Your Data Lake for Internal Users

If you are implementing a modern data platform in Azure, you will most likely want to take advantage of Azure Data Lake Storage Gen 2, which offers reliable, secure, and cost-effective data lake storage. Data lakes are an integral component of the modern data platform, and as data lake technologies become more robust, they may rapidly become the primary data store for enterprise BI, machine learning, and general data storage.

Azure offers a suite of tools to build and utilize the data lake – Azure Synapse, Azure Data Factory, and what we will be discussing here, Azure Databricks, are just a few.

“Azure Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services platform. Designed with the founders of Apache Spark, Databricks is integrated with Azure to provide one-click setup, streamlined workflows, and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.”

Databricks and Spark integrate seamlessly with ADLS Gen 2, allowing your data users to build the ETL that will populate the data lake, use the lake to build dashboards and develop machine learning models, and query it for ad hoc insights. If all of these personas of data users (data engineers, data scientists, and data analysts) are going to be accessing the data lake, you will need a way to secure it, especially if the data lake contains sensitive information.

Please note that this article assumes fundamental knowledge of Databricks and how tables in Databricks work. Here is a quick link to catch up, if you do not: https://docs.databricks.com/data/tables.html

Also note that the following might not touch on all levels of security requirements for the data lake and Databricks within Azure – just the connection between the two. Be sure to consult your local network security architect to make sure the data lake and Databricks are secured within the proper subnet, have access control set up, etc.

The first decision to make is to determine the number of workspaces you want to set up.  Remember, workspaces are free – it is the clusters (compute) underneath that end up costing money.  So it will often make sense to have two workspaces – one for ETL jobs (data engineers), and one for analytics / data science users, who will likely have fewer privileges than an automated ETL job. There are several reasons to keep these workspaces separate:

  • Security requirements differ – it usually makes sense to mount your data lake to the ETL cluster, so your ETL jobs automatically have the permissions they need without setting the connection context at the beginning of every job (see the sketch after this list).  However, that also gives anyone else in the workspace access to the data lake, so caution should be taken.
  • Modularization of CI / CD and maintenance – since the two workspaces have very different functions, it makes sense to logically separate them.
  • Cluster requirements will often differ between analytics workloads and ETL workloads. You can have multiple clusters in a single workspace, but separating them eases maintenance.
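On the mounting point above: a minimal sketch of mounting an ADLS Gen 2 container with a service principal might look like the following. Treat it as a sketch – the storage account, container, tenant ID, secret scope, and secret names are all placeholders you would swap for your own.

# Minimal sketch: mount an ADLS Gen 2 container with a service principal.
# The storage account, container, tenant ID, and secret scope/key names are placeholders.
configs = {
  "fs.azure.account.auth.type": "OAuth",
  "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id": dbutils.secrets.get("etl-scope", "sp-client-id"),
  "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("etl-scope", "sp-client-secret"),
  "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
}

dbutils.fs.mount(
  source = "abfss://datalake@mystorageaccount.dfs.core.windows.net/",
  mount_point = "/mnt/datalake",
  extra_configs = configs)

Keep in mind that anyone who can attach a notebook to a cluster in that workspace can read through the mount point – which is exactly the caution called out above.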

Then of course, you should have different workspaces per environment – dev, test, stage, and prod.  CI / CD will be coming in another blog post, where I will go into more details on this.

So let’s assume we take the two-workspace approach: one for ETL (data engineers usually have full access to the data lake), and one for analytics users / ad hoc querying.  You still have a decision to make: do you enable Table Access Control (TAC), Azure AD credential passthrough, or some combination of both, to control what access users are given?

Table access control allows you to assign permissions on objects in the Hive metastore to groups within Databricks.  For example, let’s say you have created an ‘orders’ table on a set of files that is incrementally updated in the data lake.  This table would normally appear in the ‘Data’ tab in the Databricks workspace.

data tab databricks workspace

However, let’s say there was only one group of users who should be able to see this table.  You could create a group called ‘orders_users’, add all of those users to that group, and then GRANT read permissions to that group. Then, only that group could access the table.  By default, when you enable TAC, only admins and the user who created an object are automatically given access rights to that object.  For all other users, you will need to explicitly grant permissions either to them individually or to their respective groups (always go with groups if you can!).  There is also a limitation here: TAC only works with Python and SQL clusters – so if you are using Scala, this is not an option.
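To make that concrete, the grants can be run from a notebook on a TAC-enabled cluster. This is a rough sketch, not the full permission model – ‘default.orders’ and ‘orders_users’ are placeholder names.

# Illustrative sketch: 'default.orders' and 'orders_users' are placeholder names,
# and this assumes a cluster with table access control enabled.
spark.sql("GRANT SELECT ON TABLE default.orders TO `orders_users`")

# Access can be revoked the same way if the group no longer needs it.
spark.sql("REVOKE SELECT ON TABLE default.orders FROM `orders_users`")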

admin console.PNG

In my experience, TAC will work best if you have a group of analysts who are only querying a set of tables, or perhaps a ‘lake house’, and you want to have a workspace for them to work in where you can manage their permissions separately from the data lake.  These users would normally not need to access the data lake directly, so TAC is sufficient! It also works well when you have many tables defined on data that lives outside of the data lake, since AD passthrough only passes credentials through to the lake.

Read here on how to enable table access control: https://docs.databricks.com/administration-guide/access-control/table-acl.html

Azure AD passthrough allows the Azure Active Directory credential that users log into Databricks with to be passed through to the data lake, where you can also set ACLs.  This is a great option if you are using Azure Active Directory for identity management, because you only have to set ACLs in one place – on the data lake.  Then you don’t need to worry about managing a separate set of permissions in Databricks as well. Also, even with TAC enabled, if a user has the data lake credential (a Databricks secret, etc.) and knows where to look in the data lake, they can still technically get to the data there.  Using ADLS Gen 2 ACLs and AD passthrough closes this ‘loophole’.
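With passthrough enabled on the cluster, users simply read from the lake with an abfss:// path, and their own AAD identity is what gets authorized against the ACLs. A minimal sketch – the storage account, container, and path below are placeholders:

# With AD credential passthrough enabled on the cluster, this read is authorized
# using the signed-in user's own Azure AD identity and the lake's ACLs.
# The storage account, container, and path are placeholders.
df = (spark.read
      .format("parquet")
      .load("abfss://curated@mystorageaccount.dfs.core.windows.net/sales/orders/"))
df.show()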

You can set ACLs at any object level within the data lake. For example, say you had three major data lake zones: raw, refined, and curated.  You can set different permissions for each of these zones, or you could even go into subdirectories of these zones and define permissions there. You can also set permissions at the file system level, if you desire.

data lake acls.PNG


ACLs within ADLS Gen 2 can get quite robust – but there is one thing to look out for.  These permissions will not be inherited by files and sub-directories that have already been created.  That is, if you decide to add a new permission to the raw zone of your data lake, but there is already a lot of data in that directory, the permission will only be inherited by NEW data added after the permission was set up.  You will need to write a script to loop through the directories and ‘manually’ assign the permissions to all the files and folders below.   Thus, it is critical to get your data access personas defined early, so you can avoid this extra step down the line.
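If you do end up needing to back-fill permissions, a rough sketch of that kind of script using the azure-storage-file-datalake Python SDK is below. Treat it as a starting point: the account, container, path, and group object ID are all placeholders, and newer versions of the SDK also offer recursive ACL operations that may save you the loop entirely.

# Rough sketch: back-fill an ACL entry onto everything under an existing directory.
# Assumes the azure-storage-file-datalake and azure-identity packages; the account,
# container, path, and group object ID are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://mystorageaccount.dfs.core.windows.net",
    credential=DefaultAzureCredential())
fs = service.get_file_system_client("datalake")

# Read access for an AAD group on everything under the 'raw' zone.
new_entry = "group:00000000-0000-0000-0000-000000000000:r-x"

for path in fs.get_paths(path="raw", recursive=True):
    client = (fs.get_directory_client(path.name) if path.is_directory
              else fs.get_file_client(path.name))
    # Naive append for illustration only: a real script would merge/dedupe entries
    # and also set default ACLs on directories so NEW children inherit them.
    acl = client.get_access_control()["acl"]
    client.set_access_control(acl=acl + "," + new_entry)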

Additionally, let’s say you have a table ‘FinancialForecasts’ defined in Databricks on the curated zone of the data lake, which only a small group of people should have access to.  If I am an analytics user in the same workspace as the small group that DOES have access to this table, I will be able to ‘see’ the table metadata in the Data tab within the Databricks workspace.  But if I have not been granted permissions on the area of the data lake where the underlying data for this table lives, I will get a permission error when I try to select from the table.  So, while I can see the metadata, I won’t be able to access the data.  Just something to consider.

Finally, you will need a Databricks premium plan to use this feature.  Likely, if you are implementing this at an enterprise level, you will want a premium plan for other features anyways.

Here is where you can enable AD Passthrough when creating a Databricks Cluster:

enable AD passthrough.PNG

Here is an example of what your production environment might look like:

sample workspace

As with all technology problems, there are many ways to accomplish the same goals.  So please take the above with a grain of salt, and think critically on the requirements for your data platform before making any major decisions!

As always – open to any discussion or suggestions, and happy to connect!


Your Report 2.0: A Data Engineer’s Guide to Migrating BI Reports from Old to New Systems: Part 4, Support and Enhancements

In the previous post, we discussed quality assurance and all the steps involved in making sure the report is functioning to the business’s expectations. If you completed phase 3 with rigor, this next phase should be a cinch!  If you haven’t, please make sure to go back and read the previous 3 posts in the series.

  1. Phase 1: https://datastrides.wordpress.com/2019/02/02/your-report-2-0-a-data-engineers-guide-to-migrating-bi-reports-from-old-to-new-systems-part-1/
  2. Phase 2: https://datastrides.wordpress.com/2019/02/23/your-report-2-0-a-data-engineers-guide-to-migrating-bi-reports-from-old-to-new-systems-part-2-building-the-extract/
  3. Phase 3: https://datastrides.wordpress.com/2020/03/26/your-report-2-0-a-data-engineers-guide-to-migrating-bi-reports-from-old-to-new-systems-part-3-qa/

Congrats! The report is live, and you have succeeded in migrating it! (Of course, with the help of your dev-ops engineers ensuring you have a seamless CI/CD process in place – that is for another blog post.)  But before we close, there is one final phase that we cannot neglect: supporting and enhancing the report.

No matter how diligent you are in writing good code, there will always be issues that arise and additional enhancements that business users want.  Thus, it is always a good idea to have a solid process in place for when you inevitably have to fix a bug or make an enhancement.

First we will discuss supporting the report, or in other words, bug resolution.

One scenario in this category is ‘the report is down’.  These are scary words, but make sure to stay calm, and go through the simple troubleshooting steps.

Find out exactly what the issue is.  In the eyes of a business user, ‘the report is down’ could mean anything from the report being unreachable via normal channels to the data not being current. Remember, there are a lot of steps that go into generating the report.  Some things to check:

  1. Can you reach the report?
  2. Is the reporting server down?
  3. Is the report deployed correctly?
  4. Is the extract failing?
  5. Is the ingestion of the extract failing?
  6. Is the ETL that populates the DataMart or Data Warehouse failing? This could account for missing data.

Once you can narrow down the issue, you will have an easier time resolving it.  There are always two steps to take to resolve any problem.  Let’s say the ETL was failing because a table was not deployed properly in your previous release, so the data was not current. Step 1: Hotfix the issue by manually deploying the table and re-triggering the load.  Step 2: Determine the root cause of the issue.  WHY was it not deployed properly?  Until you fix the root issue, the problem could persist.  This may seem obvious, but sometimes finding root issues can be difficult, so it is easy to chalk it up to a fluke.

The above situation is an ad-hoc / hotfix scenario.  However, the goal should be to catch errors or issues before the end users do.  In all of my ETL solutions, I like to have logging / error handling at as many of the potential points of failure as possible:

  1. Source systems are down
  2. ETL into the Data Warehouse
  3. ETL into the Data Mart
  4. Extract into the report
  5. Report itself is down

points of failure2

If any of these steps fail, I want emails automatically sent to me / the on-call team.  That way, as soon as you get an email like this, you can send a note out to the business users immediately notifying them of the issue. This proactive approach will save you a lot of grief from the business and will protect all of your downstream users from wasting time figuring out why things look strange.
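As a simple illustration of the idea – the SMTP host, addresses, and the load function below are placeholders, and most schedulers and orchestration tools have their own alerting hooks you would use instead – an ETL step can be wrapped like this:

# Simple illustration: wrap an ETL step so failures trigger an email before
# anyone notices bad data. The SMTP host, addresses, and load function are placeholders.
import smtplib
from email.message import EmailMessage

def alert_on_call(subject, body):
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = "etl-alerts@example.com"
    msg["To"] = "on-call-team@example.com"
    msg.set_content(body)
    with smtplib.SMTP("smtp.example.com") as server:
        server.send_message(msg)

def run_datamart_load():
    ...  # the actual ETL step goes here

try:
    run_datamart_load()
except Exception as exc:
    alert_on_call("Datamart load failed", f"The Datamart ETL failed: {exc}")
    raise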

The best thing you can do when there are issues with a report is communicate appropriately.  For the most part, people understand that things can go wrong.  But if you pretend things are OK, and people waste their time looking at bad data, the sentiment can turn negative quickly.

The other common type of bug is the data-related issue.  These can happen anywhere in the pipeline, and though it can be tedious, your best bet is to trace the data backwards until you find where the issue occurred.

Once the report is stable, the business users will begin to want additional enhancements to the report.  Even if an enhancement is as simple as adding an additional field, you should NEVER make changes directly in production unless you are absolutely required to. As with all good software development processes, the goal should be to have some sort of source control (git) flow that looks like the following:

  1. Dev – the environment where you make the changes and test them out.
  2. Test – the environment where QA and the end users test and verify the changes.
  3. Stage – production level data in a non-prod environment.
  4. Prod – Once code is thoroughly QA’d and approved it can go to prod.

release cycles

One challenge you might face here is that lower environments might not have the quality / quantity of data that production has. This will directly impact the way your reports look in lower environments.  This is especially prevalent in scenarios where the Warehouse or Data Mart is being developed at the same time as the report.  In this scenario, where data is paramount, I recommend having that fourth environment, Stage, which houses all of the latest production data and is a final staging area for end users to test reports before they go to prod.  This assures that when the report goes to production, what the users saw in the stage environment will look exactly the same in production, because the underlying data is the same.

When discussing enhancements with the business, always have clear requirements and clear priorities, and as I have mentioned in the past, document everything!  Remember the detailed design we created earlier? Any enhancements should be documented thoroughly there before any coding begins. Basically, to add enhancements, go through the first 3 phases again, but stay especially focused on the new changes.

Hopefully, this will ensure smooth sailing as you continue to add new features.

As with any technical solution, there are many ways to achieve the same result.  I hope this series conveyed not exactly how things have to be done, but rather provided a framework to approaching BI development.

Please feel free to comment if you have any additional thoughts or questions!

Your Report 2.0: A Data Engineer’s Guide to Migrating BI Reports from Old to New Systems: Part 3, QA

In the last post, we discussed actually building the data extract to be consumed by the reporting tool of choice.  If you haven’t read the previous posts, you can find them here:

  1. Phase 1: https://datastrides.wordpress.com/2019/02/02/your-report-2-0-a-data-engineers-guide-to-migrating-bi-reports-from-old-to-new-systems-part-1/
  2. Phase 2: https://datastrides.wordpress.com/2019/02/23/your-report-2-0-a-data-engineers-guide-to-migrating-bi-reports-from-old-to-new-systems-part-2-building-the-extract/

The next phase is crucial: unit testing and QA.  If these steps are glossed over, it will result in many iterations of back and forth with business users, which is frustrating for all involved.  A wise data leader once described the data mantra he lives by: “Get it working, get it right, get it fast.”  If the extract is running and producing data, it is ‘working’.  So now it is time to get it ‘right’.

working

Before even running the extract and loading data into the reporting tool, I like to validate the data first by simply looking at it and doing some simple unit tests.  The first step to validating the data is finding a source of truth.  Hopefully, if you are migrating an existing report, you can be provided with a historic version of that report and a data extract to compare to.  If this is a brand new report, or you are sourcing from a brand new system, you might have to work with the business to manually validate your first pass.

Once you have your source of truth, pick one specific record, or if it is an aggregated report, drop to the lowest granularity possible, and just make sure everything looks good.  Pick a specific record by ID, and ask the data some questions: Is the record there?  Do all the fields match your source of truth? Do the datatypes and data formats match up (are dates in the same format)?  If you are looking at an aggregate, are your totals at the lowest granularity accurate?  If not, you will have to dive in and take a look at how you are aggregating the data and see if something is missing, a calculation is off, or if your ETL is losing records or data. After you are happy with your first record, extend the checks to other records, and if you have other base ‘types’ of records, try those out as well.

During this time, you might discover bugs in the old report.  I have seen this frequently.  The best thing you can do in this scenario is to write up a detailed ‘report’ of the discrepancy, with 1-3 detailed examples and screenshots, and meet with the business to ‘prove’ your version is actually accurate. This will require stepping through the entire ETL process, from source to extract, so it can be quite time consuming. This happens more often than you would expect, so getting good at explaining these technical discrepancies to the business is an invaluable skill.

Once you have unit tested at the record level, you should unit test within a specific time range.  Some things to check: Do record counts match? Do aggregates across a specific range work?  Basically, this is a good time to look at the data across a larger span, to make sure you have all the data you need.
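If you have extracts from both the old and new versions of the report handy, even a quick pandas comparison can catch a lot of this. A rough sketch, where the file names and the key / measure columns are placeholders:

# Quick sketch: compare an old extract against the new one at a summary level.
# The file names and the 'order_id' / 'total_amount' columns are placeholders.
import pandas as pd

old = pd.read_csv("old_report_extract.csv")
new = pd.read_csv("new_report_extract.csv")

# Record counts and key coverage.
print(len(old), len(new))
print("Missing from new:", set(old["order_id"]) - set(new["order_id"]))
print("Extra in new:", set(new["order_id"]) - set(old["order_id"]))

# Aggregates over the same range should line up as well.
print(old["total_amount"].sum(), new["total_amount"].sum())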

After you have completed your unit testing, and you are pretty confident in the quality of your data extract, it is time to integrate with the reporting tool.  Ideally, this would be the time for a QA person to jump in and test out the report itself. It is always good to have a third party test your work, and remember not to take bugs personally!  They are completely unavoidable.  But if you do have to be your own QA, you will want to spin through the entire report.  Are the charts populating properly? Do all the filters and functions work? If you have historical data, and you can compare an old version of the report to the new version, now is the time to do it! Another piece of integration testing you can do is making sure the extract in the report works properly when you are doing incrementals. So if you have dates as a parameter in your extract, try running it multiple times with overlapping dates.  If you are getting duplicates, there is likely an issue in your source query itself, or in your merge statements.  Read a prior blog post here about identifying issues with incremental merges.

The final step in the process should always be business validation (user acceptance testing).  Once you are confident that the data looks good in the report, have a business user confirm the same.  Have them test it over a period of time (perhaps days), so you can confirm updates are all working properly and that no issues arise over time.  Once the business gives the green light, you are ready to push the report to production, thus beginning phase 4: Support and Enhancements.

tldr;

qa

Your Report 2.0: A Data Engineer’s Guide to Migrating BI Reports from Old to New Systems: Part 2, Building the Extract

Requirements in hand, you are ready to start the fun of actually building the report extract to match the data mapping contract document you produced earlier!

If you haven’t yet, I highly recommend reading part one of this series here: Part 1: Requirements Gathering

We left off completing the gap analysis for the report.  More than likely, this activity resulted in needing to make several changes in your new system to support the report.  For example, if you are sourcing from a newly built Data Warehouse, it is possible that the Data Warehouse does not have all the data that you need.  Thus, there might be enhancements required to the Warehouse before you can source the data.  Or, if you are sourcing from a Datamart that feeds off a Data Warehouse, you might need to modify the ETL that moves data to the Datamart.  However you source your data, the first step before building the extract is to get the data you need loaded to the location you are going to source the report from.

Sample Data Flow

Some other considerations at this point:

  1. Are your ETLs running frequently enough to support the report’s refresh requirements? For example, if your report needs to be refreshed hourly, but the ETLs that load the Datamart only run overnight, you are likely going to need to update the schedule.
  2. Are there any aggregations that should be performed in the ETL rather than in the report itself? If you have a blazing fast Data Warehouse, you should use it!
  3. Do you have all the historic data you need? Does the historic data change? If not, perhaps consider hosting a historic view of the data somewhere for speedier consumption by the reporting tool, or just keeping that historic data in the tool’s data store itself.  We will discuss incrementals further in the next sections.
  4. Your goal is to get the data into the simplest form possible for consumption by the report. Try to avoid using the reporting tool as another layer of ETL, as it wasn’t built for that purpose.

Once you have all the data in the final location you are going to source the report from, you are ready to write the actual report extract. If you have completed all the previous steps, this step should be quick and easy! Your report extract will consist of writing some sort of view or stored procedure that exposes the data in the exact format the reporting tool requires.  The closer the exposed data is to exactly what the report needs, the faster the report will run, and the snappier it will be on the front end.  I always suggest decoupling the extract from the source tables in the Data Warehouse or Data Mart, so that if you have to make changes to the model, it doesn’t affect the report extracts. You can do this easily in the form of a view or a stored procedure.

Unless the extract only ever needs to cover a short, recent period of time – and especially if the report you are building takes a historic look at the data for all time – you are likely going to want to implement the extract as an incremental procedure.  Otherwise, it will take an extremely long time for the extract to run, and even more time for the data to cross the internet and land in the report.  The easiest way (but not the only way, and there are always exceptions) to set up incrementals is the following (a rough sketch of the pattern follows the list):

  1. Make sure you have metadata on the tables in your Data Warehouse or Datamart.  At the least, this should include a ‘created date’ and a ‘last updated date time’.  This is a best practice for data management in general, not just for writing report incrementals.
  2. Pick your ‘main’ table / unique identifier.  What is the main table you want to drive the incrementals? For example, if you have a report that reports on Emergency Room visits, it will likely be your main ‘Visit’ table.
  3. Have the reporting tool pass your extract procedure the MAX last updated date that it has been sent by previous runs.
  4. Your procedure should take the MAX last updated date as a parameter, and only return records that have a ‘last updated date’ > the MAX of the last update.
  5. The report will have to do a merge into whatever data store it uses on the back-end. By merge, I mean if the record exists, update it; if not, insert it. There is SQL syntax for this that can be found here.
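In practice the extract is usually a view or stored procedure and the merge happens in the reporting tool or database, but here is a rough conceptual sketch of the incremental-plus-merge pattern in Python, with placeholder file and column names:

# Conceptual sketch of the incremental + merge pattern (not the actual stored
# procedure): pull only rows updated since the last run, then upsert them into
# the data the report already holds. File and column names are placeholders.
import pandas as pd

existing = pd.read_csv("report_data.csv", parse_dates=["last_updated"])
max_last_updated = existing["last_updated"].max()

# In the real solution this filter lives in the extract procedure, which takes
# the max last-updated date as a parameter.
source = pd.read_csv("warehouse_visits.csv", parse_dates=["last_updated"])
incremental = source[source["last_updated"] > max_last_updated]

# Merge: update the record if it exists, insert it if it does not.
merged = (pd.concat([existing, incremental])
            .sort_values("last_updated")
            .drop_duplicates(subset="visit_id", keep="last"))

merged.to_csv("report_data.csv", index=False)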

If you are getting merge errors in your incrementals, you might want to read my blog post on troubleshooting this issue:

https://datastrides.wordpress.com/2018/09/19/first-post-identifying-duplicates-and-fixing-incremental-merge-errors/

The benefits of incrementals are immense – they will result in much faster processing and snappier report refreshes. Note that if you don’t have the ability to do the merging on the reporting tool end, and you are not dealing with millions of records, you can do the merging into a permanent table on your Warehouse or Mart that houses this data in the format the report requires.  Then you can have the report pull all of the data. This should only be done with reports that do not use large volumes of data.

As you are running through all of these considerations, be mindful of any changes you are making and how they will affect the reporting tool.  Sometimes the requirement is to change the report code as little as possible, in which case you should strive to match the old process / extract as closely as possible.  This usually happens when you have a short time frame.  However, if there are performance improvements or bugs you can fix by using the above points, and you have the time, definitely consider them!

Also, ALWAYS include header comments / comment your code! This gives full visibility into what the code does, what changes were made, when, and why they were made.  I use something simple like this, but feel free to use whatever fits your aesthetic:

---------------------------------------------------------------
--AUTHOR:  Ryan Kennedy
--CREATED: 02/16/2019
--DESC:    Exposes the data extract for the Very Important Business Report
--UPDT:
--[Ryan K - 2/17/19] Descriptive comment of changes made and why
---------------------------------------------------------------

Once the extract is ready to go, you can move on to Phase 3: Unit testing and QA. (Coming soon)

Your Report 2.0: A Data Engineer’s Guide to Migrating BI Reports from Old to New Systems: Part 1

As more and more companies go to the cloud, build data warehouse solutions, and become more data-centric, we as BI Developers and Data Engineers will find ourselves migrating reports to pull data from new systems, rather than old, deprecated ones.  Take a healthcare example: Say the healthcare company you work for runs their electronic medical records on Epic software, but they recently purchased several other hospitals, some of which use different software.   To report at an enterprise level, they build a data warehouse solution to bring all the data from these systems to one place.  Now, instead of sourcing reports out of all the systems separately, we want to pull all of the data from the data warehouse.

In this series of blog posts, I will walk through the process / some tips and tricks that I use when migrating a report from an old system to a new system. The main phases of this process are:

4_Phases

The first phase of this process should always be requirements gathering.  The better your requirements, the smoother the rest of the process will go.  I always like to start this process with a kickoff session, where I invite several different people who are all stakeholders in the success of this report migration:

  1. Business Users – Who are the business users who are going to be using this report regularly?
  2. Product / Area owners – is there a product owner who specializes in this subject matter? This person is usually an intermediary between the business owner and the BI team.
  3. BI Developer – If you aren’t the BI Developer yourself, you should definitely include whoever is going to be doing the front end of the report.
  4. Source System Data Specialist – This is rare to have, but if you do have someone like this to invite, it will definitely help!

In the kickoff session, I always try to discuss and accomplish the following:

  1. Timeline – roughly what is the requirement from a timeline perspective? How long will it take? Are there any hard deadlines?
  2. Functional explanation – get the business to explain what the report is, why they use it, and what decisions they make based off of it.  The better you can understand the business function behind the report, the easier it will be to understand the data.
  3. What are the different views in the report? Are there multiple tabs? One Tab? Are there aggregations? Have the business walk through the report and explain it.
  4. What is the time period the report shows? Is it just today’s data? Is it a rolling year? Is it monthly? Yearly?
  5. What level is the data at? Is it aggregated at a month level? What is the lowest granularity of the report? For example: In a report about financial transactions, is the report at the item level (eg – each line will represent one item that rolls up to a single ‘order’)? Or is it at the order level?
  6. Parameters and filters: Are the parameters passed into the report? What filters are used in the report? These are key data points you need to make sure are accurate.
  7. Schedule: How often does the report need to refresh? Daily, hourly, weekly? This will determine the schedule of the ETLs, and if there are dependencies on your ETLs, you will also have to take these into consideration.

After the kickoff, you will have a pretty solid understanding of the report.  At this point, there are some other considerations I like to explore and bring up if they seem relevant:

  • Is the report definition still valid? For example, if the report was built 15 years ago, it is very likely that some things have majorly changed in the interim.  Business users are often hesitant about changing reports, but it can’t hurt to ask the question.
  • Is all the data in the old report still used? We don’t want to be sending / including anything extraneous. Use this as an opportunity to trim things down.

Next comes documenting all of your findings. By the end of the requirements gathering phase, I always like to have two documents in hand.

The first is a design document.  Essentially, this document should consist of all the main points you gathered in the kickoff session.  Keep in mind that this should be a working document – You can have the business ‘sign it off’ in the beginning, but I always keep a ‘key decisions’ section that I update if we make any changes along the way, with a description of why the changes were made. This will allow you to keep a clear history of how the report changed over time and can always be referred to if you get any questions.

Next is a data contract / mapping document. The data contract should be given to you by the business / product manager.  The contract should contain a list of all the required fields, their data types, the level of the data, and the definitions of the fields. If they have mappings to old systems (tables / columns / etc), these should be provided as well.  If you cannot get a data contract, you may have to revert to reverse engineering the legacy report.  If this is the case, make sure to tack on extra time to your estimates!

Here is an example of what your report mapping could look like:

Mapping_Screenshot

Note: You can always switch the source system to be on the left, and the target system on the right, it is all a matter of preference.  The above is better if you might have multiple source systems going to one target system / report.  But at the end of the day, feel free to pick your poison!

With your data contract in hand, you can begin the crucial exercise of gap analysis, which is essentially mapping the fields in the data contract to the new source system. Do you have all the required fields at the required granularity? Are you missing anything? If so, you will need to work with the business to either find out how to source the data, or move forward without that data. Will you need to write / modify any of the ETL to support this new report?  The gap analysis is important, because if there is a lot of work involved in procuring the required data for the report, the whole process could take longer, and it is vital to set expectations with the business.

By the end of this phase, you should have your documents completed and agreed to by the business / product owner.  Of course things are likely to change, but it is always a good idea to have signed off documentation, so that if things do change, you can explain why.  Once you are set with your signed off documentation, you are ready for phase 2: Building the report!

I have included below some document templates to get started.  Enjoy!

Design Document Template

Mapping_Document_Template

My Top 5 Features of Azure Data Studio

I was recently at Azure Data Fest at the Boston Microsoft MTC Campus, and one of the tools they highlighted was Azure Data Studio. I took a spin through the tool afterwards and wanted to share what I found to be 5 of the most exciting features in Data Studio.

Before diving into my 5 – I want to preface this by saying that at the moment, Data Studio is not a complete replacement for SSMS.  There are definitely missing features (profiling is in preview, SQL Server Agent is in preview, etc.).  However, over the coming year Microsoft has indicated they will continue to update the tool.  So for now, it will act as an additional tool in your toolbox, rather than a complete replacement for SSMS.

You can download and view the documentation here: https://docs.microsoft.com/en-us/sql/azure-data-studio/what-is?view=sql-server-2017

Alright, let’s get into it!

1. ‘Peek’ at Definition:

peekdefinition

Often when writing queries, you need to see underlying table definitions in order to check out datatypes, see what columns exist, etc.  Normally this involves scrolling through the list of available tables to find the definition, which can be cumbersome in databases with thousands of tables.  However, in Data Studio, you can simply right click and select ‘View Definition’, and a modal pops up showing the table definition.  This is a nifty little feature that I will definitely use!

2. Quick translation of queries to charts

The possibilities with this feature are pretty extensive. There are many instances where I need to quickly get some information to a business user, and have to first query the data and then pop the data into another tool to create a graph.   Now you can do this pretty easily, all at once!  It is also useful when analyzing a dataset for the first time.  You can quickly see how many nulls certain fields have, check out record distribution across certain categorical columns, etc.

3. Built in Source Control

I have HUNDREDS of queries that I use for day-to-day analysis and tasks, but that are not technically part of a code base.  These are all saved on my desktop.  I do regular backups, but these queries are not saved in source control in any way, and I would be pretty devastated if I lost all of them.  With built-in source control, I can easily create a folder where I save all of these queries and quickly check them in right through the UI. As a result, I will never lose these scripts!

Link: https://docs.microsoft.com/en-us/sql/azure-data-studio/source-control?view=sql-server-2017

4. Quick exporting to Json / Excel / Csv

Many times a business user will want an extract based on a query – with one click I can now easily extract this and send it along. Same goes with extracting to JSON, where you can quickly send JSON format data to any users / processes that require JSON formats.

5. Custom Insights and Dashboards

I think this feature has the most potential out of any of the features in Data Studio.  One use case that jumps out is a performance dashboard – where you can see the number of active queries, any deadlocking issues, or the results of any custom query you have, all on one customized Insight Dashboard.  Many of these features are available in the Azure portal, but Data Studio allows for full customization, which is paramount for custom systems.

Another case could be data quality checks – you can write some simple queries that do quality checks on your data, and display any bad records / results in a single place.  This is a quick and easy way to set up monitoring with little runway time.

Link: https://docs.microsoft.com/en-us/sql/azure-data-studio/insight-widgets?view=sql-server-2017

———————-

One thing I hope to see in the future on this platform is SQL queries in a ‘notebook’ form, similar to how Python works in Jupyter notebooks.  I often have to query across servers, and end up with many windows open.  If I could have one notebook with different tiles that point to different servers, that would be an awesome feature for organizing queries and presenting business cases to users.

Web Scraping Ironman Triathlon Results to a CSV using BeautifulSoup and Pandas

Based on the name of my blog (DataStrides), and the contents of this post, you might be able to guess that I am an avid endurance athlete in my spare time.  So I hope to have many crossover posts that bring these two worlds together, because if there is one thing I have learned being a part of the endurance community, it is that they LOVE data.

Note: This post assumes a basic understanding of Python, HTML, and CSS.

So when would you want to use data scraping? Simply put, in any case where you want a data set that does not already exist in a clean and downloadable form, or through an API.

Let’s get started.  The first thing you will want to do is import your packages.  For this script we will use html from lxml, requests, bs4 (BeautifulSoup), and Pandas.

from lxml import html
import requests
from bs4 import BeautifulSoup
import pandas as pd
#Scraper goes to starting URL, gathers all of the URLs for the different years IM results.
#Then, iterates through the pagination per race, and stores in a CSV / Pandas DF.

We will use html and requests to interact with the URLs / HTML code we are scraping from, BeautifulSoup is our scraping tool, and Pandas is used for data manipulation.

Next, we will pull in the starting link and transform the HTML from the site into a format that BeautifulSoup understands:

#Get starting URL in place.
url = 'http://www.ironman.com/triathlon/events/americas/ironman/world-championship/results.aspx?rd=20161008#axzz4rGjY7ruv'
response = requests.get(url)
html = response.content
#Parse the page into a BeautifulSoup object so we can search it below.
soup = BeautifulSoup(html, 'lxml')

Commonly, you will not have all the data you want for your dataset on a single page. You will need to loop through pages. We can do this pretty easily with arrays, for loops, and studying how the URLs are built into the pagination of the website.

Let’s take a look at our starting page. You can see on the right-hand side (highlighted) all of the links to the different pages of race results we want. If we inspect the element (F12, or right click > ‘Inspect Element’), we can see that the links are stored in an unordered list (UL) with an ID of “raceResults”. BeautifulSoup essentially works by using HTML and CSS elements to know where to look on the web page, and then grabbing whatever you want from that structure.

2

#Get all race links.
raceLinks = []
for ul in soup.find_all('ul', {'id': 'raceResults'}):
    for link in ul.find_all('a', href=True):
        raceLinks.append(link['href'])

So essentially we go into the unordered list with an ID of raceResults, find all of the a tags (that store the links), and append the href value into our array.

There are some blank links we grab, and 2002 was giving us some issues, so let’s remove them for now. We can always go get that data manually later:

#get rid of blank links
raceLinks = [x for x in raceLinks if x]
#get rid of 2002 link, as results in bad format and break code.
raceLinksFin = raceLinks[0:14]  

Alright, so our URL list is clean. To add another layer of fun, each result set has N number of pages we will have to paginate through:

3

How will we get all of these links? You guessed it, more loops! So our outer loop will be to loop through the initial links we gathered, and our inner loop will be to loop through all of the pages. Note that all the code beyond this will be contained in this for loop.

#Loop through races to gather data
mDF = pd.DataFrame()
for race in raceLinksFin:

Then let’s grab the date for the page we are looking at so we can tag the results in our pandas DataFrame later on.  You can do this by examining the URL and seeing where the date is in the string.

 #get part of URL that corresponds to date to identify which year results belong to
    date = race[92:]

Next we will use our packages to grab the number of pages or results for that specific race.

 #Get number of pages
    response = requests.get(race)
    html = response.content
    soup = BeautifulSoup(html, 'lxml')
    numberOfPages = []
    for div in soup.find_all('div', {'class': 'pagination'}):
        for span in div.find_all('span'):
            numberOfPages.append(span.get_text())

Again, we have to clean up our result set to get rid of some unwanted data. You will always want to be printing out what you are scraping during the debugging of the script so you can see what is going on.

 #clean non-numerics from list of gathered data
    cleaned = [ x for x in numberOfPages if x.isdigit() ]
    #convert to int
    ints = [ int(x) for x in cleaned ]
    #get max page numbers
    maxPages = max(ints)

So now we have the number of pages associated with that given year of results. How can we now loop through these to scrape the data? Let’s examine the links for the first few paginations:
http://www.ironman.com/triathlon/events/americas/ironman/world-championship/results.aspx?rd=20161008#axzz4rGjY7ruv
http://www.ironman.com/triathlon/events/americas/ironman/world-championship/results.aspx?p=2&rd=20161008&ps=20#axzz5VEXRJnbj
http://www.ironman.com/triathlon/events/americas/ironman/world-championship/results.aspx?p=3&rd=20161008&ps=20#axzz5VEXRJnbj
http://www.ironman.com/triathlon/events/americas/ironman/world-championship/results.aspx?p=4&rd=20161008&ps=20

We can see that the first URL is different, but for pages 2, 3, and 4, the only thing that changes is the page number in the ‘p=’ parameter. So all we need to do is modify our base URL with the new page number up to the max number of pages, and we have our URLs!

As we saw, the first URL is different, so we need to handle it differently.  Feel free to look at the last screenshot which shows the HTML we are scraping.

#Starting URL, as route is different for page 1
    for div in soup.find_all('div', {'class': 'pagination'}):
        for link in div.find_all('a', href=True):
            firstLink = link['href']
            
            #build link route to loop through paginated pages
            part1 = firstLink[0:91]
            part2 = firstLink[92:]

Alright, so now we have the ‘base’ link that we can loop through by doing some string manipulation on the URL and cutting out the part that references the page number. Now we get to use a super useful Pandas function called read_html, which pulls HTML tables into a list of DataFrame objects. Simply put, it makes it very easy to pull tabular data from the internet! As with BeautifulSoup, we can use the HTML / CSS attributes to specify that we want the table with the id of ‘eventResults’.

#Get data from page 1
    df = pd.DataFrame()
    df_intial = pd.read_html(race, attrs = {'id': 'eventResults'})

4

Then we can append the data to the data frame we initialize:

#append data to dataframe, adding date to identify race year
    df = df.append(df_intial)
    df['Date'] = date
    mDF = mDF.append(df)

Finally, we loop through all the remaining pages we gathered for this specific year and append them to the dataset; once all the data is gathered, we go back to our outer loop and move on to the next year.

#Loop through remaining pages
    df_2 = pd.DataFrame()
    i = 2
    while i <= maxPages: #include the final page of results
        securl = part1 + str(i) + part2  
        print(securl)
        df_temp = pd.read_html(securl, attrs = {'id': 'eventResults'})
        #print(df)
        df_2 = df_2.append(df_temp)
        #print(df)
        i = i + 1
    df_2['Date'] = date
    mDF = mDF.append(df_2)

Once these loops finish, all of our data is in the mDF DataFrame. From here, we can start doing some analysis on the DataFrame, or we can use one quick line of code to write our DataFrame to a CSV.

#write the final DataFrame out to a csv
mDF.to_csv('results.csv')

So that’s that, you have your data! If you are trying to scrape some less dynamic web pages, the Pandas read_html can be a super quick way to scrape some data down.

As always, feel free to reach out with any comments or questions!

Here is the Github link to the code: https://github.com/OnyxEndurance/ironScraper