If you are implementing a modern data platform in Azure, you will most likely want to take advantage of Azure Data Lake Storage Gen 2 (ADLS Gen 2), which offers reliable, secure, and cost-effective data lake storage. Data lakes are an integral component of the modern data platform, and as data lake technologies become more robust, they may rapidly become the main data store for enterprise BI, machine learning, and general data storage.
Azure offers a suite of tools to build and utilize the data lake – Azure Synapse, Azure Data Factory, and what we will be discussing here, Azure Databricks, are just a few.
“Azure Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services platform. Designed with the founders of Apache Spark, Databricks is integrated with Azure to provide one-click setup, streamlined workflows, and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.”
Databricks and Spark integrate seamlessly with ADLS Gen 2, allowing your data users to build the ETL that will populate the data lake, use the lake to build dashboards and develop machine learning models, and query it for ad hoc insights. If all of these personas of data users (data engineers, data scientists, and data analysts) are going to be accessing the data lake, you will need a way to secure it, especially if the data lake contains sensitive information.
Please note that this article assumes fundamental knowledge of Databricks and how tables in Databricks work. If you need to catch up, here is a quick link: https://docs.databricks.com/data/tables.html
Also note that the following does not touch on every security requirement for the data lake and Databricks within Azure – just the connection between the two. Be sure to consult your network security architect to make sure the data lake and Databricks are secured within the proper subnet, have access control set up, etc.
The first decision to make is the number of workspaces you want to set up. Remember, workspaces are free – it is the clusters (compute) underneath that end up costing money. So it will often make sense to have two workspaces – one for ETL jobs (data engineers), and one for analytics / data science users, who will likely have fewer privileges than an automated ETL job. There are several reasons to keep these workspaces separate:
- Security requirements differ – it usually makes sense to mount your data lake in the ETL workspace, so your ETL jobs automatically have the permissions they need without setting up the connection context at the beginning of every job (a sketch of the mount follows this list). However, a mount also gives everyone else in the workspace access to the data lake, so caution should be taken.
- Modularization of CI / CD and maintenance – since the two workspaces have very different functions, it makes sense to logically separate them.
- Cluster requirements will often differ between analytics users and ETL workloads. You can have multiple clusters in a single workspace, but separating them eases maintenance.
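For reference, here is a minimal sketch of what mounting an ADLS Gen 2 container with a service principal looks like in a Python notebook. The secret scope, application ID, tenant ID, storage account, and container names are all placeholders to swap for your own:

```python
# Mount an ADLS Gen 2 container via a service principal (OAuth).
# All names below (<application-id>, <tenant-id>, the secret scope/key,
# storage account, and container) are placeholders for illustration.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret":
        dbutils.secrets.get(scope="etl-scope", key="sp-secret"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/datalake",
    extra_configs=configs,
)
```

Once mounted, every cluster in the workspace can read /mnt/datalake with the service principal's permissions – which is exactly why this pattern belongs in the locked-down ETL workspace rather than the analytics one.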
Then of course, you should have separate workspaces per environment – dev, test, stage, and prod. I will go into more detail on CI / CD in another blog post.
So let's assume we take the two-workspace approach: one for ETL (data engineers usually have full access to the data lake), and one for analytics users / ad hoc querying. You still have a decision to make: do you enable Table Access Control (TAC), Azure Active Directory (AD) credential passthrough, or some combination of both, to secure the access each group is given?
Table access control allows you to assign permissions on objects in the Hive metastore to groups within Databricks. For example, let’s say you have created an ‘orders’ table on a set of files that is incrementally updated in the data lake. This table would normally appear in the ‘Data’ tab in the Databricks workspace.
However, let's say there is only one group of users who should be able to see this table. You could create a group called 'orders_users', add all of those users to it, and then GRANT read permissions to that group. Then, only that group could access the table. By default, when you enable TAC, only admins and the user who created an object are automatically given rights to it. For all other users, you will need to explicitly grant permissions, either to them individually or to their respective groups (always go with groups if you can!). There is also a limitation here: TAC only works with Python and SQL clusters – so if you are using Scala, this is not an option.
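As a quick sketch, the grant itself is a one-liner from a notebook attached to a TAC-enabled cluster (the table and group names follow the example above):

```python
# Grant the orders_users group read access to the orders table.
# Must be run by an admin or the table's owner on a TAC-enabled cluster.
spark.sql("GRANT SELECT ON TABLE orders TO `orders_users`")

# Members of orders_users can now query the table:
spark.sql("SELECT COUNT(*) FROM orders").show()
```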
In my experience, TAC works best if you have a group of analysts who are only querying a set of tables, or perhaps a 'lake house', and you want a workspace for them where you can manage their permissions separately from the data lake. These users would normally not need to access the data lake directly, so TAC is sufficient! It also works well when you have many tables defined on data that lives outside of the data lake, since AD passthrough only passes credentials through to the lake.
Read here on how to enable table access control: https://docs.databricks.com/administration-guide/access-control/table-acl.html
Azure AD Passthrough allows the Active Directory credential that a user logged into Databricks with to be passed through to the data lake, where you can also set ACLs. This is a great option if you are using Azure Active Directory for identity management, because you only have to set permissions in one place – on the data lake. You then don't need to manage a separate set of permissions in Databricks as well. Also, with only TAC enabled, if a user has the data lake credential (a Databricks secret, etc.) and knows where to look in the data lake, they can still technically get to the data there. Using ADLS Gen 2 ACLs with AD Passthrough closes this 'loophole'.
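Once passthrough is enabled on a cluster, data lake access needs no mounts or stored secrets – each read runs as the querying user. A minimal sketch (the storage account, container, and path are placeholders):

```python
# On a cluster with AD credential passthrough enabled, this read
# succeeds or fails based on the *querying user's* ACLs in the lake.
df = spark.read.parquet(
    "abfss://curated@<storage-account>.dfs.core.windows.net/orders/"
)
df.show()
```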
You can set ACLs at any object level within the data lake. For example, say you had three major data lake zones: raw, refined, and curated. You can set different permissions on each of these zones, or you can go into subdirectories of these zones and define permissions there. You can also set permissions at the file system level, if you desire.
ACLs within ADLS Gen 2 can get quite robust – but there is one thing to look out for. These permissions do not apply retroactively to files and sub-directories that already exist. That is, if you add a new permission to the raw zone of your data lake but there is already a lot of data in that directory, the permission will only be inherited by NEW data added after it was set up. You will need to write a script (or use the SDK's recursive ACL helpers, sketched below) to walk the directories and 'manually' assign the permissions to all the files and folders beneath. Thus, it is critical to get your data access personas defined early, so you can avoid this extra step down the line.
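One way to do that back-fill, assuming the azure-storage-file-datalake SDK (v12.4+) and its recursive ACL helpers; the storage account name, container, and group object ID below are placeholders:

```python
# Back-fill an ACL onto everything that already exists under a zone.
# Requires: pip install azure-storage-file-datalake azure-identity
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

# Target the 'raw' zone (file system) at its root directory.
raw_zone = service.get_file_system_client("raw").get_directory_client("/")

# Grant read + execute to an AAD group on every existing file/folder.
# "<group-object-id>" is the Azure AD object ID of the group.
raw_zone.update_access_control_recursive(acl="group:<group-object-id>:r-x")

# Also set it as a *default* ACL so future children inherit it.
raw_zone.update_access_control_recursive(
    acl="default:group:<group-object-id>:r-x"
)
```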
Additionally, let's say you have a table 'FinancialForecasts' defined in Databricks on the curated zone of the data lake, which only a small group of people should have access to. If I am an analytics user in the same workspace as the small group that DOES have access to this table, I will be able to 'see' the table metadata in the Data tab within the Databricks workspace. But if I have not been granted permissions to the area of the data lake where the table's underlying data lives, and I try to select from the table, I will get a permission error. So, while I can see the metadata, I won't be able to access the data. Just something to consider.
Finally, you will need a Databricks premium plan to use this feature. If you are implementing this at an enterprise level, you will likely want a premium plan for other features anyway.
Here is where you can enable AD Passthrough when creating a Databricks Cluster:
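If you provision clusters through the Databricks REST API instead of the UI, the same option is a Spark conf on the cluster spec. A hedged sketch – spark.databricks.passthrough.enabled is the documented setting for standard clusters, while the workspace URL, token, and node/runtime values here are placeholders:

```python
# Sketch: create a passthrough-enabled cluster via the Clusters API.
import requests

payload = {
    "cluster_name": "analytics-passthrough",
    "spark_version": "7.3.x-scala2.12",   # placeholder runtime
    "node_type_id": "Standard_DS3_v2",    # placeholder node type
    "num_workers": 2,
    "spark_conf": {
        # Enables AD credential passthrough on a standard cluster.
        # Note: passthrough on standard clusters is single-user.
        "spark.databricks.passthrough.enabled": "true",
    },
}

resp = requests.post(
    "https://<workspace-url>/api/2.0/clusters/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json=payload,
)
resp.raise_for_status()
```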
Here is an example of what your production environment might look like:
As with all technology problems, there are many ways to accomplish the same goals. So please take the above with a grain of salt, and think critically on the requirements for your data platform before making any major decisions!
As always – open to any discussion or suggestions, and happy to connect!