Agile Java Man: Databricks in an ADF pipeline

Wednesday, January 29, 2025

Databricks in an ADF pipeline

ADF is a nasty but ubiquitous technology in the Azure space. It's low-code and that means a maintenance nightmare. What's more, if you try to do anything remotely clever, you'll quickly hit a brick wall.

Fortunately, you can make calls from ADF to Databricks notebooks where you can write code. In our case, this was to grab the indexing SQL from SQL Server and version control it in Git. At time of writing, there was no Activity in ADF to access Git.

To access Git you need a credential. You don't want to hardcode it into the notebook, as anybody who has access to the notebook can see it. So, you store it as a secret with something like:

echo GIT_CREDENTIAL | databricks secrets put-secret YOUR_SCOPE YOUR_KEY

databricks secrets put-acl YOUR_SCOPE YOUR_USER READ

where YOUR_SCOPE is the namespace in which the secret lives (can be anything modulo some reserved strings); YOUR_KEY is the key for retrieval and YOUR_USER is the user onto whom you wish to bestow access.

Now, your notebook can retrieve this information with:

git_credential = dbutils.secrets.get(scope="YOUR_SCOPE", key="YOUR_KEY")

and although others might see this code, the value is only accessible at runtime and only when YOUR_USER is running it.
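One wrinkle worth guarding against: echo appends a trailing newline, and that may end up stored as part of the secret (presumably why the clone command further down calls .strip()). A minimal sketch of a helper that tidies this up in one place; get_git_credential is a hypothetical name, and dbutils is passed in as an argument only so the function can be exercised outside a notebook:

```python
def get_git_credential(dbutils, scope, key):
    """Fetch a secret and remove surrounding whitespace.

    `dbutils` is the object Databricks provides in a notebook; it is a
    parameter here (an assumption of this sketch) to keep the helper testable.
    """
    value = dbutils.secrets.get(scope=scope, key=key)
    cleaned = value.strip()  # drop the newline that echo may have added
    if not cleaned:
        raise ValueError(f"Secret {scope}/{key} is empty")
    return cleaned
```

In the notebook this would be called as get_git_credential(dbutils, "YOUR_SCOPE", "YOUR_KEY").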

Next, we want to get an argument that is passed to the Databricks notebook from ADF. You can do this in Python with:

dbutils.widgets.text(MY_KEY, "")

parameter = dbutils.widgets.get(MY_KEY)

where MY_KEY is a base parameter in the calling ADF Notebook Activity, for example:

[Image: Passing arguments to a Databricks notebook from ADF]
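If the base parameter is missing or misspelled in ADF, the widget silently falls back to its default of an empty string. A small sketch of a wrapper that fails loudly instead; read_adf_parameter is a hypothetical name, and dbutils is again passed in explicitly so the function can be tried outside Databricks:

```python
def read_adf_parameter(dbutils, name, default=""):
    """Declare a text widget called `name` and return its value.

    Raises if the value is empty, which in this sketch is taken to mean
    that ADF did not supply the base parameter.
    """
    dbutils.widgets.text(name, default)  # declares the widget with a default
    value = dbutils.widgets.get(name)    # value passed from ADF, if any
    if not value:
        raise ValueError(f"No value supplied for parameter {name!r}")
    return value
```

In the notebook: parameter = read_adf_parameter(dbutils, "MY_KEY").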

Finally, you pass a return value back to ADF with:

dbutils.notebook.exit(YOUR_RETURN_VALUE)

and use it back in the ADF workflow by referencing @activity('YOUR_NOTEBOOK_ACTIVITY_NAME').output.runOutput (ADF expressions take single-quoted strings), in this case in an activity that runs the SQL the notebook returned:

[Image: ADF using the value returned from the Databricks notebook]
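dbutils.notebook.exit takes a single string, so returning more than one thing means serialising it. A minimal sketch, assuming you want to hand back the SQL plus a count of statements as JSON; build_exit_value and the property names sql and statement_count are made up for illustration:

```python
import json


def build_exit_value(sql_statements):
    """Pack a list of SQL statements into one JSON string for notebook.exit."""
    return json.dumps({
        "sql": ";\n".join(sql_statements),
        "statement_count": len(sql_statements),
    })


# In the notebook (not runnable outside Databricks):
# dbutils.notebook.exit(build_exit_value(statements))
```

As I understand the ADF documentation, a JSON return value can then be picked apart by appending the property name, e.g. @activity('YOUR_NOTEBOOK_ACTIVITY_NAME').output.runOutput.sql, but check that against your own pipeline.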

That's how ADF talks to Databricks and how Databricks replies. There's still the small matter of how the notebook invokes Git.

This proved unexpectedly complicated. Checking the code out was easy:

import subprocess

print(subprocess.run(["git", "clone", f"https://NHSXEI:{git_credential.strip()}@dev.azure.com/YOUR_GIT_URL/{repo_name}"], capture_output=True))

but running arbitrary shell commands in that directory via Python was complicated, as I could not cd into the new directory. This is because cd is a shell built-in [SO], not an executable. So, instead I used a hack: my Python dynamically generated a shell script, wrote it to a file and executed it from a static piece of shell script. Bit ick but does the job.
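The hack looks roughly like the sketch below. This is a reconstruction of the idea, not the original code: run_in_directory is a hypothetical name, and it runs the generated script with sh via subprocess rather than from a separate static shell cell.

```python
import os
import shlex
import subprocess
import tempfile


def run_in_directory(directory, commands):
    """Generate a shell script that cds into `directory`, then run it.

    `commands` is a list of shell command strings. Returns the
    CompletedProcess so the caller can inspect stdout and stderr.
    """
    lines = ["set -e", f"cd {shlex.quote(directory)}"] + list(commands)
    # Write the dynamically generated script to a temporary file.
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as script:
        script.write("\n".join(lines) + "\n")
        script_path = script.name
    try:
        # cd works here because the whole script runs inside one shell.
        return subprocess.run(["sh", script_path], capture_output=True, text=True)
    finally:
        os.remove(script_path)
```

For example, run_in_directory(repo_name, ["git status"]) would run git status inside the cloned repository. It is also worth knowing that subprocess.run accepts a cwd argument, which may be a simpler route where it works.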
