Monday, November 11, 2024

Databricks in Azure

The Databricks plugin for VSCode is actually quite nice. Once installed, you can log in via your IDE:


and sign in to your Databricks workspace using a profile in your ~/.databrickscfg file. For my Windows VM, this is just a host string pointing at my cluster and an auth_type of databricks-cli. Signing in opens a browser and I just authorise my session.
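
A minimal sketch of such a profile (the workspace URL here is hypothetical; use your own):

[DEFAULT]
host      = https://adb-1234567890123456.7.azuredatabricks.net
auth_type = databricks-cli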

Now, the nice thing is that the code I develop locally in VSCode can be run remotely on the cluster by choosing Run File as Workflow.


This uploads all my code to the Databricks cluster, and I can see the job running by clicking the link, which opens my browser to the job's console.

Interestingly, trying to read, via a notebook, a Spark DataFrame that this job had persisted as Parquet led to an obscure error. Apparently, I have to run:

spark.conf.set("spark.databricks.io.cache.enabled", "false")
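
before the read works. For example, with a hypothetical output path:

df = spark.read.parquet("dbfs:/mnt/output/my_table")  # hypothetical path written by the workflow
display(df)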

Databricks as part of the Pipeline

Azure Data Factory is a little like Airflow, but trying to do anything clever with it (e.g., use a CLI) is difficult. We found that having it invoke a Databricks notebook let us do anything more sophisticated.

Note that you'll have to create a Databricks Linked Service to allow your ADF pipeline access to the notebook. These Microsoft docs proved helpful.

If you want the notebook to read from a parameter ADF passes to it, use something like this:

def read_param(key: str) -> str:
    # Register a text widget with an empty default (so the notebook also runs
    # standalone), then read whatever value ADF supplied for it.
    dbutils.widgets.text(key, "")
    return dbutils.widgets.get(key)

where key corresponds to one of the Base Parameters in the ADF Notebook activity. Note that there is a 10k character limit on these arguments [SO].
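
For instance, with a hypothetical Base Parameter named run_date defined on the activity:

run_date = read_param("run_date")
print(f"Processing data for {run_date}")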

If you want the notebook to return something to the ADF pipeline, have it terminate with:

dbutils.notebook.exit(YOUR_VALUE_HERE)

[See this SO.]
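
The returned value is a string, so anything structured needs serialising first; a minimal sketch with a hypothetical payload:

import json

result = {"rows_written": 42, "status": "ok"}  # hypothetical payload
dbutils.notebook.exit(json.dumps(result))

ADF then exposes the string on the activity's output as runOutput, e.g. @activity('My Notebook').output.runOutput (activity name hypothetical).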

Connecting Databricks to a Database

If you want to connect a Databricks notebook to a cloud-based SQL Server instance, you can run:

%pip install adal
%pip install openpyxl

servername = "YOUR_MACHINE_NAME"  # the Azure SQL server name, without the .database.windows.net suffix
jdbcHostname = f"{servername}.database.windows.net"
jdbcDatabase = "YOUR_DB"
driver = "com.microsoft.sqlserver.jdbc.spark"  # the Apache Spark connector for SQL Server
url = f"jdbc:sqlserver://{jdbcHostname};databaseName={jdbcDatabase}"

as a one-off, then:

import adal

# Authenticate interactively via the device-code flow
authority = "https://login.microsoftonline.com/common"
context = adal.AuthenticationContext(authority)
jdbcPort = 1433
jdbcUrl = f"jdbc:sqlserver://{jdbcHostname}:{jdbcPort};database={jdbcDatabase}"

client_id = "YOUR_CLIENT_ID"  # Client ID for Azure Cloud Shell (you do not have to replace this)
resource_app_id_url = "https://database.windows.net/"

code = context.acquire_user_code(resource_app_id_url, client_id)
print(code["message"])  # visit the URL shown and enter the device code to finish signing in
token = context.acquire_token_with_device_code(resource_app_id_url, code, client_id)
access_token = token["accessToken"]

Now you can use Spark to read the DB with the help of this token:

jdbc_db_practice_mapping = (
    spark.read
    .format("com.microsoft.sqlserver.jdbc.spark")
    .option("url", url)
    .option("dbtable", "RightCare_ref.practice_CCG_quarter_lookup")
    .option("accessToken", access_token)
    .option("encrypt", "true")
    .option("hostNameInCertificate", "*.database.windows.net")
    .load()
)
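
The same connector should also handle writes with the same token; a rough sketch, assuming a hypothetical target table:

(
    jdbc_db_practice_mapping.write
    .format("com.microsoft.sqlserver.jdbc.spark")
    .mode("append")
    .option("url", url)
    .option("dbtable", "dbo.my_results")  # hypothetical target table
    .option("accessToken", access_token)
    .option("encrypt", "true")
    .option("hostNameInCertificate", "*.database.windows.net")
    .save()
)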

Et voilà. You're reading from SQL Server.