Friday, April 5, 2024

Network Adventures in Azure Databricks

My Azure Databricks cluster could not see one of my Blob containers although it could see others in the same subscription. The error in Databricks looked something like this: 

ExecutionError: An error occurred while calling o380.ls.
: Status code: -1 error code: null error message: java.net.SocketTimeoutException: connect timed outjava.net.SocketTimeoutException: connect timed out
        at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.executeHttpOperation(AbfsRestOperation.java:423)
        at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.completeExecute(AbfsRestOperation.java:274)
        at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.lambda$execute$0(AbfsRestOperation.java:214)
        at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDurationOfInvocation(IOStatisticsBinding.java:464)
...
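
A "connect timed out" points at the network rather than at permissions, and a plain TCP probe from a notebook cell is one way to confirm that. The following is a minimal sketch, not something from my original debugging session. The helper name can_connect and the default port of 443 (HTTPS) are my own assumptions.

```python
import socket


def can_connect(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # OSError covers timeouts, refused connections and DNS failures alike.
        return False
```

Calling it with the storage endpoint, for example can_connect("<storage-account>.dfs.core.windows.net") with your own account name filled in, should return False for the unreachable account and True for the ones that work.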

My first suspicion was that the cluster and the container being in different resource groups might explain this.

Resource Groups
"Resource groups are units of deployment in ARM [Azure Resource Manager]. 
"They are containers grouping multiple resource instances in a security and management boundary. 
"A resource group is uniquely named in a subscription. 
"Resources can be provisioned on different Azure regions and yet belong to the same resource group. 
"Resource groups provide additional services to all the resources within them. Resource groups provide metadata services, such as tagging, which enables the categorization of resources; the policy-based management of resources; RBAC; the protection of resources from accidental deletion or updates; and more... 
"They have a security boundary, and users that don't have access to a resource group cannot access resources contained within it.  Every resource instance needs to be part of a resource group; otherwise, it cannot be deployed." [Azure for Architects]
That last paragraph is interesting because I can access the container I want via the Azure portal, so permissions on the resource group did not seem to be the issue. A friendly sysadmin suggested I was barking up the wrong tree and looked instead at:

Virtual Networks
"A VNet is required to host a virtual machine. It provides a secure communication mechanism between Azure resources so that they can connect to each other. 
"The VNets provide internal IP addresses to the resources, facilitate access and connectivity to other resources (including virtual machines on the same virtual network), route requests, and provide connectivity to other networks. 
"A virtual network is contained within a resource group and is hosted within a region, for example, West Europe. It cannot span multiple regions but can span all datacenters within a region, which means we can span virtual networks across multiple Availability Zones in a region. For connectivity across regions, virtual networks can be connected using VNet-to-VNet connectivity." [Azure for Architects]
Nothing obvious here. Both Databricks and the container were on the same network. However, they weren't on the same subnet.
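
To make "same network, different subnet" concrete, here is a small sketch using Python's standard ipaddress module. The address ranges and IPs are hypothetical, made up for illustration. They are not the real values from my environment.

```python
import ipaddress
from typing import Optional, Sequence


def find_subnet(ip: str, subnets: Sequence[str]) -> Optional[str]:
    """Return the first CIDR in `subnets` that contains `ip`, or None."""
    addr = ipaddress.ip_address(ip)
    for cidr in subnets:
        if addr in ipaddress.ip_network(cidr):
            return cidr
    return None


# Hypothetical address plan: one VNet (10.0.0.0/16) split into two subnets.
SUBNETS = ["10.0.1.0/24", "10.0.2.0/24"]

cluster_subnet = find_subnet("10.0.1.17", SUBNETS)  # hypothetical cluster node IP
storage_subnet = find_subnet("10.0.2.5", SUBNETS)   # hypothetical storage endpoint IP
print(cluster_subnet, storage_subnet, cluster_subnet == storage_subnet)
# prints: 10.0.1.0/24 10.0.2.0/24 False
```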

Network Security Groups
"Subnets provide isolation within a virtual network. They can also provide a security boundary. Network security groups (NSGs) can be associated with subnets, thereby restricting or allowing specific access to IP addresses and ports. Application components with separate security and accessibility requirements should be placed within separate subnets." [Azure for Architects]
And this proved to be the problem. Databricks and the container were on the same virtual network but not the same subnet, and an NSG was blocking communication between the two subnets.
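
NSG rules are evaluated in priority order, lowest number first, and the first rule that matches wins. The sketch below is a deliberately simplified model of that behaviour. It is not the Azure API, and the rule names, priorities and address ranges are all hypothetical. It shows how a deny rule between two subnets can shadow a broader rule that allows traffic within the virtual network.

```python
import ipaddress
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Rule:
    name: str
    priority: int         # lower number is evaluated first
    source: str           # source range as a CIDR
    destination: str      # destination range as a CIDR
    port: Optional[int]   # None means any port
    allow: bool


def evaluate(rules: Iterable[Rule], src_ip: str, dst_ip: str, port: int) -> Optional[Rule]:
    """Return the first matching rule in priority order, or None if nothing matches."""
    src = ipaddress.ip_address(src_ip)
    dst = ipaddress.ip_address(dst_ip)
    for rule in sorted(rules, key=lambda r: r.priority):
        if (src in ipaddress.ip_network(rule.source)
                and dst in ipaddress.ip_network(rule.destination)
                and rule.port in (None, port)):
            return rule
    return None


# Hypothetical rules: a specific deny sits in front of a broad intra-VNet allow.
RULES = [
    Rule("deny-databricks-to-storage", 200, "10.0.1.0/24", "10.0.2.0/24", None, False),
    Rule("allow-vnet", 65000, "10.0.0.0/16", "10.0.0.0/16", None, True),
]

hit = evaluate(RULES, "10.0.1.17", "10.0.2.5", 443)
print(hit.name, "allow" if hit.allow else "deny")
# prints: deny-databricks-to-storage deny
```

Real NSG rules also match on protocol, direction, port ranges and service tags, all of which this model leaves out.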

Note that NSG changes can take a few minutes to propagate, sometimes less and sometimes more. My sysadmin says he has seen it take up to an hour.
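
Rather than re-running the failing cell by hand while waiting, you could poll with a deadline. This is a generic sketch. The name wait_until and its parameters are my own, and the check you pass in is whatever probe you trust.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout_s: float, interval_s: float = 30.0) -> bool:
    """Call `check` repeatedly until it returns True or `timeout_s` seconds have passed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        # Give up if the next attempt would start after the deadline.
        if time.monotonic() + interval_s > deadline:
            return False
        time.sleep(interval_s)
```

Here check is any zero-argument function that returns True once the container is reachable, and a timeout_s of 3600 covers the one-hour worst case mentioned above.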
