Thursday, September 5, 2024

Architecting Azure

Nineteen hours into a job migrating data from Synapse to an Azure SQL Server, we see: 

Failure happened on 'Source' side. ErrorCode=SqlOperationFailed,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=A database operation failed with the following error: 'A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 - The specified network name is no longer available.)',Source=,''Type=System.Data.SqlClient.SqlException,Message=A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 - The specified network name is no longer available.),Source=.Net SqlClient Data Provider,SqlErrorNumber=64,Class=20,ErrorCode=-2146232060,State=0,Errors=[{Class=20,Number=64,State=0,Message=A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 - The specified network name is no longer available.),},],''Type=System.ComponentModel.Win32Exception,Message=The specified network name is no longer available,Source=,'

Yikes. This is after 226 GB and 165 million rows have been written at an average throughput of 3.3 MB/s. Three copy activities stopped within three seconds of each other, but nothing untoward was found in the AzureDiagnostics and AzureActivity logs. At first I thought the network was suspiciously quiet at the time the copy came to an end, but with the Azure logs and some Pandas code, I found that brief pauses were not that unusual:

Bursty network logs
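
The Pandas code itself is not reproduced here. As a rough stand-in, this is a minimal standard-library Python sketch of the same idea: sort the log timestamps and flag any gap longer than a threshold. The name `find_pauses` and the 30-second default are assumptions for illustration, not taken from the original analysis.

```python
from datetime import datetime, timedelta


def find_pauses(timestamps, threshold=timedelta(seconds=30)):
    """Return (start, end, gap) for consecutive events further apart than threshold.

    `timestamps` is any iterable of datetime objects, e.g. parsed from the
    time column of an exported log. The threshold is an illustrative guess.
    """
    ordered = sorted(timestamps)
    pauses = []
    # Compare each event with the one that follows it.
    for earlier, later in zip(ordered, ordered[1:]):
        gap = later - earlier
        if gap > threshold:
            pauses.append((earlier, later, gap))
    return pauses
```

If a list like this shows pauses scattered throughout the run, a quiet spell just before the failure is probably not significant on its own.
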
Other engineers said they see this intermittently. "Welcome to the world of cloud computing where Transient Faults are bound to happen" [SO]. A cloud solution architect at Microsoft writes that throttling may be "done via blocking the connections or denying the new connections to SQL Azure database engine". Or it could be the network. "In Azure, most of the components are running on the internet, and that internet connection can produce transient faults intermittently." [Azure for Architects]
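
A common way to handle transient faults is to retry with exponential backoff. The sketch below is generic Python, not the migration job's actual configuration. `TransientError`, `retry_with_backoff` and the default delays are hypothetical names and values chosen for illustration.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for faults worth retrying (e.g. a dropped connection)."""


def retry_with_backoff(operation, max_attempts=5, base_delay=1.0,
                       max_delay=60.0, sleep=time.sleep):
    """Call operation(); on TransientError, wait and try again.

    All defaults are illustrative assumptions, not recommended values.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the fault to the caller
            # Exponential backoff capped at max_delay, with full jitter so
            # that parallel copies do not all retry at the same instant.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
```

Retrying only helps if the operation is safe to repeat. Restarting a nineteen-hour copy from scratch is expensive, so this pattern pays off most when the work is split into smaller batches that can be restarted independently.
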

Never assume that a network is reliable, whether it be the cloud or not. I worked in an investment bank where a developer would make a connection to a system and, if it connected, assume the failover system was live (this was a blue/green deployment). At first blush, this was not unreasonable, but networks can be tricksy. As it happened, the network must have hiccuped, and he was connected to the standby system, pumping live data into it.