Saturday, October 5, 2024

Optimising GPU code

I complained to Juan Fumeron that a benchmark indicated that the GPU was not giving much of a performance improvement. JMH reported the GPU being a moderate 20% faster than the CPU:

tornadojar tornado-benchmarks/target/jmhbenchmarks.jar uk.ac.manchester.tornado.benchmarks.sgemv.JMHSgemV
...
Benchmark              Mode  Cnt         Score         Error  Units
JMHSgemV.sgemVJava     avgt    5  72366270.751 ± 5916807.539  ns/op
JMHSgemV.sgemVTornado  avgt    5  57583087.103 ± 2523449.341  ns/op

(SGEMM is single precision general matrix multiplication. GEMV indicates that we're multiplying a matrix with a vector).

Juan replied that I should try TornadoVM's  --enableProfiler console switch and see where the time was being spent. Sure enough, COPY_IN_TIME was ~28ms, about the same as TOTAL_KERNEL_TIME.

Note that the total kernel time is the time it takes the GPU to perform the computation and the total kernel dispatch time is the time it takes to schedule the kernel (ie, the function being executed). In this case, dispatch time is ~6us - three orders of magnitude smaller than the execution time.

Juan also said that "Matrix Vector is not as compute intensive as other applications", so instead I tried the matrix/matrix multiplication. Here, the GPU shines:

Benchmark              Mode  Cnt           Score         Error  Units
JMHSgemm.sgemmJava     avgt    5  1773297262.188 ± 4115731.439  ns/op
JMHSgemm.sgemmTornado  avgt    5     8478409.506 ±  246919.368  ns/op

That makes the GPU 200 times faster than the CPU. Now COPY_IN_TIME is about 1ms and TOTAL_KERNEL_TIME is about 5.5ms.

Now we're talking. But continuing this optimization rampage, it's worth noting that "It has become tribal knowledge that the particular shapes chosen for matmuls has a surprisingly large effect on their performance." [Horace He] TL;DR; He's article explains how fitting the small memory tiles onto a large matrix can hugely change performance - basically, that in a row-major MxN matrix, N must be a factor of the GPU's cache line for best results.

Changes in Java's memory

In the old days, we'd use sun.misc.Unsafe.allocateMemory to use off-heap memory. This code goes straight to the OS and asks for memory via os::realloc. But using Unsafe is bad practise. Not only is it specific to a particular flavout of JVM, it allows access to raw memory. The latter is "fine" if that memory is off-heap but if you are using it to access a Java object, the garbage collector can change its memory location without warning.

There are several modern alternatives. Since Java 9, java.lang.invoke.VarHandle has been the recommended alternative. It provides the same level of low-level access as Unsafe but with better safety and control over memory visibility. That is, its memory access patterns apparently offer finer grained control - eg, volatile access without enforcing strict instruction ordering. 

It's interesting to note that the high performing interoperability framework, Apache Arrow, does not use VarHandle. It still uses Unsafe as VarHandle has bound checking etc that is slower than raw access. 

Since Java 20, we've had Project Panama's Foreign Function & Memory API (JEP-424) spec (it appears Apache Arrow doesn't use it because it's too new). If we run this code:

MemorySegment memorySegment = Arena.global().allocate(1024 * 1024 * 128, 8);         System.out.println(memorySegment.address());

then look for the address while it's still running in /proc/PID/maps (where PID is the ID of the Java process), we can see that the Linux OS now manages a new area of memory. For instance, when I ran it, the output was 0x7fbaccdbe010 and I can see in the maps pseudo file:

7fbaccdbe000-7fbad4dbf000 rw-p 00000000 00:00 0 

This represents the 128 megs of space plus 4096 bytes (presumably a page for meta data).

Now, since Java and C/C++ are IEEE 754 compliant, and now they can pass native memory to each other, you can transparently pass floating point numbers between code bases and run the C/C++ program in the JVM - no more need for JNI! (Interestingly, note that Python is often IEEE754 compliant but it is not guaranteed to be).

It's interesting to note that the GPU enabled Tornado VM uses the java.lang.foreign package to move data to and from the GPU.

Friday, September 27, 2024

Running Iceberg Catalogs in a test suite


When trying to put some BDDs together here (GitHub) for Iceberg and Spark integration, I hit some snags. I was using the local catalog. If I used the Hive catalog, I was getting this error ("Iceberg does not work with Spark's default hive metastore"). 

The reason is that Derby, which is bundled with the Hive metastore in Spark, "doesn't support the interfaces (or concurrent access) required for Iceberg to work" according to Russell Spitzer. He suggests using the Hadoop catalog instead. 

I didn't fancy setting up a more robust Hive metastore that uses a more "professional" database like Postgres so I  followed Russel's advice to use the Hadoop catalog... and immediately hit this problem that's discussed more in Slack. Basically, CALLing via Spark SQL some procedures throws an IllegalArgumentException stating "Cannot use non-v1 table". This is a Spark v1 table, and nothing to do with Iceberg v1 tables, it seems.

OK, so this got me thinking that if an easy way to use Hive and Hadoop catalogs don't work, let's try a REST catalog. Apache Polaris seemed a good candidate. At first, I was hoping to have it run in-process but because of transitive dependencies, it wasn't possible to have everything running easily in the same JVM.

Hmm, so we need it in a separate process. OK, we can Docker-ise that. After adjusting the fixtures, I noticed that while running an erstwhile passing test for a second time leads to a failure. Looking at the stack traces, it appears that sure, we DROP TABLE IF EXISTS ..., but the underlying files are left behind. Russell Spitzer helped me out (again) by pointing out that a PURGE keyword needs to be added to the SQL.

Does the catalog also need to delete metadata files even if the query engine deletes the data files?
Russell Spitzer
This isn't actually defined anywhere. So at the moment we are kind of in a wild west, the Spark implementation deletes absolutely everything. I believe for Trino it just sends the request to the catalog and it's up to the catalog to decide. So for Spark "no", and it explicitly sends a drop table request to the rest catalog without a purge flag.
Russell says Polaris will delete files with this flag [caveat] although there is debate about making this service asynchronous ("I think polaris could own this service too but I know some folks (like netflix) have delete services already"), and "the option to hook into an external delete service".

So, there is some latitude in how one implements a catalog, it seems. This would explain why some of my tests are failing - some assumptions I made are not guaranteed. Still working on making them pass.

Thursday, September 26, 2024

Azure Automation

It's surprising how many silly little jobs can be automated away for big productivity wins. I wrote some Python code to run on a laptop that plays with Azure cloud. The main takeaway points were:

The best way to get access is by using azure.identity.InteractiveBrowserCredential in the azure-identity package. This opens the browser and prompts the user to login. This can then be used to instantiate a azure.storage.filedatalake.DataLakeServiceClient in azure-storage-file-datalake. With this we can upload files.

Then, to kick off ADF jobs, we can use the same credential and pass it to azure.mgmt.datafactory.DataFactoryManagementClient in azure-mgmt-datafactory. As a nice little aside, you can use webbrowser in the web-browser package that will open your browser to the ADF job by crafting a URL with the job ID so you can monitor its progress.

This all seemed pretty straight forward but accessing an Azure SQL instance was somewhat harder. To access the credential, I referred to this SO answer. Here I used the InteractiveBrowserCredential as before but then I needed:

    token = credential.get_token("https://database.windows.net/.default")
    connection_string = f"""
        Driver={Driver};
        Server={server};
        Database={database};
        Encrypt=yes;TrustServerCertificate=no;Connection Timeout=30;
    """
    token_bytes = token.token.encode("UTF-16-LE")
    token_struct = struct.pack(f'<I{len(token_bytes)}s', len(token_bytes), token_bytes)
    conn = pyodbc.connect(connection_string, attrs_before = { 1256:token_struct });

which is a bit more involved but allowed me to connect.


Sunday, September 22, 2024

GPU Notes

Minecraft is written in Java and uses JOCL to leverage the GPU[Oracle blog, 2020]
The organization of threads in CUDA terms:
  1. Thread: single unit of execution --- each thread has its own memory called registers
  2. Block: group of threads --- all threads in a block has access to a shared memory [Cuda Terminology].
  3. Grid: group of blocks --- all threads in a grid has access to [mutable] global memory and [immutable, global] constant memory.
 [Penny Xu's blog]
Matrix multiplication is best done using tiles. The reason for this is that when you ask for a value, you actually get an array of them. You could throw them away. Or, you could employ a clever optimization (descibed by Horace He here) where we process the surplus and save for later.

TornadoVM

Oracle's Gary Frost calls [YouTube] TornadoVM as "state of the art" when bringing GPU processing to the JVM.

"There are two ways to express parallelism with TornadoVM" [Juan Fumero] Annotations and "via an explicit parallel kernel API" using KernelContext

Interestingly, by using the Tornado --printKernel --printBytecodes --fullDebug flags, you can get Tornado to print out all sorts of goodies like the raw backend code it has created.

OpenCL Terminology

We can use a poker game to convey the idea of what is going on in OpenCL. Imagine a dealer and players who don’t interact with each other but who make requests to the dealer for additional cards.  

"In this analogy, the dealer represents an Open CL host, each player represents a device, the card table represents a context, and each card represents a kernel. Each player’s hand represents a command queue." [1]

"[Nested] Loops like this are common but inefficient. The inefficiency arises because each iteration requires a separate comparison and addition. Comparisons are time-consuming on the best of processors, but they’re especially slow on dedicated number-crunchers like graphic processor units ( GPU s). GPU s excel at performing the same operations over and over again, but they’re not good at making decisions. If a GPU has to check a condition and branch, it may take hundreds of cycles before it can get back to crunching numbers at full speed.

"One fascinating aspect of Open CL is that you don’t have to configure these loops in your kernel. Instead, your kernel only executes code that would lie inside the innermost loop. We call this individual kernel execution a work-item." [1]

"A work-group is a combination of work-items that access the same processing resources... Work-items in a work-group can access the same block of high-speed memory called local memory. Work-items in a work-group can be synchronized using fences and barriers." [1]

"One of the primary advantages of using Open CL is that you can execute applications using thousands and thousands of threads, called work-items." [1]

"Each block of local memory is specific to the work-items in a work-group." [1]

[1] OpenCL in Action

Thursday, September 5, 2024

Architecting Azure

Nineteen hours into a job migrating data from Synapse to an Azure SQL Server, we see: 

Failure happened on 'Source' side. ErrorCode=SqlOperationFailed,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=A database operation failed with the following error: 'A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 - The specified network name is no longer available.)',Source=,''Type=System.Data.SqlClient.SqlException,Message=A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 - The specified network name is no longer available.),Source=.Net SqlClient Data Provider,SqlErrorNumber=64,Class=20,ErrorCode=-2146232060,State=0,Errors=[{Class=20,Number=64,State=0,Message=A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 - The specified network name is no longer available.),},],''Type=System.ComponentModel.Win32Exception,Message=The specified network name is no longer available,Source=,'

Yikes. This is after 226gb and 165 million rows have been written at an average throughput of 3.3MB/s. Three copy activities stopped within three seconds of each other but nothing untoward was found in the AzureDiagnostics and AzureActivity logs. At first I thought the network was suspiciously quiet at the time the copy came to an end but with the Azure logs and this Pandas code here, I found that brief pauses were not that unusual:

Bursty network logs
Other engineers said they see this intermittently. "Welcome to the world of cloud computing where Transient Faults are bound to happen" [SO]. A cloud solution architect at Microsoft writes that throttling may be"done via blocking the connections or denying the new connections to SQL Azure database engine". Or it could be the network. "In Azure, most of the components are running on the internet, and that internet connection can produce transient faults intermittently." [Azure for Architects]

Never assume that a network is reliable, whether it be the cloud or not. I worked in an investment bank where a developer would make a connection to a system and if it was connected, assume the failover system was live (this was a blue/green deployment). At first blush, this was not unreasonable but networks can be tricksy. As it happened, the network must have hiccuped and he was connected to the standby system pumping live data into it. 

Sunday, July 21, 2024

Hive for an Iceberg test suite

I'm currently having trouble changing my catalog implementation from hadoop to hive for my test suite. The tests work fine as they are but they're backed by a local filesystem which is not atomic. For the purposes of testing, this seemed OK but the fact they were not really realistic was nagging me.

Spark can use numerous catalogs. You just need to set them up in SparkConf by giving the config their own namespaces below spark.sql.catalog. Your call to writeTo will use the default namespace unless another is defined.

If you use a Hive catalog, concurrent creation of a table will blow up with:

org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: [TABLE_OR_VIEW_ALREADY_EXISTS] Cannot create table or view `database`.`concurrentwritespec` because it already exists.
Choose a different name, drop or replace the existing object, or add the IF NOT EXISTS clause to tolerate pre-existing objects.
at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$createTable$1(HiveExternalCatalog.scala:255)

Whilte backed by a local file system, the test correctly encounters an error but this time it's:

org.apache.spark.util.Utils - Aborting task
org.apache.iceberg.exceptions.CommitFailedException: Version 1 already exists: /tmp/SparkForTesting5181526160591714739/ConcurrentWriteSpec/metadata/v1.metadata.json

The behaviour for both is entirely expected but the actual Exceptions diff.

Meanwhile, a test that merely appended data blew up (see this test) using hive rather than hadoop failing with:

Cannot write into v1 table: `spark_catalog`.`database`.`spark_file_test_writeto`.

The reason for this currently eludes me but it seems I am not the only one.

Spark Datasource API V1 and V2

Amongst the problem with V1 was "Partitioning and Sorting is not propagated from the data sources, and thus, not used in the Spark optimizer" [Qorela]. V2 gives much greater freedom to developers of external stores, like various cloud solutions. For example Russel Spitzer says:
"Now V2 Predicate function pushdown is allowed (or at least should be coming soon) but for that you must use the datasource functions and not the spark wones" [Slack]

"Best practice for partitions is partition by low cardinality and sort for high cardanality. This may reduce the level of partitions you need" [Slack]
Another key difference between the two is atomicity. "Spark ships with two default Hadoop commit algorithms — version 1, which moves staged task output files to their final locations at the end of the job, and version 2, which moves files as individual job tasks complete...  the v2 Hadoop commit protocol is almost five times faster than v1. This is why in the latest Hadoop release, v2 is the default commit protocol... while v2 is faster, it also leaves behind partial results on job failures, breaking transactionality requirements. In practice, this means that with chained ETL jobs, a job failure — even if retried successfully — could duplicate some of the input data for downstream jobs." [Databricks]

Remediation attempts

Changing the table made no difference to the error:

      spark.sql(
        s"""
          |ALTER TABLE $fqn
          |SET TBLPROPERTIES ('format-version'='2')
      """.stripMargin)

So, I tried creating the table with:

... USING ICEBERG TBLPROPERTIES ('format-version'='2')

only for it to then complain upon table creation:

ERROR | o.a.h.h.metastore.RetryingHMSHandler - MetaException(message:Unable to update transaction database java.sql.SQLSyntaxErrorException: Table/View 'NEXT_LOCK_ID' does not exist.

coming from Hive when it calls Derby. "The Derby catalog doesn't support the interfaces (or concurrent access) required for iceberg to work" [GitHub issue]

So, for the time being at least, it seems that my tests will need to be somewhat unrealistic when it comes to atomicity.