Thursday, February 15, 2024

Spark and Schemas

I helped somebody on Discord with a tricksy problem. They were using a Python UDF in PySpark and seeing NullPointerExceptions. This suggests a Java problem, as the Python error for the same mistake looks more like "AttributeError: 'NoneType' object has no attribute ...". But why would Python code cause Spark to throw an NPE?
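For comparison, this is what dereferencing a null looks like in Python:

>>> s = None
>>> s.lower()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'lower'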

The problem was that the UDF's returnType was a StructType in which a StructField was declared not nullable.
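It looked something like this (a minimal sketch; the field and function names are invented, but the shape of the problem is the same):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StructField, StructType, StringType

spark = SparkSession.builder.getOrCreate()

# The declared return type promises that charge_type will never be null...
return_schema = StructType([
    StructField("charge_type", StringType(), nullable=False),
])

@udf(returnType=return_schema)
def parse_charge(charge_type):
    # ...but for missing input the returned struct carries a null field,
    # breaking the promise the schema just made to Spark
    if charge_type is None:
        return (None,)
    return (charge_type.lower(),)

df = spark.createDataFrame([("CREDIT",), (None,)], ["charge_type"])
df.select(parse_charge("charge_type")).show()   # NullPointerException on the JVM side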


The charge_type.lower call was a red herring, as they had clearly changed more than one thing while experimenting (always change one thing at a time!).

Note that Spark regards the nullable field as advisory only.
When you define a schema where all columns are declared to not have null values, Spark will not enforce that and will happily let null values into that column. The nullable signal is simply to help Spark SQL optimize for handling that column.
- Spark: The Definitive Guide
And the reason is in the bespoke code Spark generates (the GeneratedClass frames in the stack trace below). If nullable is false, that code does not bother to check the reference for null. But if the reference is null, Spark barfs like so:

Caused by: java.lang.NullPointerException
        at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_2$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
        at org.apache.spark.sql.execution.python.EvalPythonExec.$anonfun$doExecute$11(EvalPythonExec.scala:148)

So the Python code returned without an error of its own, but it caused the JVM code to blow up because the struct it returned contained nulls when the schema said it wouldn't.
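The fix is either to be honest and declare the field nullable, so the generated code checks before writing, or to guarantee the UDF never puts a null in that field. A sketch, continuing with the invented names from above:

# Option 1: declare the field nullable so the generated code checks for nulls
return_schema = StructType([
    StructField("charge_type", StringType(), nullable=True),
])

# Option 2: keep nullable=False but make sure the UDF honours it
@udf(returnType=StructType([StructField("charge_type", StringType(), nullable=False)]))
def parse_charge(charge_type):
    # Never return a null field; fall back to a sentinel value instead
    return ((charge_type or "unknown").lower(),)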
