Monday, November 18, 2019

Mutation in Spark


... is dangerous. This came up on the Spark Gitter channel a few days ago (11/11/19). Somebody's Spark job was throwing OptionalDataException. This exception indicates "the failure of an object read operation due to unread primitive data, or the end of data belonging to a serialized object in the stream".

Now, looking at the JavaDocs for this exception, there is no obvious reason why Spark would throw it only occasionally. The person reporting it (alina_alemzeb_twitter) wrote:
"I don't understand what else could be the culprit. Our Spark job is pretty basic, we're just doing transformations and filtering. The job ran fine for 10 hrs and then crashed with this error and this is all the code there is."
The fact that it ran without problems for ten hours suggested it wasn't a library incompatibility, which was my first thought (perhaps an object being serialized had fields missing between different versions of a library?). It was only on inspecting the code as a whole that clues to the bug's real nature emerged.

Basically, a HashMap that could be accessed from a closure was sent from the driver to the executors, and that same HashMap could momentarily be mutated on the driver. If one were unlucky, a mutation landed at just the moment the closure was being serialized. Java's HashMap.writeObject records the number of mappings and then walks the entries, so a concurrent mutation can produce a stream whose declared count disagrees with the entries actually written; when an executor deserializes the task, the read fails in exactly the way the JavaDoc describes. It's this race that led to the non-deterministic nature of the bug.
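
To make the race concrete, here is a minimal, self-contained sketch. This is not the reporter's actual code; the names (MutationBugSketch, lookup, threshold, the refresher thread) are all hypothetical, and the local master is only there to make the sketch runnable. The shape is the point: a background thread on the driver mutates a HashMap that a Spark closure has captured.

    import org.apache.spark.sql.SparkSession

    object MutationBugSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("mutation-bug")
          .master("local[2]") // just to make the sketch runnable; the reported failure was on a cluster
          .getOrCreate()
        val sc = spark.sparkContext

        // Mutable state on the driver: a lookup table that a background
        // thread keeps refreshing.
        val lookup = new java.util.HashMap[String, Int]()
        lookup.put("threshold", 10)

        val refresher = new Thread(() => {
          while (true) lookup.put("threshold", scala.util.Random.nextInt(100))
        })
        refresher.setDaemon(true)
        refresher.start()

        // The filter closure captures `lookup`, so the map is
        // Java-serialized with every job submitted below. If the
        // refresher thread mutates the map mid-write, the entry count
        // and the entries actually written can disagree, and an executor
        // deserializing the task may see OptionalDataException.
        for (_ <- 1 to 1000) {
          sc.parallelize(1 to 10000)
            .filter(i => i > lookup.get("threshold"))
            .count()
        }
      }
    }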

So, there was no Spark bug, just a bug in the code that calls Spark. The take-away point is that mutation is an evil at the best of times, but mutation in a distributed environment is sheer folly.
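
One way out, assuming the shape of the hypothetical sketch above: snapshot the mutable map into an immutable value before the job is submitted, and ship anything sizeable once as a broadcast variable rather than letting every task closure drag it along.

    import scala.collection.JavaConverters._

    // Take an immutable snapshot on the driver. In real code this copy
    // would itself need to be synchronized with whatever thread mutates
    // `lookup`; the point is that the value the closure captures can no
    // longer change while it is being serialized.
    val snapshot: Map[String, Int] = lookup.asScala.toMap
    val bc = sc.broadcast(snapshot) // shipped once, read-only on executors

    sc.parallelize(1 to 10000)
      .filter(i => i > bc.value.getOrElse("threshold", 0))
      .count()

The immutable snapshot makes the failure impossible by construction rather than merely unlikely: whatever the executors deserialize is exactly what the driver serialized.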
