Thursday, October 12, 2023

Dependency hell

In these days of ChatGPT, it's easy to forget that most of the time a developer isn't actually cutting code at all, but debugging it. What follows is my own personal hell: getting Spark and Kafka in Docker containers talking to a driver on the host.

Firstly, I was seeing "No TypeTag available" when my code tried to use the Spark Encoders. This SO answer helped. Basically, my code is Scala 3, and "Encoders.product[classa] is a Scala 2 thing. This method accepts an implicit TypeTag. There are no TypeTags in Scala 3". Yikes. This is probably one reason Spark's upgrade path to Scala 3 is proving difficult. The solution I used was to create an SBT subproject that was entirely Scala 2 and to call Spark from there.
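To make the idea concrete, here's a minimal sketch of that layout (the project names, versions and case class are illustrative, not my actual build):

    // build.sbt: the Scala 3 application depends on a Scala 2.13 module
    // that owns all the Spark-facing code
    lazy val app = (project in file("app"))
      .settings(scalaVersion := "3.3.1")
      .dependsOn(sparkShim)

    lazy val sparkShim = (project in file("spark-shim"))
      .settings(
        scalaVersion := "2.13.12",
        libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.0"
      )

    // spark-shim/src/main/scala/Shim.scala (Scala 2.13): Encoders.product
    // compiles here because Scala 2 still has TypeTags
    import org.apache.spark.sql.{Encoder, Encoders}

    case class Trade(id: Long, price: Double)

    object Shim {
      val tradeEncoder: Encoder[Trade] = Encoders.product[Trade]
    }

The Scala 3 side then calls into Shim rather than touching Encoders directly.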

The next problem was seeing my Spark jobs fail with:

    Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6) (172.30.0.7 executor 0): java.lang.ClassCastException: cannot assign instance of scala.collection.generic.DefaultSerializationProxy to field org.apache.spark.sql.execution.datasources.v2.DataSourceRDDPartition.inputPartitions of type scala.collection.immutable.Seq in instance of org.apache.spark.sql.execution.datasources.v2.DataSourceRDDPartition

This is a contender for the error message with the greatest misdirection: you think it's a serialization problem, but it isn't, at least not directly.

Although other Spark users have reported it, Ryan Blue mentions that it isn't really a Spark issue but a Scala issue.

Anyway, I tried all sorts of things, like changing my JDK (note that the sun.* packages have been removed in later JDKs, so you need to follow the advice in this SO answer). I tried creating an uber jar but was thwarted by duplicated dependencies [SO] and Invalid signature file digest errors, as some JARs were signed [SO], which forced me to strip the signatures out [SO], and even then I fell foul of Kafka's DataSourceRegister file being stripped out [SO].
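For the record, the kind of merge settings that address those last three problems looks something like this (a sketch, assuming the sbt-assembly plugin; tune the cases to your own clashes):

    // build.sbt, with sbt-assembly
    assembly / assemblyMergeStrategy := {
      // concatenate ServiceLoader registrations rather than dropping them,
      // so Kafka's META-INF/services DataSourceRegister entry survives
      case PathList("META-INF", "services", _*) => MergeStrategy.concat
      // discard the rest of META-INF, including the offending signature files
      case PathList("META-INF", _*) => MergeStrategy.discard
      // for duplicated dependencies, arbitrarily keep the first copy
      case _ => MergeStrategy.first
    }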

The first step in the right direction came from here and another SO question, where it's recommended to build the SparkSession with .config("spark.jars", PATHS), where PATHS is a comma-delimited string of the full paths of all the JARs you want to use. Surprisingly, this turned out to include Spark JARs themselves, in my case spark-sql-kafka-0-10_2.13, which oddly does not come as part of the Spark installation. By adding them as spark.jars, they are uploaded into the work subdirectory of a Spark node.
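In code it looks something like this (the paths, versions and master URL are made up for illustration):

    import org.apache.spark.sql.SparkSession

    // every JAR the executors will need, including Spark's own Kafka connector
    val paths = Seq(
      "/home/me/jars/spark-sql-kafka-0-10_2.13-3.5.0.jar",
      "/home/me/jars/kafka-clients-3.4.0.jar",
      "/home/me/jars/my-app-assembly-0.1.0.jar"
    ).mkString(",")

    val spark = SparkSession.builder()
      .master("spark://localhost:7077")  // the dockerised master
      .appName("KafkaToSpark")
      .config("spark.jars", paths)       // uploaded to each worker's work/ dir
      .getOrCreate()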

After this, there were just some minor hostname mapping issues to clear up on both the host and in the container before the whole stack worked without any further errors being puked.
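For completeness, this is the sort of mapping I mean (the hostnames here are hypothetical): on the host, point the names the containers advertise back at localhost, and inside the containers let them resolve the machine running the driver.

    # host's /etc/hosts: resolve the names advertised by the containers
    127.0.0.1   spark-master spark-worker kafka

    # docker-compose.yml: let the containers reach the host-side driver
    services:
      spark-worker:
        extra_hosts:
          - "host.docker.internal:host-gateway"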
