Thursday, February 18, 2021

PySpark for JVM Developers

PySpark uses Py4J to talk to Spark. "Py4J enables Python programs running in a Python interpreter to dynamically access Java objects in a Java Virtual Machine" [Py4J]. The Python class structure closely mirrors the Scala structure but is otherwise a completely separate codebase. Note that the functionality is not necessarily one-to-one; parts of Spark are only available on the JVM. For instance, Dataset.groupByKey is not available to Python in Spark 2.x, although it is in Spark 3.
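To see the bridge in action, here is a minimal sketch (assuming a local PySpark session; the underscore-prefixed attributes are internal implementation details, shown purely for illustration) of how a Python DataFrame wraps a JVM object and how JVM classes can be reached through the Py4J gateway:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()

df = spark.range(5)                  # a Python DataFrame...
print(type(df._jdf))                 # ...backed by a Py4J proxy to the JVM Dataset

# Arbitrary JVM classes are reachable through the gateway:
jvm = spark.sparkContext._jvm
print(jvm.java.lang.System.currentTimeMillis())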

PySpark within notebooks

Python integrates with Spark quite nicely in Jupyter and Zeppelin. You just need some code that looks like this:

%pyspark

from pandas import DataFrame

dfSpark = sqlContext.sql("select * from MY_TABLE")
df = dfSpark.toPandas()
...

and you can manipulate your data in Pandas. But code like this is hard to test.
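One way to mitigate this (an illustrative sketch, not from any particular codebase; the function and column names are made up) is to keep the transformation logic in a plain function that takes and returns DataFrames, leaving only the I/O in the notebook cell:

from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def active_rows(df: DataFrame) -> DataFrame:
    # Keep only rows whose 'status' column equals 'ACTIVE'.
    return df.filter(F.col("status") == "ACTIVE")

# In the notebook cell:
# df = active_rows(sqlContext.table("MY_TABLE")).toPandas()

Such a function can then be exercised in an ordinary unit test, as described below.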

If you want to run some Scala on data that was processed with Python, you can always run something like this in the pyspark cell:

pythonDF.createOrReplaceTempView("TableName")   # registerTempTable in older Spark versions

and this in a %scala cell:

val df = sqlContext.sql("select * from TableName")
 
thus passing the data between the two languages.

PySpark outside notebooks

Automated testing of Spark via Python is available via pytest (see this SO answer for some sample code). The tests fire up a local Spark JVM instance in the background and stop it when they are done. The JARs for this Spark instance live in your site-packages/pyspark/jars/ directory, and the JVM found via your PATH environment variable will be used.
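A hedged sketch of that approach (the fixture and test names are illustrative, not taken from the linked answer): a session-scoped pytest fixture starts a local SparkSession and stops it once all the tests have run:

import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    session = (SparkSession.builder
               .master("local[2]")
               .appName("pyspark-tests")
               .getOrCreate())
    yield session          # tests run against this local Spark instance
    session.stop()         # shut the background JVM down afterwards

def test_row_count(spark):
    df = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "value"])
    assert df.count() == 2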

Note that it's much easier to do all of this on the JVM using the BigDataTesting library, which allows Hadoop, Spark, Hive and Kafka to be tested in a single JVM [disclaimer: I'm the author]. But since Python is being used more and more in data engineering, it appears possible [SO] to do this via Py4J as well.

Building and deploying

To create a deployable artifact, you can package your code in a .whl ("wheel") file, which can be easily deployed to something like Databricks. I have a simple PySpark project in my GitHub repository that demonstrates how to test and package Spark code in Python.
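As a rough sketch (the package name and layout are placeholders, not taken from that repository), a minimal setup.py is enough to produce a wheel with python setup.py bdist_wheel:

from setuptools import setup, find_packages

setup(
    name="my_pyspark_job",                      # hypothetical package name
    version="0.1.0",
    packages=find_packages(exclude=["tests"]),
    # PySpark itself is usually provided by the cluster (e.g. Databricks),
    # so it is often declared only as a test/dev dependency.
)

The resulting .whl under dist/ can then be uploaded and attached to a cluster.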