PySpark uses Py4J to talk to Spark. "Py4J enables Python programs running in a Python interpreter to dynamically access Java objects in a Java Virtual Machine" [Py4J]. The Python class structure mirrors the Scala structure closely but is otherwise a totally separate codebase. Note that the functionality is not necessarily 1-to-1. Parts of Spark only available to the JVM. For instance, Dataset.groupByKey is not available to Python in Spark 2.x, although it is in Spark 3.
PySpark within notebooks
Python integrates with Spark quite nicely in Jupiter and Zeppelin. You just need some code that looks like:
%pyspark
from pandas import DataFrame
dfSpark = sqlContext.sql("select * from MY_TABLE")
df = dfSpark.toPandas()
...
PySpark outside notebooks
Automated testing of Spark via Python is available via PyTest (see this SO answer of some sample code). It fires up a Spark JVM instance locally in the background and stops it when the tests are done. The JAR for this Spark instance will live in your site-packages/pyspark/jars/ directory and the JVM in your PATH environment variable will be used.
Note that it's much easier to do all of this in the JVM using the BigDataTesting library which allows Hadoop, Spark, Hive and Kafka to be tested in a single JVM [disclaimer: I'm the author]. But since Python is being used more and more in data engineering, it appears possible [SO] to do this in via Py4J as well.
Building and deploying
To create a deployable artifact, you can package your code in a .whl ("wheel") file which can be easily deployed to something like DataBricks. I have a simple PySpark project in my GitHub repository that demonstates how to test and package Spark code in Python.
No comments:
Post a Comment