Introduction
Caveat: I've just started learning Apache Beam, specifically on GCP, and these are some notes I've made. They may at this stage be unreliable. I will update this entry as I learn more.
Superficially like Spark
"Side inputs are useful if your ParDo needs to inject additional data when processing each element in the input PCollection, but the additional data needs to be determined at runtime (and not hard-coded). Such values might be determined by the input data, or depend on a different branch of your pipeline." (the Beam docs).
Side inputs are a bit like Spark’s broadcast variables, I'm told.
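As a minimal sketch of the idea (the word-length example and names such as SideInputSketch and maxLengthView are my own illustration, not code from the docs), using the ParDo/DoFn machinery described just below: one branch of the pipeline computes a singleton value, publishes it as a PCollectionView, and a ParDo on the other branch reads it at runtime.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Max;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;
import org.apache.beam.sdk.values.TypeDescriptors;

public class SideInputSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    PCollection<String> words = p.apply(Create.of("beam", "spark", "flink"));

    // One branch of the pipeline computes the maximum word length as a singleton view.
    PCollectionView<Integer> maxLengthView =
        words.apply(MapElements.into(TypeDescriptors.integers()).via((String w) -> w.length()))
             .apply(Max.integersGlobally().asSingletonView());

    // The other branch reads that value at runtime via the ProcessContext.
    words.apply("KeepLongest", ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        if (c.element().length() == c.sideInput(maxLengthView)) {
          c.output(c.element());
        }
      }
    }).withSideInputs(maxLengthView));

    p.run().waitUntilFinish();
  }
}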
The PCollection.apply("name", ParDo.of(new XXXDoFn())) call is the Beam equivalent of Spark sending a function to its executors. The XXXDoFn must be serializable.
A DoFn's processing method is annotated with the @ProcessElement tag, and Beam passes in the objects that method asks for; the method does not need to conform to any particular interface.
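A sketch of what such a DoFn looks like (ExtractLengthFn is my own name): it is just a serializable class extending DoFn whose @ProcessElement method declares the parameters it wants, such as the element and an OutputReceiver.

import org.apache.beam.sdk.transforms.DoFn;

// A plain class extending DoFn; it implements no special interface beyond that.
// Beam inspects the @ProcessElement method and passes in the parameters it declares.
public class ExtractLengthFn extends DoFn<String, Integer> {
  @ProcessElement
  public void processElement(@Element String word, OutputReceiver<Integer> out) {
    out.output(word.length());
  }
}

// Applied with a named transform; the DoFn instance is what gets serialized to the workers:
//   PCollection<Integer> lengths = lines.apply("ExtractLengths", ParDo.of(new ExtractLengthFn()));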
Partitioning
“The elements in a PCollection are processed in bundles. The division of the collection into bundles is arbitrary and selected by the runner. This allows the runner to choose an appropriate middle-ground between persisting results after every element, and having to retry everything if there is a failure.”
“If processing of an element within a bundle fails, the entire bundle fails. The elements in the bundle must be retried (otherwise the entire pipeline fails), although they do not need to be retried with the same bundling.”
(from the Beam documentation).
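The bundle lifecycle shows up in the DoFn API as the @StartBundle and @FinishBundle hooks, which are often used to batch work per bundle. A sketch (BufferingFn and its flush step are my own illustration, not from the docs): because a failed element causes the whole bundle to be retried, the buffer is simply rebuilt on the retry.

import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.transforms.DoFn;

// Buffers elements for the duration of a bundle and writes them out in one go.
// If any element in the bundle fails, the runner retries the whole bundle, so the
// buffer is recreated from scratch in @StartBundle on the retry.
public class BufferingFn extends DoFn<String, Void> {
  private transient List<String> buffer;

  @StartBundle
  public void startBundle() {
    buffer = new ArrayList<>();            // a fresh buffer for every bundle
  }

  @ProcessElement
  public void processElement(@Element String element) {
    buffer.add(element);
  }

  @FinishBundle
  public void finishBundle() {
    // Hypothetical batched write; replace with a real sink client.
    System.out.println("writing " + buffer.size() + " elements");
  }
}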
The Runner
Beam is designed to run on many different platforms and therefore has many different Runners, with differing degrees of capability. These Runners allow your Beam code to run on Spark, Flink etc. Of note is the Direct Runner.
“The Direct Runner executes pipelines on your machine and is designed to validate that pipelines adhere to the Apache Beam model as closely as possible… the Direct Runner performs additional checks” (the Beam Documentation).
This Runner should be fine for running tests. You can also use TestPipeline "inside of tests that can be configured to run locally or against a remote pipeline runner."
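A sketch of such a test (the inline word-length transform is just my stand-in for whatever transform you are really testing): with no runner configured, TestPipeline should fall back to the Direct Runner, assuming it is on the test classpath.

import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.testing.TestPipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.junit.Rule;
import org.junit.Test;

public class WordLengthTest {

  // TestPipeline is used as a JUnit rule; it picks up runner options from the
  // beamTestPipelineOptions system property, or defaults to running locally.
  @Rule public final transient TestPipeline pipeline = TestPipeline.create();

  @Test
  public void computesWordLengths() {
    PCollection<Integer> lengths =
        pipeline.apply(Create.of("beam", "dataflow"))
                .apply(MapElements.into(TypeDescriptors.integers()).via((String w) -> w.length()));

    PAssert.that(lengths).containsInAnyOrder(4, 8);
    pipeline.run().waitUntilFinish();
  }
}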
Dataflow
Dataflow is Google's fully managed implementation of a Beam runner.
“After you write your pipeline, you must create and stage your template file… After you create and stage your template, your next step is to execute the template.” (from the Dataflow documentation).
What sorcery is this?
"Developer typically develop the pipeline in the development environment and execute the pipeline and create a template. Once the template is created, non-developer users can easily execute the jobs by using one of these interfaces including GCP console, gcloud cli, or REST API... We simply run this command to compile and stage the pipeline so that it becomes accessible from several execution interfaces:
mvn compile exec:java \
  -Dexec.mainClass=com.example.MyClass \
  -Dexec.args="--runner=DataflowRunner \
  --project=[YOUR_PROJECT_ID] \
  --stagingLocation=gs://[YOUR_BUCKET_NAME]/staging \
  --templateLocation=gs://[YOUR_BUCKET_NAME]/templates/"
That's according to Imre Nagi, who has mixed views, particularly on using Composer, Google's managed version of Airflow, for managing Dataflow jobs ("especially about its versioning and debugging technique").
He says that one issue is that "you must modify your pipeline code to support runtime parameters". This seems to run contrary to Beam's aim of being an abstraction over all big-data processing platforms.
For instance, you need to use ValueProviders to pass configuration around. But if your "driver" code reads their values (say, in com.example.MyClass's main method), they won't be available at template-creation time; runtime parameters only receive values when the template is actually executed.
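A minimal sketch of what that looks like (TemplateMain, TemplateOptions, getInputFile and ReadInput are my own illustrative names): the runtime parameter is declared as a ValueProvider on the options interface and handed, still unevaluated, to a transform such as TextIO that accepts one.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;

public class TemplateMain {

  // Runtime parameters are declared as ValueProviders on the options interface.
  public interface TemplateOptions extends PipelineOptions {
    @Description("Path of the file to read from")
    ValueProvider<String> getInputFile();
    void setInputFile(ValueProvider<String> value);
  }

  public static void main(String[] args) {
    TemplateOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(TemplateOptions.class);
    Pipeline p = Pipeline.create(options);

    // The ValueProvider is passed through unevaluated; TextIO only resolves it at run time.
    // Calling options.getInputFile().get() here would fail during template creation.
    p.apply("ReadInput", TextIO.read().from(options.getInputFile()));

    p.run();
  }
}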