Reduce- vs Map-side joins
When the actual join logic is performed in the reducers, we call this a reduce-side join. Here, "the mappers take the role of preparing the input data: extracting the key and value from each input record, assigning the key-value pairs to reducer partitions and sorting by key."
"The reduce-side approach has the advantage that you do not need to make any assumptions about the input data." However, it can be quite expensive.
"On the other hand, if you can make certain assumptions about your input data, it is possible to make joins faster by using a so-called map-side join [where] there are no reducers and no sorting. Instead, each mapper simply reads one input file log from the distributed filesystem and writes one output file to the filesystem - that is all." Prior map-reduce actions in the pipeline make this assumption reasonable since "the output from the mapper is always sorted before it is given to the reducer." [1]
Note that "the output of a reduce-side join is partitioned and sorted by the join key, whereas the output of a map-side join is partioned and sorted in the same way as the large input" regardless of whether a partitioned or broadcast (see below) join is used.
Map-side joins
A broadcast hash join is one where the data in one table is small enough to fit in an in-memory hash table, so it can be distributed ("broadcast") to each mapper, where the join then takes place against the large input.
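A sketch of a broadcast hash join, assuming the user table is the small side and fits comfortably in memory on every mapper (field names are made up):

# Each mapper loads the whole small table into an in-memory hash table once...
def load_small_table(user_records):
    return {r['user_id']: r for r in user_records}

# ...then streams through its block of the large input and joins in place.
# There is no shuffle, no sorting and no reducer involved.
def map_broadcast_join(activity_events, user_lookup):
    for event in activity_events:
        user = user_lookup.get(event['user_id'])
        if user is not None:
            yield event['user_id'], {**user, **event}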
Partitioned hash joins build on this idea. Here, both inputs are partitioned on the join key in the same way, so each mapper only has to join one partition of the large input against the corresponding partition of the small one. "This has the advantage that each mapper can load a smaller amount of data into its hash table... Partitioned hash joins are known as bucketed map joins in Hive" [1]
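A sketch of the partitioned variant, assuming both datasets were written with the same number of partitions and the same partitioning of the join key (the hash function and partition count below are only illustrative):

NUM_PARTITIONS = 10

def partition_of(user_id):
    # Both inputs must use the same deterministic partitioning function;
    # Python's built-in hash() is used here purely for illustration.
    return hash(user_id) % NUM_PARTITIONS

# The mapper for partition i loads only partition i of the user table into
# its hash table and joins it against partition i of the activity log.
def map_partitioned_join(user_partition_i, activity_partition_i):
    lookup = {r['user_id']: r for r in user_partition_i}
    for event in activity_partition_i:
        user = lookup.get(event['user_id'])
        if user is not None:
            yield event['user_id'], {**user, **event}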
A map-side merge join is where "a mapper can perform the same merging operation that would normally be done by a reducer: reading both input files incrementally, in order of ascending key and matching records with the same key."
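A sketch of that merging step, assuming both inputs are already partitioned the same way and sorted by user_id (for example because they were written by earlier MapReduce jobs), and that user_id is unique in the user table:

def map_merge_join(sorted_users, sorted_events):
    users = iter(sorted_users)
    events = iter(sorted_events)
    user = next(users, None)
    event = next(events, None)
    while user is not None and event is not None:
        if user['user_id'] < event['user_id']:
            user = next(users, None)        # no events for this user
        elif user['user_id'] > event['user_id']:
            event = next(events, None)      # event with no matching user
        else:
            yield user['user_id'], {**user, **event}
            event = next(events, None)      # keep the same user for further events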
In the following example, taken from [1], we have an activity log (a "click stream") and a table of users.
A sort-merge join is where "the mapper output is sorted by key and the reducers then merge together the sorted lists of records from both sides of the join". [1] The reducer processes all of the records for a particular user ID in a single sweep, keeping only one user record in memory at any one time.
"The MapReduce job can even arrange the records to be sorted such that the reducer always sees the record from the user database first, followed by the activity events in timestamp order - this technique is known as a secondary sort". [1]
Note that with the sort-merge join pattern, all tuples with the same key are sent to the same reducer.
If there are skewed 'hot keys', Pig "first runs a sampling job to determine which keys are hot [then] the mappers send any records relating to a hot key to one of several reducers chosen at random." [1] This is a skewed join. "Hive's skewed join optimization takes an alternative approach. It requires hot keys to be specified explicitly in the table metadata."
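A sketch of the hot-key handling on the mapper side; the hot-key set would come from the sampling job, and for the other input to the join, records relating to a hot key have to be replicated to all reducers handling that key [1] (the partition counts and key name below are made up):

import random

HOT_KEYS = {'celebrity_42'}   # produced by the sampling job in a real run
NUM_REDUCERS = 50
HOT_REPLICAS = 5              # reducers sharing the load for one hot key

def reducers_for_activity(event):
    # Records relating to a hot key go to one of several reducers chosen at random.
    if event['user_id'] in HOT_KEYS:
        return [random.randrange(HOT_REPLICAS)]
    return [hash(event['user_id']) % NUM_REDUCERS]

def reducers_for_user(record):
    # The matching record from the other input is replicated to all of those reducers.
    if record['user_id'] in HOT_KEYS:
        return list(range(HOT_REPLICAS))
    return [hash(record['user_id']) % NUM_REDUCERS]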
Streaming
From Understanding Kafka Consumer Groups at DZone:
"What Is a Consumer Group? ... it turns out that there is a common architecture pattern: a group of application nodes collaborates to consume messages, often scaling out as message volume goes up, and handling the scenario where nodes crash or drop out.
[1] Designing Data-Intensive Applications, Martin Kleppmann