Reduce- vs Map-side joins
When the actual join logic is performed in the reducers, we call this a reduce-side join. Here, "the mappers take the role of preparing the input data: extracting the key and value from each input record, assigning the key-value pairs to reducer partitions and sorting by key."
"The reduce-side approach has the advantage that you do not need to make any assumptions about the input data." However, it can be quite expensive.
"On the other hand, if you can make certain assumptions about your input data, it is possible to make joins faster by using a so-called map-side join [where] there are no reducers and no sorting. Instead, each mapper simply reads one input file log from the distributed filesystem and writes one output file to the filesystem - that is all." Prior map-reduce actions in the pipeline make this assumption reasonable since "the output from the mapper is always sorted before it is given to the reducer." [1]
Note that "the output of a reduce-side join is partitioned and sorted by the join key, whereas the output of a map-side join is partioned and sorted in the same way as the large input" regardless of whether a partitioned or broadcast (see below) join is used.
Map-side joins
A broadcast hash join is one where the data in one table is small enough to fit in an in-memory hash table, so it can be distributed ("broadcast") to each mapper, where the join then takes place against the large input.
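A sketch of a broadcast hash join, assuming the user table is the small side and fits comfortably in memory on every mapper (field names are made up):

# Each mapper loads the whole small table into an in-memory hash table once...
def load_small_table(user_records):
    return {r['user_id']: r for r in user_records}

# ...then streams through its block of the large input and joins in place.
# There is no shuffle, no sorting and no reducer involved.
def map_broadcast_join(activity_events, user_lookup):
    for event in activity_events:
        user = user_lookup.get(event['user_id'])
        if user is not None:
            yield event['user_id'], {**user, **event}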
Partitioned hash joins build on this idea. Here, both inputs are partitioned on the join key in the same way, so each mapper only has to join one partition of the large input against the corresponding partition of the small one. "This has the advantage that each mapper can load a smaller amount of data into its hash table... Partitioned hash joins are known as bucketed map joins in Hive" [1]
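A sketch of the partitioned variant, assuming both datasets were written with the same number of partitions and the same partitioning of the join key (the hash function and partition count below are only illustrative):

NUM_PARTITIONS = 10

def partition_of(user_id):
    # Both inputs must use the same deterministic partitioning function;
    # Python's built-in hash() is used here purely for illustration.
    return hash(user_id) % NUM_PARTITIONS

# The mapper for partition i loads only partition i of the user table into
# its hash table and joins it against partition i of the activity log.
def map_partitioned_join(user_partition_i, activity_partition_i):
    lookup = {r['user_id']: r for r in user_partition_i}
    for event in activity_partition_i:
        user = lookup.get(event['user_id'])
        if user is not None:
            yield event['user_id'], {**user, **event}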
A map-side merge join is where "a mapper can perform the same merging operation that would normally be done by a reducer: reading both input files incrementally, in order of ascending key and matching records with the same key."
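A sketch of that merging step, assuming both inputs are already partitioned the same way and sorted by user_id (for example because they were written by earlier MapReduce jobs), and that user_id is unique in the user table:

def map_merge_join(sorted_users, sorted_events):
    users = iter(sorted_users)
    events = iter(sorted_events)
    user = next(users, None)
    event = next(events, None)
    while user is not None and event is not None:
        if user['user_id'] < event['user_id']:
            user = next(users, None)        # no events for this user
        elif user['user_id'] > event['user_id']:
            event = next(events, None)      # event with no matching user
        else:
            yield user['user_id'], {**user, **event}
            event = next(events, None)      # keep the same user for further events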
In the following example, taken from [1], we have an activity log (a "click stream") and a table of users.
A sort-merge join is where "the mapper output is sorted by key and the reducers then merge together the sorted lists of records from both sides of the join". [1] The reducer processes all of the records for a particular user ID in a single sweep, keeping only one user record in memory at any one time.
"The MapReduce job can even arrange the records to be sorted such that the reducer always sees the record from the user database first, followed by the activity events in timestamp order - this technique is known as a secondary sort". [1]
Note that with the sort-merge join pattern, all tuples with the same key are sent to the same reducer.
If there are skewed 'hot keys', Pig "first runs a sampling job to determine which keys are hot [then] the mappers send any records relating to a hot key to one of several reducers chosen at random." [1] This is a skewed join. "Hive's skewed join optimization takes an alternative approach. It requires hot keys to be specified explicitly in the table metadata."
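A sketch of the hot-key handling on the mapper side; the hot-key set would come from the sampling job, and for the other input to the join, records relating to a hot key have to be replicated to all reducers handling that key [1] (the partition counts and key name below are made up):

import random

HOT_KEYS = {'celebrity_42'}   # produced by the sampling job in a real run
NUM_REDUCERS = 50
HOT_REPLICAS = 5              # reducers sharing the load for one hot key

def reducers_for_activity(event):
    # Records relating to a hot key go to one of several reducers chosen at random.
    if event['user_id'] in HOT_KEYS:
        return [random.randrange(HOT_REPLICAS)]
    return [hash(event['user_id']) % NUM_REDUCERS]

def reducers_for_user(record):
    # The matching record from the other input is replicated to all of those reducers.
    if record['user_id'] in HOT_KEYS:
        return list(range(HOT_REPLICAS))
    return [hash(record['user_id']) % NUM_REDUCERS]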
Streaming
From Understanding Kafka Consumer Groups at DZone:
"What Is a Consumer Group? ... it turns out that there is a common architecture pattern: a group of application nodes collaborates to consume messages, often scaling out as message volume goes up, and handling the scenario where nodes crash or drop out.
[1] Designing Data-Intensive Applications, Martin Kleppmann