After spending the last 18 months using Spark to write entity resolution software for over a terabyte of data, here are some miscellaneous notes on what I wish I'd known from the start. In no particular order:
1. Make sure your app can recover from a failure easily. Write to HDFS after each major stage (a minimal sketch of this pattern appears after this list). This will also help with debugging when the answer that comes out of the sausage machine wasn't entirely what you were expecting. ("Simply split long-running jobs into batches and write intermediate results to disk. This way, you have a fresh environment for every batch and don’t have to worry about metadata build-up" from here). Also, the topology for one stage may not be appropriate for another (see here for an example where smaller numbers of executors with more resources - contrary to the general Spark advice - give better performance).
2. Small inefficiencies can cause big problems. Use jstack liberally to see where your executors are actually spending their time.
3. Don't use Spark as a key/value lookup. That's not what it's built for. Use another system. Don't try to hack it by using a broadcast variable as that simply doesn't scale.
4. Use realistic data, both in size and quality. Making fake data is surprisingly hard if you want the output to remotely correspond to the real world.
5. Have an automated test in a realistic environment (you don't want authentication problems to show up late, for example). Run the app daily in this environment to surface any performance changes.
6. "Don't start modeling before designing some measurable goals" [1]. Define acceptance criteria early on as to what you'd expect to come out and what you wouldn't. Estimate the false positive/negative rate. For example, at one point we expected about 220k of our company entities to resolve with Orbis data. Using a very simple query, we were seeing only about 130k of our businesses resolve to something. So the true positive rate could not have been higher than about 60% (130k/220k), and may well have been lower, meaning there was clearly work to do here.
7. Pass small objects around, preferably just IDs. This is what GraphX's built-in Connected Components algorithm does (see the sketch after this list). It will improve performance.
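To make point 1 concrete, here is a minimal sketch of the write-then-read-back pattern, assuming Parquet on HDFS; the helper name, paths and stage functions are illustrative rather than the production code:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Write a stage's output to HDFS and read it straight back, so the next stage
// starts from durable storage rather than a long in-memory lineage. If the job
// dies, re-run from the last path that exists.
def checkpointToHdfs(spark: SparkSession, df: DataFrame, path: String): DataFrame = {
  df.write.mode("overwrite").parquet(path)
  spark.read.parquet(path)
}

// Usage (stage functions and paths are made up for illustration):
// val weighted = checkpointToHdfs(spark, tfIdfStage(raw), "hdfs:///er/stage1_tfidf")
// val scored   = checkpointToHdfs(spark, cosineStage(weighted), "hdfs:///er/stage2_cosine")
```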
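And for point 7, GraphX's connected components shows the shape of it: the algorithm runs over bare vertex IDs, and the heavyweight entity records are only joined back on afterwards. A small sketch with made-up IDs:

```scala
import org.apache.spark.graphx.{Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cc-sketch").getOrCreate()

// Pairs of entity IDs that earlier stages decided should resolve together.
// Only IDs get shuffled around, never the full entity records.
val matchPairs: RDD[(VertexId, VertexId)] = spark.sparkContext.parallelize(Seq(
  (1L, 2L), (2L, 3L), (10L, 11L)
))

// Each vertex ends up labelled with the smallest VertexId in its component.
val components: RDD[(VertexId, VertexId)] =
  Graph.fromEdgeTuples(matchPairs, defaultValue = 0).connectedComponents().vertices

// Only now join the component labels back onto the full records, e.g.
// entities.join(components) where entities: RDD[(VertexId, EntityRecord)].
```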
The stages of my app
There were 6 stages to my application:
1. Build a matrix using TF-IDF to assign weights to words.
2. Calculate the cosine similarities.
3. Execute bespoke business rules.
4. Find connected components.
5. Turn those IDs back into entities.
6. Consolidate the relationships between these resolved entities in a process called Edge Contraction.
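Roughly, stages 1 and 2 look like the sketch below (column names, tokenisation and candidate-pair generation are simplified illustrations rather than the real pipeline). L2-normalising the TF-IDF vectors means the cosine similarity of a candidate pair is just a dot product:

```scala
import org.apache.spark.ml.feature.{CountVectorizer, IDF, Normalizer, Tokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf

// Stage 1: TF-IDF weights over entity names, then L2-normalisation so that
// cosine similarity later reduces to a dot product.
def weight(entities: DataFrame): DataFrame = {
  val tokens = new Tokenizer().setInputCol("name").setOutputCol("words").transform(entities)
  val tf     = new CountVectorizer().setInputCol("words").setOutputCol("tf").fit(tokens).transform(tokens)
  val tfidf  = new IDF().setInputCol("tf").setOutputCol("tfidf").fit(tf).transform(tf)
  new Normalizer().setInputCol("tfidf").setOutputCol("norm").setP(2.0).transform(tfidf)
}

// Stage 2: cosine similarity of two unit-length vectors is their dot product.
val cosine = udf { (a: Vector, b: Vector) =>
  val (sa, sb) = (a.toSparse, b.toSparse)
  val bVals = sb.indices.zip(sb.values).toMap
  sa.indices.zip(sa.values).map { case (i, v) => v * bVals.getOrElse(i, 0.0) }.sum
}

// Candidate pairs, however they are blocked and generated, can then be scored:
// pairs.withColumn("similarity", cosine(col("normA"), col("normB")))
```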
There were some interesting modifications to this basic flow.
1. Feature hashing improved run time performance, but the largest connected component went from 600 to 26 000 entities (BlackRock/Merrill Lynch, who appear to have created a lot of shell companies with similar names).
2. Ignoring all words that appeared in over 1000 documents removed the need for stop words. This was useful since the corpus was multilingual. (One way of doing this is sketched below.)
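Roughly, the two modifications look like this. The column names and sizes are illustrative, and CountVectorizer's maxDF (available from Spark 2.4) is just one way of implementing the document-frequency cut-off:

```scala
import org.apache.spark.ml.feature.{CountVectorizer, HashingTF}

// Modification 1: feature hashing. Swapping a vocabulary-based CountVectorizer
// for HashingTF avoids building a dictionary at all, at the cost of hash collisions.
val hashedTf = new HashingTF()
  .setInputCol("words")
  .setOutputCol("tf")
  .setNumFeatures(1 << 18)   // illustrative size; collisions grow as this shrinks

// Modification 2: drop any term appearing in more than 1000 documents, so very
// common words in any language never make the vocabulary and no stop-word lists
// are needed. With CountVectorizer, a maxDF value >= 1.0 is an absolute count.
val filteredTf = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("tf")
  .setMaxDF(1000)
```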
A note on requirements gathering
One tip for finding which database suits you comes from the late Dr Jim Gray. "Gray's recipe for designing a database for a given discipline is that it must be able to answer the key 20 questions that the scientist wants to ask of it." [2] A real-world example can be found here. The idea is that 20 questions is roughly the minimum number you need to ask before a pattern emerges.
This teased out of the business that we needed more than just a graph database (like Neo4J), which for some reason they had fixated on at one point. We also needed the batch processing that GraphX gave us.
[2] The Fourth Paradigm: Data-Intensive Scientific Discovery, Tony Hey.