Monday, April 20, 2020

Graph databases and Data Collisions


Back in 2016, we were comparing Neo4J and OrientDB graph databases. I was tasked with exploring OrientDB and engaged the CEO, Luca Garulli, to help me bulk load a huge amount of data (our chat can be found here).

In both cases, there is a problem with inserting edges in that each edge insertion can update two vertices. Let's assume there are no edges connecting the same vertex to itself (there really isn't in our data), so each insertion updates exactly two vertices. Well, if two edges are inserted at the same time there is the possibility of a collision.  This is a version of the famous Birthday Problem.

We were advised by the OrientDB guys to batch our edge insertions. But what's a good batch size? A high number would mean greater throughput but also a higher chance of collision that would ultimately force us to retry.

Let's say there are N vertices that have already been written to the DB and our batch size for writing edges is b. What are the chances of a collision for a given batch of edges?

Firstly, we need to find the expected number of collisions within a batch of size b. These are OK as they won't force us to retry since they are part of the same transaction.

Here, we note a nice trick outlined here.
The expectation of a sum is the sum of the expectations. This theorem holds "always." The random variables you are summing need not be independent.
So, the expected number of collisions, E(X) is

E(X) = E(collisions for first vertex) + E(collisions for second vertex) + ...

The expected value of anything is the sum of all possible values multiplied by their probability. So, the expected number of collisions for each vertex is

E(collision of vertex x) = 0 * p(no collision) +  1 * p(collision)  = p(collision)

Note that the expected value need not be an integer nor indeed sensible. The expected number when throwing a 6-sided die is 3.5.

Given n vertices, this number becomes:

E(X) = bp

where the probability is given in the link above, that is

p(collision) = 1 - (1 - (1/N))b-1

For anything reasonable, this is pretty small. For instance, if there are 1 million vertices and we choose a batch size of 1000, the expected number of collisions within a batch is just less than 1.


No comments:

Post a Comment