Looking at ways to solve my problem of agglomerative clusters when trying to de-dupe data, I turned to community detection algorithms.
The Louvain method is a heuristic that finds communities in a network by greedily maximising a quantity called modularity (implementations found here). The original paper ("Fast unfolding of communities in large networks") can be found here. The first equation comes without much of an introduction, so I scuttled off to find where it came from. The natural choice was a brilliant book by the equation's creator: Networks: An Introduction by Mark Newman*.
Newman points out the difficulty in coming up with an algorithm that finds communities. If you want to minimise the number of edges crossing a boundary between two groups, then a naive algorithm will cheerfully put no nodes in one group and all N nodes in the other. So, you might want to tell the algorithm to weight the result by dividing the number of crossing edges by the product of the numbers of nodes in each group. This is heading in the right direction, as a 0-N split makes that product zero and so will be the least favourable. In fact, with a bit of simple calculus, you can find that the product n(N - n) is largest when n = N/2, that is, when the two groups are the same size. This is nice but arbitrary.
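The behaviour of the product of the group sizes is easy to check numerically. Below is a minimal sketch that tabulates n(N - n) for every possible split; the value N = 10 is just an illustrative choice of mine, not something from the book.

```python
# Product of the two group sizes, n1 * (N - n1), for every split of N nodes.
N = 10  # illustrative node count (an assumption, chosen only for the demo)

products = {n1: n1 * (N - n1) for n1 in range(N + 1)}

# The split with the largest product of group sizes.
best_split = max(products, key=products.get)

for n1, p in products.items():
    print(f"{n1:2d} vs {N - n1:2d} nodes -> product {p}")
print("largest product at n1 =", best_split)
```

The calculus says the same thing: the derivative of n(N - n) with respect to n is N - 2n, which is zero at n = N/2.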
So, one metric we could use is the difference between the actual number of edges connecting nodes of the same class and the number we would expect if the edges were placed at random.
The actual number of edges between nodes of the same class is given by:

$$\frac{1}{2} \sum_{ij} A_{ij} \, \delta(c_i, c_j)$$

where ci is the class of node i, δ(ci,cj) is the Kronecker delta (which is simply 1 if ci == cj else 0), A is our adjacency matrix and the ½ comes because we don't want to double-count all the edges in the adjacency matrix.
For the expected number, imagine node i that has degree ki. If the graph has m edges, there are obviously 2m edge ends. Therefore, for any one edge leaving i, the chance that the other end is attached to node j is kj/2m, and over all ki edges leaving i, the expected number of edges between i and j is kikj/2m.
Add them all up, and the expected number is:

$$\frac{1}{2} \sum_{ij} \frac{k_i k_j}{2m} \, \delta(c_i, c_j)$$

where, again, the ½ prevents us from double-counting.
Now, we take the difference and divide by m (since we are interested in fractions, not absolute numbers) and we get:

$$Q = \frac{1}{2m} \sum_{ij} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j)$$

which is the equation in the paper.
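To see the pieces fit together, here is a minimal pure-Python sketch that computes the actual count, the expected count and Q directly from the sums above. The toy graph (two triangles joined by one edge), the `community` labelling and the function name `modularity` are my own illustrative assumptions, and the code assumes a simple undirected, unweighted graph. It is a literal O(N²) transcription of the formula, not how a real Louvain implementation does it.

```python
from itertools import product

# Hypothetical toy graph: two triangles (0,1,2) and (3,4,5) joined by edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
# Hypothetical class assignment c_i: one class per triangle.
community = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}


def modularity(edges, community):
    """Return (actual, expected, Q) for an undirected, unweighted graph."""
    nodes = sorted(community)
    m = len(edges)

    # Adjacency matrix A as a dict of dicts; each edge is stored in both directions.
    A = {i: {j: 0 for j in nodes} for i in nodes}
    for i, j in edges:
        A[i][j] += 1
        A[j][i] += 1

    # Degree k_i is the row sum of A.
    k = {i: sum(A[i].values()) for i in nodes}

    actual = 0.0
    expected = 0.0
    for i, j in product(nodes, repeat=2):
        if community[i] == community[j]:  # the Kronecker delta
            actual += A[i][j]
            expected += k[i] * k[j] / (2 * m)

    # The factor of 1/2 undoes the double counting of each (i, j) pair.
    actual /= 2
    expected /= 2
    return actual, expected, (actual - expected) / m


actual, expected, q = modularity(edges, community)
print(actual, expected, q)

# Putting every node in a single class gives Q = 0.
_, _, q_single = modularity(edges, {n: 0 for n in community})
print(q_single)
```

For this toy graph there are 6 edges inside the classes against roughly 3.5 expected, so Q is 2.5/7, or about 0.357. Lumping everything into one class gives a Q of zero, since the actual and expected counts both equal m.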
* I really can't recommend this book enough. If you've got an undergraduate degree in something mathematically related, you'll find the equations no problem and they are presented in an easy-going manner.