Thursday, June 13, 2019

The data science of cyber-squatting


The Apache Metron project has an interesting approach to detecting cyber-squatters that can be found here. Briefly, it generates lots of names that are similar to your domain using a tool called DNS Twist.

They then use Bloom Filters to check whether each domain name coming through in the stream is on this black-list. Bloom Filters cannot say that something definitely is in a collection, only that it might be, but they can say when something definitely is not. The trade-off for this uncertainty is that they consume very little memory. And for this use case that suits us: we're not interested in the domains that are not on the black-list, only in the ones that might be, and the filter dismisses the former cheaply.

Bloom Filters do this by using k hash functions, each of which maps the data to a number between 0 and m - 1. Each of these numbers becomes an index into an m-sized bit map, and we flip the bit at that index to 1. Now some new data comes through and we ask, have we seen this before? Well, we run those same k hash functions, and if any one of them gives us an index whose bit is 0, we know that this datum is definitely not on our black-list. If all k bits are 1, the datum is probably on the list, although it may be a false positive.
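To make the mechanics concrete, here is a minimal sketch of a Bloom filter in Python. It is an illustration only, not Metron's implementation: the class name `BloomFilter`, the method names, and the choice of salted SHA-256 as the k hash functions are all assumptions of mine, and the bit map is stored one byte per bit for readability rather than compactness.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter with m bits and k hash functions (illustrative only)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        # One byte per bit keeps the sketch readable; a real filter packs bits.
        self.bits = bytearray(m)

    def _indices(self, item: str):
        # Assumption: derive k hash functions by salting SHA-256 with a seed.
        for seed in range(self.k):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m  # index in 0..m-1

    def add(self, item: str) -> None:
        # Set the bit at every index the k hash functions produce.
        for i in self._indices(item):
            self.bits[i] = 1

    def might_contain(self, item: str) -> bool:
        # Any 0 bit means "definitely not present"; all 1s means "possibly present".
        return all(self.bits[i] for i in self._indices(item))


blacklist = BloomFilter(m=1024, k=3)
blacklist.add("examp1e.com")                    # hypothetical look-alike domain
print(blacklist.might_contain("examp1e.com"))   # an added item is always reported
```

A `False` from `might_contain` is a guarantee, while a `True` should be treated as "worth a closer look".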

My approach was different. Given the domains we own (over 900 of them), I generate n-grams of characters for each one and create a distribution of their frequencies. For me, n is in {2,3} but your mileage may vary.
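As a rough sketch of what such a distribution might look like in Python (the function name `ngram_distribution` is hypothetical, and two choices here are assumptions rather than anything stated above: the 2-grams and 3-grams are pooled into a single distribution, and the domain is lower-cased first):

```python
from collections import Counter


def ngram_distribution(domain: str, sizes=(2, 3)) -> dict:
    """Relative frequency of each character n-gram in a domain name."""
    text = domain.lower()  # assumption: treat domains as case-insensitive
    counts = Counter(
        text[i:i + n]
        for n in sizes
        for i in range(len(text) - n + 1)
    )
    total = sum(counts.values())
    # Normalise counts so the frequencies sum to 1; empty if the name is too short.
    return {gram: c / total for gram, c in counts.items()} if total else {}


print(ngram_distribution("abc"))
```

For "abc" this yields the three n-grams "ab", "bc" and "abc", each with frequency 1/3.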

The same calculation is performed on each domain as it flies by, and the result is compared (using the Jensen-Shannon divergence) with the 900 or so distributions representing our real estate. A score of 0 means the n-gram distribution is identical to one of ours, so it is almost certainly one of our own domains. A score much greater than 0 means it's very different. But a score close to 0 is worrying.
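The comparison could be sketched as follows. This assumes the score is the plain Jensen-Shannon divergence in base 2 (so it lies between 0 and 1) and that a domain's score is its minimum over all the owned distributions; both are my assumptions, as are the function names `jensen_shannon` and `closest_score`. Some libraries report the square root of this quantity (the Jensen-Shannon distance) instead.

```python
import math


def jensen_shannon(p: dict, q: dict) -> float:
    """Jensen-Shannon divergence (base 2) between two {key: probability} dicts."""
    keys = set(p) | set(q)
    # Mixture distribution: the average of p and q over the union of their keys.
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_mixture(a: dict) -> float:
        # KL(a || m); terms with zero probability contribute nothing.
        return sum(pa * math.log2(pa / m[k]) for k, pa in a.items() if pa > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


def closest_score(candidate: dict, owned: list) -> float:
    """Smallest divergence between a candidate and any owned distribution."""
    return min(jensen_shannon(candidate, dist) for dist in owned)


ours = [{"ab": 0.5, "bc": 0.5}]                       # toy "owned" distribution
print(closest_score({"ab": 0.5, "bc": 0.5}, ours))    # identical distributions
print(closest_score({"xy": 1.0}, ours))               # nothing in common
```

Identical distributions score 0 and completely disjoint ones score 1, so the interesting cases are the small non-zero values in between.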

How close a score must be before you raise alarms depends on your data and your tuning. During testing, the domain with the smallest non-zero score was "www-OUR_DOMAIN" rather than "www.OUR_DOMAIN". This was a known spear-phishing scam, so clearly our algorithm works to some degree. What's more, DNS Twist does not seem to have anticipated this attack.
