Thursday, April 4, 2019

Master of your domain name


Introduction

I've had moderate success using neural nets to distinguish between good domain names and domain names generated by an algorithm, something you often find in malware. However:

  1. Tuning neural nets is hard and requires esoteric knowledge.
  2. They're computationally expensive both to train (OK) and to use (not OK).
  3. The results so far are reasonable at best.

So, using the same raw data I used in my neural net work, I tried to do better.

Jensen-Shannon, bigram population distribution

I used the SMaths library to compute Jensen-Shannon scores, comparing the character distribution of each good and bad domain name against the distribution over all good names, using 1-hot character encoding. Unfortunately, the results are poor:
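The post computed these scores with the SMaths library; a rough plain-Python sketch of the same idea (the alphabet, the toy name lists, and the function names here are my own illustrative assumptions) might look like:

```python
import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-"

def char_distribution(names, alphabet=ALPHABET):
    """Relative frequency of each alphabet character across a list of names."""
    counts = Counter()
    for name in names:
        counts.update(c for c in name if c in alphabet)
    total = sum(counts.values()) or 1
    return [counts[c] / total for c in alphabet]

def js_divergence(p, q):
    """Jensen-Shannon divergence (base-2 logs, so the result lies in [0, 1])."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Baseline distribution from (toy) good names; each candidate name's own
# character distribution is then scored against that baseline.
good = ["google", "amazon", "wikipedia", "github"]
baseline = char_distribution(good)
score = js_divergence(char_distribution(["xkqzvjwq"]), baseline)
```

Because each term of the mixture m is at least half of the corresponding p (or q) term, the logs never see a zero denominator, which is why plain JS sidesteps the zero-probability issue discussed next.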

JS scores for good (green) and bad (red) domain names

Just Shannon Entropy

What do you do when the probability for an n-gram is zero? The common answer is "ignore the zero probabilities, and carry on summation using the same equation" (StackExchange).

But I found that penalizing zeros gave me better power to differentiate the two categories. The penalty value was derived empirically; roughly optimising it with a crude binary search gave a much better spread of the distributions:

Penalized Shannon entropy scores for good (green) and bad (red) domain names
and consequently a great ROC curve:
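One way to realise the zero-penalty idea is sketched below. This is not the post's actual code: the bigram model, the -10.0 penalty value, and the toy corpus are illustrative assumptions (the post found its penalty empirically with a crude binary search).

```python
import math
from collections import Counter

def bigram_model(names):
    """Bigram probabilities estimated from a corpus of good names."""
    counts = Counter()
    for name in names:
        counts.update(name[i:i + 2] for i in range(len(name) - 1))
    total = sum(counts.values())
    return {bg: c / total for bg, c in counts.items()}

def penalized_entropy_score(name, model, penalty=-10.0):
    """Mean log2-probability of a name's bigrams under the good-name model.
    Instead of skipping zero-probability bigrams, each unseen bigram
    contributes `penalty` (an assumed value, for illustration only)."""
    bigrams = [name[i:i + 2] for i in range(len(name) - 1)]
    if not bigrams:
        return penalty
    total = sum(math.log2(model[bg]) if bg in model else penalty
                for bg in bigrams)
    return total / len(bigrams)

model = bigram_model(["google", "goodness", "goose", "moodle"])
familiar = penalized_entropy_score("goodle", model)   # all bigrams seen
gibberish = penalized_entropy_score("xqzvjw", model)  # no bigrams seen
```

Under this scoring, a name built from familiar bigrams stays close to the corpus's average log-probability, while one full of unseen bigrams is dragged down toward the penalty, which is what spreads the two distributions apart.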

The Data

There is, however, a sting in the tail. Some of the good domain names appear to be bad!

If we look at that ROC curve in 3d:
The same ROC curve in 3d showing the threshold's relationship with the curve
From this we estimate that the best threshold is roughly -0.018, maximising the number of bad domains caught while minimising disruption from good domains that merely look a bit suspicious.

Looking at the false negatives was illuminating. Here is a small sample:

dynserv
bowupem
ikrginalcentricem
gentlemanwritten
qmigfordlinnetavox
jfbjinalcentricem
osbumen
vxheellefrictionlessv

These show a level of sophistication greater than those generated by a DGA. They contain normal (or normal-sounding) words. Words like frictionless appear in a number of these domains, adding an air of authenticity. Indeed, gentlemanwritten.net sounds positively respectable.

The false positives were even more interesting. Here is a very small selection:

95a49f09385f5fb73aa3d1e994314a45b8d51f17
mhtjwmxf
wlhzfpgs
kztudyya

These look awfully suspicious. In fact, attempts to look them up with whois reveal nothing at all, which is odd.

Conclusion

We're now getting better results with a solution far less computationally expensive than neural nets, but we had always assumed the training data was clean. In fact, it appears to be reasonably clean but not pristine. On the upside, this means that if we clean the data we can reasonably expect our false positive rate to go down even further.

