Tuesday, March 19, 2019

Spotting Dodgy Domain Names


This post looks at various approaches that employ machine learning to differentiate between good domain names and bad ones. By bad, I mean domains used to trick people into thinking they're clicking on a legitimate address (www.goog1e.com, for instance).

Data

The data is the top 1 million domain names as recorded by Cisco. You can find it here.

The data was stripped of top-level domains to remove elements that were not useful.

What is left was then converted to character bigrams and one-hot encoded, leading to vectors of length 1444 (that is, 38 x 38: with 38 permitted characters there are 1444 possible bigrams). The code for this lives here.
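
In outline, the encoding looks something like this (taking the 38 characters to be the 26 letters, 10 digits, hyphen and dot, and recording bigram presence rather than counts; both are illustrative choices on my part):

    import numpy as np

    # Assumed alphabet: 26 letters + 10 digits + hyphen + dot = 38 characters.
    ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-."
    INDEX = {c: i for i, c in enumerate(ALPHABET)}
    N = len(ALPHABET)  # 38

    def encode(domain):
        # One vector element per possible bigram; flag the ones that occur.
        v = np.zeros(N * N)
        for a, b in zip(domain, domain[1:]):
            v[INDEX[a] * N + INDEX[b]] = 1.0
        return v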

This data set was then split 95/5 into training and holdout sets.
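
For example, with vectors being the matrix of encoded domains:

    import numpy as np

    idx = np.random.permutation(len(vectors))
    cut = int(0.95 * len(vectors))
    train, holdout = vectors[idx[:cut]], vectors[idx[cut:]]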

We created test data from this holdout data by deliberately corrupting it: a single 'o' was changed to a '0', or a single 'l' to a '1'. If a domain had no such characters to corrupt, it was discarded.
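
One way to write the corruption (which occurrence gets changed is an arbitrary choice here):

    import random

    def corrupt(domain):
        # Positions of characters that have a look-alike digit.
        candidates = [i for i, c in enumerate(domain) if c in "ol"]
        if not candidates:
            return None  # nothing to corrupt: discard this data point
        i = random.choice(candidates)
        swap = "0" if domain[i] == "o" else "1"
        return domain[:i] + swap + domain[i + 1:]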

Kullback-Leibler Results

These results actually use a variant of KL divergence that handles zeros in the data: the Jensen-Shannon metric.
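
Written out, the zero handling is the whole point: the mixture m is positive wherever either input is, so no term ever divides by zero. (Note that the vectors here are not required to sum to one, which is why some of the scores below are so large.)

    import numpy as np

    def kl(p, q):
        # Terms with p == 0 contribute nothing (0 log 0 is taken as 0).
        mask = p > 0
        return np.sum(p[mask] * np.log(p[mask] / q[mask]))

    def jensen_shannon(p, q):
        m = 0.5 * (p + q)
        return 0.5 * kl(p, m) + 0.5 * kl(q, m)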

In the following histograms, red indicates bad domains and green good ones.

The tables represent the entire holdout ("Good") and the entire test ("Bad") data sets with their Jensen-Shannon metric calculated against the training data.

Note that these metrics are calculated by summing the columns of the data sets. This gives a somewhat unrepresentative description of the data, since the original is one-hot encoded: in any block of 38 elements of a real vector, only one can be 1 and all the rest must be 0. That is, the elements in a vector for a given domain are not independent of each other.
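
In outline, then, the scoring runs along these lines (jensen_shannon as above; scoring each row individually for the histograms is my reading of the plots):

    import numpy as np

    def class_score(baseline, data):
        # The scores in the tables: sum a data set's columns into a single
        # vector and compare it against the training baseline.
        return jensen_shannon(data.sum(axis=0), baseline)

    def row_scores(baseline, data):
        # The scores in the histograms: one value per domain vector.
        return np.array([jensen_shannon(row, baseline) for row in data])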

No normalisation

KL Score Histogram with no normalisation
Note that "+4.914e6" in the bottom right hand corner. Indeed the KL scores are close:

Class    KL Score
Good     4 319 810.24
Bad      4 169 380.40

There's a hair's breadth between them, so the classes are probably going to be hard to differentiate.

L1-Normalise everything

Here, the control group's KL was 6370.17 and the bad domains scored 6370.92 - very close. The histograms unsurprisingly look similar:

KL Score Histogram with all vectors L1-Normalised
Hmm, still not much to work with, so let's try combinations of the two. First:

Normalised Baselines, No Normalisation for Others

In this trial, the baseline is L1-normalised but the other vectors are not.
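
In code, reusing the helpers above (train, holdout and corrupted are placeholder names for the three matrices):

    import numpy as np

    def l1(v):
        return v / np.abs(v).sum()

    baseline = l1(train.sum(axis=0))             # only the baseline is normalised
    good_scores = row_scores(baseline, holdout)  # all other vectors stay raw
    bad_scores = row_scores(baseline, corrupted)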

Class    KL Score
Good     97 285.45
Bad      139 889.62

The histogram for the holdout and bad domains now looks like:

KL Score Histogram with the baseline L1 normalised; all other vectors unnormalised
This is good. There are now two distinct distributions with separate peaks.

L2-normalisation gave very similar KL scores and a graph that looked like:

KL Score Histogram with the baseline L2-normalised; all other vectors unnormalised
Let's try:

Unnormalised Baseline, L1-Normalisation for Others

... and it looks like we're back to square one. The KL scores are amazingly close:

Class    KL Score
Good     4 914 404.73
Bad      4 914 407.38

and, not surprisingly, so are the distributions of the holdout and test data:

KL Score Histogram with the baseline unnormalised; all other vectors L1-normalised
Again, note the "+4.9144e6" in the bottom right-hand corner.

So, we seem to be going backwards.

Aside: Normalise then sum baseline, all others unnormalised

I tried a few other variations, like normalising the baseline's vectors before summing them. First with L1:

Class    KL Score
Good     773 262.74
Bad      704 254.41

KL Score Histogram with the baseline L1-normalised then summed; all other vectors unnormalised

then with L2:

Class    KL Score
Good     94 506.53
Bad      83 559.17

KL Score Histogram with the baseline L2-normalised then summed; all other vectors unnormalised
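
The variation in code, with the same placeholder names as before:

    # Normalise each training vector first, then sum them into the baseline;
    # all other vectors stay unnormalised.
    baseline = sum(l1(row) for row in train)
    good_scores = row_scores(baseline, holdout)
    bad_scores = row_scores(baseline, corrupted)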

But these variations didn't give me a difference in KL scores as good as the "Normalised Baselines, No Normalisation for Others" results, so I stuck with those; that is the distribution I discuss in the next section.

The ROC

Running our model against the test data, the ROC looks somewhat underwhelming:

ROC for "Normalised Baselines, No Normalisation for Others" KL

Or, in 3D, where we can see the threshold value:


Here, a threshold value of about 10 is where the curve comes closest to the top left-hand corner.
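
For the record, the curve and that threshold can be recovered with scikit-learn along these lines (good_scores and bad_scores as in the sketches above; a higher score means more suspicious):

    import numpy as np
    from sklearn.metrics import roc_curve, auc

    # Holdout (good) domains are labelled 0, corrupted (bad) ones 1.
    scores = np.concatenate([good_scores, bad_scores])
    labels = np.concatenate([np.zeros(len(good_scores)),
                             np.ones(len(bad_scores))])

    fpr, tpr, thresholds = roc_curve(labels, scores)
    print("AUC =", auc(fpr, tpr))

    # The best operating point is the threshold closest to the top left corner.
    best = np.argmin(np.hypot(fpr, 1.0 - tpr))
    print("threshold =", thresholds[best])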

It seems clear that our tool cannot determine with great confidence whether a domain name is suspect. But then, could a human? Which of these domain names would you say are bogus and which are genuine?

mxa-00133b02.gslb.pphosted.com
m7.mwlzwwr.biz
live800plus.jp
lf.wangsu.cmcdn.cdn.10086.cn
x10.mhtjwmxf.com
mailex1.palomar.edu
3gppnetwork.org
mimicromaxfinallb-1513904418.ap-south-1.elb.amazonaws.com
mkt4137.com
modt1thf4yr7dff-yes29yy7h9.stream
jj40.com

This is, of course, a trick question. They're all genuine domain names that Cisco have logged.

However, if our tool is used as part of a suite of metrics it might identify nefarious activity.

Conclusion

Our tool is definitely better than the monkey score (that is, random guessing), but can we improve it? I have a neural net that looks promising but is computationally very expensive. The KL calculations (and variants of them) are very fast and cheap. I'll compare them to a neural net solution in another post.
