These are various approaches employing machine learning to differentiate between good domain names and bad ones. By bad, I mean domains that are used to trick people into thinking they're clicking on a legitimate address (www.goog1e.com, for instance).
Data
The data is the top 1 million domain names as recorded by Cisco. You can find it here.
The data was stripped of the top-level domains to remove elements that were not useful.
Then, what is left was
This data set was then split in a 95/5 ratio of training to holdout.
We created test data from this holdout data by deliberately corrupting it: a single 'o' was changed to a '0', or a single 'l' to a '1'. If a domain contained neither character, the data point was discarded.
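As a rough sketch of that corruption step (the function name, and the choice of which 'o' or 'l' gets swapped, are my assumptions; the post only specifies that a single character is changed):

```python
import random

def corrupt(domain):
    """Swap a single 'o' for '0' or a single 'l' for '1'.
    Returns None when the domain contains neither character,
    in which case the data point is discarded."""
    positions = [i for i, c in enumerate(domain) if c in ("o", "l")]
    if not positions:
        return None
    i = random.choice(positions)          # assumption: the occurrence is picked at random
    swapped = "0" if domain[i] == "o" else "1"
    return domain[:i] + swapped + domain[i + 1:]
```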
Kullback-Leibler Results
In these results we actually use a variant of KL divergence that handles zeros in the data: the Jensen-Shannon metric.
In the following histograms, red indicates bad domains and green good ones.
The tables represent the entire holdout ("Good") and the entire test ("Bad") data sets with their Jensen-Shannon metric calculated against the training data.
Note that these metrics are calculated by summing the columns of the data sets.
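For concreteness, here is a minimal sketch of that calculation as I read it, assuming each data set is a matrix of per-domain character counts and the column sums give one vector per data set (the count representation and the function names are my assumptions, not the post's code):

```python
import numpy as np

def kl(p, q):
    """KL-style score sum(p_i * log(p_i / q_i)), treating terms with p_i == 0 as zero.
    Assumes q > 0 wherever p > 0, which holds when q is the mixture used below."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jensen_shannon(p, q):
    """Jensen-Shannon variant of KL: each vector is compared to their mixture,
    so zeros in one vector no longer blow up the divergence."""
    m = (p + q) / 2.0
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# The per-class scores in the tables would then be something like:
#   jensen_shannon(train_counts.sum(axis=0), holdout_counts.sum(axis=0))   # "Good"
#   jensen_shannon(train_counts.sum(axis=0), test_counts.sum(axis=0))      # "Bad"
```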
KL Score Histogram with no normalisation

Class | KL Score
Good | 4 319 810.24
Bad | 4 169 380.40
There's a hair's breadth between them, so it will probably be hard to differentiate the classes.
L1-Normalise everything
Here, the control group's KL was 6370.17 and the bad domains scored 6370.92 - very close. The histograms unsurprisingly look similar:
KL Score Histogram with all vectors L1-normalised
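For reference, a minimal sketch of what the normalisation steps mean here: L1-normalising divides a vector by the sum of its elements, L2-normalising (used in a later trial) by its Euclidean length.

```python
import numpy as np

def l1_normalise(v):
    """Scale a vector so its elements sum to 1."""
    return v / v.sum()

def l2_normalise(v):
    """Scale a vector to unit Euclidean length."""
    return v / np.linalg.norm(v)
```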
Normalised Baselines, No Normalisation for Others
In this trial, the baseline is L1-normalised but the other vectors are not.
Class | KL Score
Good | 97 285.45
Bad | 139 889.62
The histogram for the holdout and bad domains now looks like:
KL Score Histogram with the baseline L1-normalised; all other vectors unnormalised
L2-normalisation gave very similar KL scores and a graph that looked like:
KL Score Histogram with the baseline L2-normalised; all other vectors unnormalised
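A sketch of how this configuration might be scored per domain, reusing the helpers above: the baseline (the column sums of the training data) is L1-normalised while each holdout or test vector is left raw. The toy matrices, their shapes, and the per-domain loop are my assumptions about how the histograms were produced.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-ins for the real per-domain character-count matrices (hypothetical shapes).
train_counts   = rng.integers(0, 5, size=(1000, 38)).astype(float)
holdout_counts = rng.integers(0, 5, size=(50, 38)).astype(float)
test_counts    = rng.integers(0, 5, size=(50, 38)).astype(float)

baseline = l1_normalise(train_counts.sum(axis=0))                         # normalised baseline
good_scores = [jensen_shannon(baseline, row) for row in holdout_counts]   # holdout left raw
bad_scores  = [jensen_shannon(baseline, row) for row in test_counts]      # corrupted test left raw
```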
Unnormalised Baseline, L1-Normalisation for Others
... and it looks like we're back to square one. The KL scores are amazingly close:
Class | KL Score
Good | 4 914 404.73
Bad | 4 914 407.38
as, not surprisingly, are the distributions of the holdout and test data:
KL Score Histogram with the baseline unnormalised; all other vectors L1-normalised
So, we seem to be going backwards.
Aside: Normalise then sum baseline, all others unnormalised
I tried a few other variations like normalising then summing, first with L1:
Class | KL Score
Good | 773 262.74
Bad | 704 254.41
then L2:
Class | KL Score
Good | 94 506.53
Bad | 83 559.17
But their summed distributions didn't give me a difference in KL scores as good as the "Normalised Baselines, No Normalisation for Others" results, so I stuck with those and I discuss just that distribution in the next section.
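To make the distinction in this aside concrete, here is a sketch of the two orderings on the hypothetical training matrix from the earlier sketch: sum the raw counts and then normalise (used in the trials above), versus normalise each domain's vector and then sum.

```python
import numpy as np

# Sum-then-normalise: the baseline used in the main trials.
sum_then_normalise = l1_normalise(train_counts.sum(axis=0))

# Normalise-then-sum: each domain's vector is normalised first, then the vectors are summed.
normalise_then_sum = np.sum([l1_normalise(row) for row in train_counts], axis=0)
```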
The ROC
Running our model against the test data, the ROC looks somewhat underwhelming:
ROC for "Normalised Baselines, No Normalisation for Others" KL
Or, in 3D, where we can see the threshold value:
where a threshold value of about 10 is the closest the curve comes to the top-left corner.
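A sketch of how that threshold could be read off the curve, assuming higher scores indicate the corrupted class (consistent with "Bad" scoring higher in the table above) and reusing the per-domain scores from the earlier sketch; the use of scikit-learn here is my choice, not necessarily what the post used.

```python
import numpy as np
from sklearn.metrics import roc_curve

scores = np.concatenate([good_scores, bad_scores])
labels = np.concatenate([np.zeros(len(good_scores)), np.ones(len(bad_scores))])  # 1 = bad

fpr, tpr, thresholds = roc_curve(labels, scores)
# The best operating point is the one closest to the top-left corner (FPR 0, TPR 1).
best = np.argmin(np.hypot(fpr, 1 - tpr))
print("threshold:", thresholds[best], "TPR:", tpr[best], "FPR:", fpr[best])
```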
It seems clear that our tool cannot determine with great confidence whether a domain name is suspect or not. But then, could a human? Which of these domain names would you say are bogus and which are genuine?
mxa-00133b02.gslb.pphosted.com
m7.mwlzwwr.biz
live800plus.jp
lf.wangsu.cmcdn.cdn.10086.cn
x10.mhtjwmxf.com
mailex1.palomar.edu
3gppnetwork.org
mimicromaxfinallb-1513904418.ap-south-1.elb.amazonaws.com
mkt4137.com
modt1thf4yr7dff-yes29yy7h9.stream
jj40.com
This is of course a trick question. They're all genuine domain names that Cisco have logged.
However, if our tool is used as part of a suite of metrics, it might help identify nefarious activity.
Conclusion
Our tool is definitely better than the monkey score but can we improve it? I have a neural net that looks promising but is computationally very expensive. The KL calculations (and variants of them) are very fast and cheap. I'll compare them to a neural net solution in another post.