Detecting malware domains by syntax heuristics
February 1, 2012
An important challenge we face when feeding our Open Source IP Reputation System is to differentiate between real threats and false positives.
However, nothing in the universe is black or white. Each IP in the database has a reliability value from 1 to 10. That’s because in some special scenarios, an IP can be good and bad at the same time (e.g. shared hosting with malware, or dynamic IPs).
When we had a bunch of malicious domains pointing to malware IPs, we realized that most of them had something in common. Take a look at these 10 domains:
They are similar in some way, aren’t they?
This list of domains was generated with downatool2, a tool that emulates Conficker’s C&C domain generator. Several pieces of malware, and the bad guys behind them, register domains in more or less the same way, with similar algorithms and results.
How do we distinguish these kinds of malware domains from the rest using only syntax analysis?
We need to keep in mind that detecting false domains with 100% accuracy is really quite difficult. We do not know if the domain is pointing to a legitimate site, and it is hard to design a perfect algorithm to match malware domains for all generators without some legitimate sites with weird domains getting swept up by mistake.
As part of our IP Reputation Engine, we’ve developed an algorithm that achieves an acceptable ratio of good detections to false positives. Simply put, it’s a Python library, linked at the end of the article along with more stuff.
In the domains listed above we can see that they have a lot of consonants and only a few vowels (uzabfgqfk.my). This isn’t common in normal domains. To develop the algorithm, the first thing we did was count how many consecutive consonants a domain has and, if it has more than X, mark it as a malware domain.
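The consonant check described above can be sketched roughly as follows. This is a minimal illustration, not the library's actual code: the function names, the threshold of 4 (the article only calls it X), the choice to treat "y" as a consonant, and the decision to look only at the first label of the domain are all assumptions.

```python
import re

# Assumption: the article does not give the value of X; 4 is illustrative only.
MAX_CONSONANT_RUN = 4

# Assumption: 'y' counts as a consonant; digits and hyphens break a run.
CONSONANT_RUN = re.compile(r"[bcdfghjklmnpqrstvwxyz]+")


def longest_consonant_run(domain):
    """Length of the longest run of consecutive consonants in the first
    label of the domain (the part before the first dot)."""
    label = domain.lower().split(".")[0]
    runs = CONSONANT_RUN.findall(label)
    return max((len(run) for run in runs), default=0)


def looks_generated(domain, threshold=MAX_CONSONANT_RUN):
    """Flag the domain when its longest consonant run exceeds the threshold."""
    return longest_consonant_run(domain) > threshold


if __name__ == "__main__":
    for name in ("uzabfgqfk.my", "google.com"):
        print(name, longest_consonant_run(name), looks_generated(name))
```

With these assumptions, uzabfgqfk.my has a longest run of six consonants ("bfgqfk") and is flagged, while google.com is not.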
After that, we realized that removing common words (like “and”, “or”, “page”, “free”, ...) and then doing the same check could improve the detection ratio. This is because generated domains do not include human words. It also helped us eliminate some false positives.
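A rough sketch of the word-removal step follows, under stated assumptions: the word list contains only the four examples quoted above (the real list is presumably much longer), the threshold of 4 is illustrative, and each removed word is replaced with a space so that the consonants on either side of it are not glued into one longer run. None of these details are confirmed by the article.

```python
import re

# Assumption: only the four example words from the article; the real list is longer.
COMMON_WORDS = ("and", "or", "page", "free")

# Assumption: illustrative threshold, the article only calls it X.
MAX_CONSONANT_RUN = 4

CONSONANT_RUN = re.compile(r"[bcdfghjklmnpqrstvwxyz]+")


def strip_common_words(label, words=COMMON_WORDS):
    """Replace each common word with a space, longest words first."""
    for word in sorted(words, key=len, reverse=True):
        label = label.replace(word, " ")
    return label


def longest_run(text):
    """Length of the longest run of consecutive consonants in the text."""
    return max((len(run) for run in CONSONANT_RUN.findall(text)), default=0)


def looks_generated(domain, threshold=MAX_CONSONANT_RUN):
    """Apply the consonant check after removing common words from the first label."""
    label = domain.lower().split(".")[0]
    return longest_run(strip_common_words(label)) > threshold


if __name__ == "__main__":
    for name in ("worldfree.com", "uzabfgqfk.my"):
        print(name, looks_generated(name))
```

In this sketch, worldfree.com contains the five-consonant run "rldfr" and would be flagged by the plain check, but once "free" and "or" are removed it no longer is. uzabfgqfk.my contains none of the words and is still flagged.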
In the real world.
We are going to generate a huge list of possible malware domains with downatool2 and test the code with it.
$ ./check_domain_heur.py domains_conficker.txt
30696 / 50000
61 per cent matched
61% of the domains are detected as malicious with syntax analysis alone. That isn’t bad, but what about false positives?
Alexa provides a big list of the most visited websites. We can presume that none of them is a malware domain, so let’s try it.
$ ./check_domain_heur.py alexa.txt
104643 / 1000000
10 per cent matched
With this approach, we have a 10% false positive rate. It isn’t perfect, but 61% successful matches against 10% false positives is, for now, quite respectable. We are still working to improve the algorithm and the list of common words as much as we can. Please note that currently, this should be considered more as a proof of concept than a stable release.
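For readers who want to reproduce this kind of measurement with their own heuristic, the counting behind the two runs above might look something like the sketch below. The function names are hypothetical and this is not the code of check_domain_heur.py; the only thing taken from the article is the output shape, where the percentage appears to be truncated to an integer (30696 / 50000 is shown as 61).

```python
def summarize(lines, is_suspicious):
    """Count how many non-empty lines are flagged by the given predicate.

    Returns (matched, total, percent), with percent truncated to an integer.
    """
    total = 0
    matched = 0
    for line in lines:
        domain = line.strip()
        if not domain:
            continue  # skip blank lines
        total += 1
        if is_suspicious(domain):
            matched += 1
    percent = matched * 100 // total if total else 0
    return matched, total, percent


def format_summary(matched, total, percent):
    """Render the counts in the same shape as the output shown above."""
    return "%d / %d\n%d per cent matched" % (matched, total, percent)


if __name__ == "__main__":
    # Toy predicate for demonstration only: flags first labels longer than 8 characters.
    sample = ["uzabfgqfkx.my\n", "google.com\n", "\n"]
    result = summarize(sample, lambda d: len(d.split(".")[0]) > 8)
    print(format_summary(*result))
```

Any predicate that takes a domain string and returns a boolean can be passed in, so the same counting works for both the generated list and the list of popular sites.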
You can download the code from here.