Feature Extraction and Detection of Malware Using Machine Learning

Last edited: January 1, 1970

Malicious software, or malware for short, is a growing threat to information systems worldwide. For years, cyber-security specialists have relied on traditional signature-based techniques, a form of static analysis, to detect malware. However, the rapid evolution of malware and the constant changes in its nature have made these classic methods inefficient, pushing cyber-security toward data-driven solutions.

Motivated by these problems, my team and I started the Malware Revealer project, both to address the lack of data tools for building on-demand datasets from binary files and to advance machine learning research in the field of malware detection.

Solving the malware detection problem with machine learning follows the pipeline illustrated in the figure above. The training phase starts with collecting the benign and malicious binaries used to train the ML model; in our case, we used benign files from a fresh Windows installation and malware samples provided by VirusTotal. These binaries are processed by a feature extractor, which extracts characteristics of each binary (e.g. file size, section information); these features are what is actually fed to the model during training. Training itself may use different algorithms and produces a predictive model that, given an unknown binary file, can tell whether it is malware or benign. The prediction phase uses this model to draw inferences. It resembles the training phase in that both start by extracting features, but instead of training the model on those features, we use them to predict the nature of the unknown binary.
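To make the two phases concrete, here is a minimal sketch in plain Python. It is not Malware Revealer's actual code: the two features (log file size and byte entropy), the nearest-centroid classifier, and the function names are all assumptions chosen to keep the example dependency-free, and the training samples are synthetic byte strings standing in for real binaries.

```python
from __future__ import annotations

import math
from collections import Counter


def extract_features(data: bytes) -> list[float]:
    """Turn raw bytes into a small numeric vector (illustrative features only)."""
    size = len(data)
    counts = Counter(data)
    # Shannon entropy of the byte distribution, in bits per byte (0 to 8).
    entropy = -sum((c / size) * math.log2(c / size) for c in counts.values()) if size else 0.0
    return [math.log2(size + 1), entropy]


def train(samples: list[bytes], labels: list[str]) -> dict[str, list[float]]:
    """Training phase: compute one mean feature vector (centroid) per label."""
    grouped: dict[str, list[list[float]]] = {}
    for data, label in zip(samples, labels):
        grouped.setdefault(label, []).append(extract_features(data))
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def predict(model: dict[str, list[float]], data: bytes) -> str:
    """Prediction phase: extract the same features, return the nearest centroid's label."""
    features = extract_features(data)
    return min(model, key=lambda label: math.dist(features, model[label]))


# Synthetic stand-ins for real binaries: repetitive bytes vs. uniformly spread bytes.
samples = [b"A" * 1024, b"hello world " * 100, bytes(range(256)) * 4, bytes(range(256)) * 8]
labels = ["benign", "benign", "malware", "malware"]
model = train(samples, labels)
```

Calling `predict(model, some_bytes)` then runs the same extractor on an unseen input and returns a label. A real model would use far richer features and a proper learning algorithm; high entropy alone does not indicate malware.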

Malware Revealer plays a role in the extraction, training and prediction phases. It provides a modular and extensible extractor that lets you extract the features you need and easily add new ones. You can also find training notebooks that show how we trained our ML models. We have also implemented an application that exposes an API for both predictions and extractions. A client app that makes use of the API can be found here.
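One common way to build a modular, extensible extractor is a registry of small feature functions. The sketch below illustrates that idea only; the `FEATURES` registry, the `feature` decorator, and the `extract` function are hypothetical names, not Malware Revealer's real interface.

```python
from typing import Callable, Dict, Iterable, Optional

# Hypothetical registry mapping a feature name to the function that computes it.
FEATURES: Dict[str, Callable[[bytes], float]] = {}


def feature(name: str):
    """Decorator that registers a feature function under the given name."""
    def register(func: Callable[[bytes], float]) -> Callable[[bytes], float]:
        FEATURES[name] = func
        return func
    return register


@feature("file_size")
def file_size(data: bytes) -> float:
    return float(len(data))


@feature("has_mz_header")
def has_mz_header(data: bytes) -> float:
    # PE files begin with the two-byte DOS signature "MZ".
    return 1.0 if data[:2] == b"MZ" else 0.0


def extract(data: bytes, names: Optional[Iterable[str]] = None) -> Dict[str, float]:
    """Run only the requested features (all registered ones by default)."""
    selected = FEATURES if names is None else {n: FEATURES[n] for n in names}
    return {name: func(data) for name, func in selected.items()}
```

With this layout, adding a feature means writing one decorated function, and a caller picks a subset with, for example, `extract(data, ["file_size"])`.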

Malware Revealer aims to give the cyber-security community a working tool to overcome the lack of datasets, as well as a working tool for static malware analysis using a machine learning approach. It works on PE files for now. In the long term, however, the project aims to introduce dynamic feature analysis with sandboxes and to normalize the features of the most common formats into a single feature set, so that it can cover a wide variety of file formats: malicious files are not just PE files, malware also lives as ELFs, APKs ...
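As a rough illustration of what normalizing formats into a single feature set could look like, the sketch below detects the container format from its leading magic bytes and renames format-specific header fields to shared feature names. The `FIELD_MAP` table, the `section_count` name, and both functions are assumptions made for this example; the project's eventual design may differ.

```python
from typing import Dict

# Leading magic bytes of each container format. An APK is a ZIP archive,
# so the ZIP signature alone does not prove that a file is an APK.
MAGIC = {b"MZ": "PE", b"\x7fELF": "ELF", b"PK\x03\x04": "ZIP"}

# Hypothetical mapping from format-specific header fields to shared feature names.
# NumberOfSections is a PE COFF header field; e_shnum is an ELF header field.
FIELD_MAP = {
    "PE": {"NumberOfSections": "section_count"},
    "ELF": {"e_shnum": "section_count"},
}


def detect_format(data: bytes) -> str:
    """Return the format whose magic bytes prefix the data, or 'unknown'."""
    for magic, name in MAGIC.items():
        if data.startswith(magic):
            return name
    return "unknown"


def normalize(fmt: str, raw: Dict[str, int]) -> Dict[str, int]:
    """Keep only mapped fields and rename them to the shared feature names."""
    mapping = FIELD_MAP.get(fmt, {})
    return {mapping[key]: value for key, value in raw.items() if key in mapping}
```

For instance, `normalize("PE", {"NumberOfSections": 5})` and `normalize("ELF", {"e_shnum": 5})` both yield `{"section_count": 5}`, so a single model could consume features from either format.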

Malware Revealer is eager to see new contributors :) so don't hesitate to open a PR or an issue if you have any ideas.

Team