Research in the field focuses primarily on a few datasets widely recognized as benchmarks, occasionally extending the original dataset with more recent samples. Careful dataset selection is essential, as it lays the groundwork for rigorous and dependable research. The datasets most frequently referenced in the literature are MalImg and Big2015. MalImg, introduced by Nataraj et al. in 2011, is the older of the two.
This dataset comprises 25 imbalanced families totaling 9,339 malware samples, all provided as grayscale images. The images are generated with the line-by-line technique detailed in Section X. State-of-the-art results on MalImg reach roughly 99% accuracy.
| Family | Number of Samples | Class |
|---|---|---|
| Allaple.L | 1591 | Worm |
| Allaple.A | 2949 | Worm |
| Yuner.A | 800 | Worm |
| Lolyda.AA1 | 213 | PWS |
| Lolyda.AA2 | 184 | PWS |
| Lolyda.AA3 | 123 | PWS |
| C2LOP.P | 146 | Trojan |
| C2LOP.gen!g | 200 | Trojan |
| Instantaccess | 431 | Dialer |
| Swizzor.gen!I | 132 | TDownloader |
| Swizzor.gen!E | 128 | TDownloader |
| VB.AT | 408 | Worm |
| Fakerean | 381 | Rogue |
| Alueron.gen!J | 198 | Trojan |
| Malex.gen!J | 136 | Trojan |
| Lolyda.AT | 159 | PWS |
| Adialer.C | 122 | Dialer |
| Wintrim.BX | 97 | TDownloader |
| Dialplatform.B | 177 | Dialer |
| Dontovo.A | 162 | TDownloader |
| Obfuscator.AD | 142 | TDownloader |
| Agent.FYI | 116 | Backdoor |
| Autorun.K | 106 | Worm |
| Rbot!gen | 158 | Backdoor |
| Skintrim.N | 80 | Trojan |
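
The degree of imbalance in MalImg can be quantified with a short calculation. The following is a minimal Python sketch; the helper name `imbalance_summary` is hypothetical and chosen for illustration, and only three families from the table are passed in to keep the example short.

```python
def imbalance_summary(counts):
    """Return (total samples, largest/smallest family ratio, per-family share)."""
    total = sum(counts.values())
    largest = max(counts.values())
    smallest = min(counts.values())
    shares = {family: n / total for family, n in counts.items()}
    return total, largest / smallest, shares


# Illustrative subset: the two largest MalImg families and the smallest one.
subset = {"Allaple.A": 2949, "Allaple.L": 1591, "Skintrim.N": 80}
total, ratio, shares = imbalance_summary(subset)
print(f"largest/smallest ratio: {ratio:.1f}")  # 36.9

# Share of the whole dataset (9,339 samples) held by the two Allaple families.
allaple_share = (2949 + 1591) / 9339
print(f"Allaple share of MalImg: {allaple_share:.1%}")  # 48.6%
```

With the two Allaple families holding nearly half of all samples, plain accuracy is dominated by the majority classes, so per-family metrics are worth reporting alongside it.
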
Big2015, developed by Microsoft, consists of 9 imbalanced families amounting to around 20,000 samples. In the original challenge, researchers received 10,868 labelled samples, while the remainder served as a withheld test set. In the papers reviewed, most authors use only these 10,868 samples, dividing them further into the training and test subsets they require. The samples are PE executables provided without the PE header. State-of-the-art results on Big2015 are approximately 99% accuracy.
| Family | Number of Samples | Class |
|---|---|---|
| Ramnit | 1541 | Worm |
| Lollipop | 2478 | Adware |
| Kelihos_ver3 | 2942 | Backdoor |
| Vundo | 475 | Trojan |
| Simda | 42 | Backdoor |
| Tracur | 751 | Downloader |
| Kelihos_ver1 | 398 | Backdoor |
| Obfuscator.ACY | 1228 | Obfuscated malware |
| Gatak | 1013 | Backdoor |
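
Because most authors split the labelled Big2015 samples into their own training and test subsets, and the families are heavily imbalanced (Simda has only 42 samples), a per-family (stratified) split keeps every family represented in both subsets. The following is a minimal sketch of one way to do this, not the procedure of any particular paper; the function name `stratified_split`, the 20% test fraction, and the seed are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample indices into (train, test), family by family."""
    by_family = defaultdict(list)
    for index, family in enumerate(labels):
        by_family[family].append(index)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for indices in by_family.values():
        rng.shuffle(indices)
        # At least one test sample when the family has two or more samples,
        # and never the whole family.
        n_test = max(1, round(len(indices) * test_fraction))
        n_test = min(n_test, len(indices) - 1)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test


# Illustrative labels for three Big2015 families, using the table's counts.
labels = ["Ramnit"] * 1541 + ["Lollipop"] * 2478 + ["Simda"] * 42
train, test = stratified_split(labels, test_fraction=0.2, seed=42)
simda_in_test = sum(1 for i in test if labels[i] == "Simda")
print(len(train), len(test), simda_in_test)
```

A purely random split could leave a family as small as Simda with very few test samples. If scikit-learn is available, `train_test_split` with its `stratify` argument serves the same purpose.
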
In the realm of Android, the commonly used datasets include Malgenome, Drebin, AndroZoo, and Contagio Mobile. Authors also often use supplementary datasets, such as those offered by the Canadian Institute for Cybersecurity (CIC), Ember2018, and Dumpware10. Alternatively, they construct their own datasets from malware repositories such as VirusTotal, MalwareBazaar, VirusShare, or VX-Underground. Relying on malware repositories has both advantages and disadvantages: recently collected samples allow a dataset to be tailored to the authors’ specific requirements, but the results obtained may not be comparable with those of other models. A more prudent approach is therefore to test the model on a benchmark dataset such as Big2015 or MalImg alongside an original dataset built from fresh samples.