Datasets

Research in the field primarily focuses on a few datasets widely recognized as benchmarks, occasionally supplementing them with an original dataset that includes more recent samples. Careful dataset selection is essential, as it lays the groundwork for rigorous and dependable research. The datasets most frequently referenced in the literature are MalImg and Big2015. MalImg, introduced by Nataraj et al. in 2011, is the oldest dataset in this context.

MalImg

This dataset comprises 9,339 malware samples from 25 imbalanced families, all provided as grayscale images. The images are generated with the line-by-line technique detailed in Section X. State-of-the-art classification accuracy on MalImg is around 99%.

Family            Number of Samples   Class
Allaple.L         1591                Worm
Allaple.A         2949                Worm
Yuner.A           800                 Worm
Lolyda.AA1        213                 PWS
Lolyda.AA2        184                 PWS
Lolyda.AA3        123                 PWS
C2LOP.P           146                 Trojan
C2LOP.gen!g       200                 Trojan
Instantaccess     431                 Dialer
Swizzor.gen!I     132                 TDownloader
Swizzor.gen!E     128                 TDownloader
VB.AT             408                 Worm
Fakerean          381                 Rogue
Alueron.gen!J     198                 Trojan
Malex.gen!J       136                 Trojan
Lolyda.AT         159                 PWS
Adialer.C         122                 Dialer
Wintrim.BX        97                  TDownloader
Dialplatform.B    177                 Dialer
Dontovo.A         162                 TDownloader
Obfuscator.AD     142                 TDownloader
Agent.FYI         116                 Backdoor
Autorun.K         106                 Worm
Rbot!gen          158                 Backdoor
Skintrim.N        80                  Trojan
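
To make the degree of imbalance concrete, the short sketch below derives two figures from the numbers given above: the ratio between the largest and the smallest family, and the accuracy a trivial classifier would reach by always predicting the largest family. The variable names are illustrative only; the three counts are the dataset total and the Allaple.A and Skintrim.N rows of the table.

```python
# Counts taken from the MalImg description and table above.
TOTAL_SAMPLES = 9339      # all MalImg samples
LARGEST_FAMILY = 2949     # Allaple.A
SMALLEST_FAMILY = 80      # Skintrim.N

# Imbalance ratio: how many times larger the biggest family is than the smallest.
imbalance_ratio = LARGEST_FAMILY / SMALLEST_FAMILY

# Majority-class baseline: accuracy obtained by always predicting Allaple.A.
majority_baseline = LARGEST_FAMILY / TOTAL_SAMPLES

print(f"imbalance ratio:   {imbalance_ratio:.1f}")
print(f"majority baseline: {majority_baseline:.1%}")
```

The largest family is roughly 37 times the size of the smallest, and a trivial classifier would already reach about 32% accuracy, which suggests that the overall figure is best read alongside per-family results.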

Big2015

Big2015, released by Microsoft, consists of 9 imbalanced families, amounting to around 20,000 samples. In the original challenge, researchers received 10,868 labeled samples, while the remainder served as a withheld test set. In the surveyed papers, most authors use only these 10,868 samples, dividing them further into the subsets required for training and testing. The provided samples are PE executables with the PE header removed. State-of-the-art classification accuracy on Big2015 is approximately 99%.

Family            Number of Samples   Class
Ramnit            1541                Worm
Lollipop          2478                Adware
Kelihos_ver3      2942                Backdoor
Vundo             475                 Trojan
Simda             42                  Backdoor
Tracur            751                 Downloader
Kelihos_ver1      398                 Backdoor
Obfuscator.ACY    1228                Obfuscated malware
Gatak             1013                Backdoor
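
Because authors divide the provided samples into their own training and test subsets, and the smallest family (Simda) has only 42 samples, the way the split is drawn matters. The following is a minimal sketch of a stratified split using only the Python standard library. The 80/20 ratio, the seed, and the function name `stratified_split` are assumptions chosen for illustration, not a procedure prescribed by the challenge or by the surveyed papers.

```python
import random
from collections import Counter

# Per-family sample counts from the Big2015 table above.
FAMILY_COUNTS = {
    "Ramnit": 1541, "Lollipop": 2478, "Kelihos_ver3": 2942,
    "Vundo": 475, "Simda": 42, "Tracur": 751,
    "Kelihos_ver1": 398, "Obfuscator.ACY": 1228, "Gatak": 1013,
}

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with each family split at the same ratio."""
    rng = random.Random(seed)
    by_family = {}
    for idx, label in enumerate(labels):
        by_family.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_family.values():
        rng.shuffle(indices)
        # Keep at least one test sample per family, even for very small ones.
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# One label per sample; real code would pair each label with a file path.
labels = [family for family, n in FAMILY_COUNTS.items() for _ in range(n)]
train_idx, test_idx = stratified_split(labels)

test_counts = Counter(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), test_counts["Simda"])
```

With these assumed settings, every family appears in both subsets, and Simda contributes 8 of its 42 samples to the test subset. A purely random split offers no such guarantee for small families. Libraries such as scikit-learn provide equivalent behavior through the `stratify` argument of `train_test_split`.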

Other

For Android malware, the commonly employed datasets include Malgenome, Drebin, AndroZoo, and Contagio Mobile. Authors also often use supplementary datasets, such as those offered by the Canadian Institute for Cybersecurity (CIC), as well as Ember2018 and Dumpware10. Alternatively, they may construct their own datasets by drawing on malware repositories such as VirusTotal, MalwareBazaar, VirusShare, or VX-Underground. Using malware repositories has both advantages and disadvantages: recently acquired samples make it possible to build a dataset tailored to the authors’ specific requirements, yet the results obtained may not be comparable with those of other models. Hence, a more prudent approach is to test the model on a benchmark dataset such as Big2015 or MalImg in addition to an original dataset built from fresh samples.