When generating the images for the dataset, researchers have to make important choices in two difficult areas: which type of available information to use as the basis for the images, and which technique, or techniques, to employ to generate them. These two areas are largely independent but can influence each other.
There are three distinct macro approaches to obtaining information from a malware sample: static visualization, dynamic visualization, and hybrid visualization.
Macro Approaches
- Static visualization of malware entails representing the content of malicious software, either in the form of hexadecimal binary code or source code written in assembly format.
- Dynamic-based visualization techniques involve executing files in a sandbox environment to extract relevant features representing malware behavior. Dynamic features such as API calls, memory dumps, and network traffic are widely used for malware visualization.
- Hybrid visualization aims to harness the advantages of both static and dynamic techniques.
Specific Techniques
- Raw Bytecode Images: These techniques take the binary malware file as the starting point for generating an image. The file is interpreted as a sequence of 8-bit unsigned integers, which are then rendered in grayscale: each integer value corresponds to a shade ranging from 0 (black) to 255 (white) in the resulting image. Nataraj et al. employed byte values from binary executables as grayscale values to generate images. The B2IMG conversion algorithm, introduced by Tekerek et al., offers a variant suitable for samples obtained as hex dumps. Similarly, opcodes can be interpreted as integers, focusing on the most common ones. The raw bytecode method, widely used in the literature, is versatile because it statically processes the information in diverse malware files.
- Run-time Information for Image Generation: These methods involve executing malware within a secure environment or sandbox and using the collected data to generate images. The collected data can contain network communication logs, file accesses, API calls, return values, call times, process numbers, system calls, resource consumption, registry keys, memory artifacts, etc. The most common technique monitors system calls. Shiva et al. assign a hexadecimal code to each extracted API call, resulting in a byte sequence that is interpreted as a grayscale image. Using run-time information enables researchers to use dynamic analysis as a starting point for visualization, with pros and cons similar to those of standard dynamic malware analysis.
- Metadata Information Images: Metadata is data embedded in the executable file that describes the file itself. It contains details about the executable’s composition, contents, and operations that allow the operating system and other software utilities to load, execute, and interact with it properly. Metadata can be found, for example, in the PE header of Windows executables or in APK files for Android. For PE files, this approach enhances understanding of the file structure by converting information like entry points, section addresses, and optional header details into visual elements. Using metadata can help researchers embed in the images information that specifically characterizes certain types of malware, making detection and classification easier while using less data than the previously described information-gathering techniques.
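The raw bytecode idea described above can be sketched in a few lines. This is a minimal illustration rather than the method of any cited author: the function name `bytes_to_grayscale_rows`, the fixed row width, and the zero padding of the last row are all assumptions made for the example.

```python
def bytes_to_grayscale_rows(data: bytes, width: int = 16, pad: int = 0) -> list[list[int]]:
    """Arrange bytes row by row into a grid of grayscale pixel values.

    Each byte is read as an unsigned 8-bit integer, so 0 maps to black
    and 255 maps to white. The width and padding value are illustrative
    assumptions, not values taken from the literature.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    pixels = list(data)  # iterating over bytes yields integers in 0..255
    remainder = len(pixels) % width
    if remainder:
        # Pad the final row so that every row has the same length.
        pixels.extend([pad] * (width - remainder))
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]


# Hypothetical seven-byte "file" laid out as a 4-pixel-wide image.
sample = bytes([0, 255, 16, 32, 64, 128, 200])
image = bytes_to_grayscale_rows(sample, width=4)
```

The resulting nested list can be handed to any imaging library to be saved as a grayscale picture; the choice of width changes the visual texture but not the underlying byte values.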
Image Generation
After obtaining the data needed, researchers create the samples’ image by using the following techniques:
- Space-Filling Curves (SFC): Space-filling curves, mathematical constructs that traverse and fill continuous spaces, are explored to overcome the locality problem by rearranging pixels in the image. An SFC is a method for transforming a multi-dimensional space into a one-dimensional continuum. It traverses every pixel within the multi-dimensional space, ensuring each cell is visited exactly once. Consequently, a space-filling curve establishes a linear sequence of points within the multi-dimensional space. The most intuitive technique to generate images, commonly known as the “line-by-line” or “carriage return” method, has been used by the majority of researchers. Using this method, the extracted information is translated into pixels arranged linearly. Once a line reaches the desired length, the subsequent pixels start a new line, culminating in a 2D image. Translating a binary file into an image using this method results in information loss and partially degrades locality: arranging bytes in rows causes line breaks to separate bytes that were originally adjacent in the code. Examples of other SFCs are Z-order, Gray-code, Hilbert curves, wrap-around, and H-curve. Although they outperform the standard line-by-line method, complete adherence to the locality principle in two-dimensional images remains challenging. SFCs, while an improvement, fall short of being a conclusive solution because they have limitations of their own: (a) they sacrifice contextual information inherent in the original data, (b) they inherently reduce dimensionality, and this reduction can introduce biases and distortions, and (c) as the dataset grows in size and complexity, the computational demands of processing and visualizing with SFCs may become overwhelming.
- Markov Images: Markov chains capture the statistical properties of pixel sequences forming patterns and, for this reason, can be used to tackle the locality problem by generating information-rich images. The produced images are optimal in representing information along a single dimension, which is how the malware binary itself is saved in memory. One approach to visually representing files is treating each byte in the file byte stream as a state, resulting in 256 possible states for each element. Assuming the next state depends solely on the current state, the byte sequence can be conceptualized as a Markov chain. This approach yields a matrix of dimensions 256 x 256, visually representing underlying malware characteristics. However, there are also some limitations to consider: (a) the conversion from raw data to a Markov image might lead to some loss of information, and (b) the choice of features affects the ability to distinguish files.
- Hashing Schemes: Hashing algorithms designed to produce similar hashes when given similar inputs can help mitigate the locality problem. This technique considers a small data window (e.g., a few bytes) and generates a hash based on that window. By analyzing the distribution of these hash values, researchers can identify regions where similar malware variants might reside. One such algorithm, MinHash, can create a concise visual summary or fingerprint of malware samples. Another hashing algorithm used to the same effect is SimHash. Hashing algorithms are better at tackling the locality problem than the line-by-line method but still suffer from it.
- Entropy Images: Shannon entropy, a cornerstone of information theory, is a key metric for measuring information and uncertainty. In the realm of malware analysis, researchers extensively leverage entropy to scrutinize the tactics employed by malware authors for code obfuscation or concealment. Notably, packed or obfuscated code manifests higher entropy compared to its normal counterpart, offering a discernible pattern that can be exploited to distinguish between obfuscated and non-obfuscated code in samples. This distinctive entropy stream encapsulates a shared pattern within a malware family, motivating a more in-depth analysis of the entropy stream, and consequently the entropy graph, to extract pertinent features for effective malware identification.
- Colored Images: The use of RGB and other colored images for detection leverages their capacity to convey richer texture data compared to grayscale counterparts, thus enhancing detection rates. RGB images offer versatility by encoding diverse representations across the three color channels. Two main strategies for RGB image generation exist: one involves utilizing RGB channels with previously discussed techniques, while the other utilizes a single technique and represents it as an RGB image, often utilizing colormaps. Another approach involves overlaying various binary sections to resize images consistently. Though this method doesn’t augment feature extraction or classification, it facilitates resizing to a standardized dimension.
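To make the Z-order curve mentioned under space-filling curves more concrete, the sketch below places a byte sequence on a square grid by de-interleaving the bits of each byte's position. The helper names `morton_decode` and `z_order_image`, and the choice to leave unused cells at zero, are assumptions made for this illustration; the other curves listed above (Hilbert, Gray-code, H-curve) would need different index mappings.

```python
def morton_decode(index: int) -> tuple[int, int]:
    """Split a Z-order (Morton) index into (x, y) by de-interleaving its bits.

    Even-numbered bits of the index form x, odd-numbered bits form y.
    """
    x = y = 0
    bit = 0
    while index:
        x |= (index & 1) << bit
        index >>= 1
        y |= (index & 1) << bit
        index >>= 1
        bit += 1
    return x, y


def z_order_image(data: bytes, order: int) -> list[list[int]]:
    """Lay bytes out along a Z-order curve on a (2**order) x (2**order) grid.

    Bytes beyond the grid capacity are ignored and unused cells stay 0;
    both choices are assumptions made for this sketch.
    """
    side = 1 << order
    image = [[0] * side for _ in range(side)]
    for i, value in enumerate(data[: side * side]):
        x, y = morton_decode(i)
        image[y][x] = value
    return image


# Sixteen consecutive byte values placed on a 4 x 4 grid.
grid = z_order_image(bytes(range(16)), order=2)
```

In the resulting grid, bytes 0 to 3 occupy the top-left 2 x 2 block and bytes 4 to 7 the block to its right, which shows how neighbouring bytes tend to stay close in both dimensions instead of being split by a line break.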
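The Markov image construction described above, with each byte value as a state and a 256 x 256 transition matrix, could look roughly like the following. This is a sketch under assumptions: the function name `markov_image` is hypothetical, rows are normalized to transition probabilities, and rows for byte values that never occur are left at zero.

```python
def markov_image(data: bytes) -> list[list[float]]:
    """Build a 256 x 256 matrix of byte-to-byte transition probabilities.

    Entry [a][b] estimates the probability that byte value b directly
    follows byte value a in the input. Rows with no observed transitions
    stay all zero (an assumption made for this sketch).
    """
    counts = [[0] * 256 for _ in range(256)]
    # Count every pair of adjacent bytes (current state, next state).
    for current, following in zip(data, data[1:]):
        counts[current][following] += 1

    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Tiny hypothetical input: 'A' is followed by 'B' twice and by 'C' once.
probs = markov_image(b"ABABAC")
```

To display the matrix as a picture, the probabilities would still need to be scaled to a pixel range such as 0 to 255; how to scale them is a design choice the text above does not specify.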
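The entropy stream mentioned under entropy images can be approximated by computing Shannon entropy over consecutive chunks of the file. In the sketch below, the function names and the non-overlapping 256-byte window are assumptions for illustration; published approaches may use other window sizes or overlapping windows.

```python
import math
from collections import Counter


def shannon_entropy(block: bytes) -> float:
    """Return the Shannon entropy of a byte block in bits per byte (0.0 to 8.0)."""
    if not block:
        return 0.0
    total = len(block)
    # H = sum over observed byte values of p * log2(1 / p)
    return sum((n / total) * math.log2(total / n) for n in Counter(block).values())


def entropy_stream(data: bytes, window: int = 256) -> list[float]:
    """Compute entropy over consecutive, non-overlapping windows of the input."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [shannon_entropy(data[i:i + window]) for i in range(0, len(data), window)]


# Hypothetical input: a constant region followed by a region that
# contains every byte value exactly once.
low = bytes(256)
high = bytes(range(256))
stream = entropy_stream(low + high, window=256)
```

The constant region yields the minimum entropy and the region with uniformly distributed byte values yields the maximum of 8 bits per byte, which mirrors the contrast between plain and packed or obfuscated code described above.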
Examples of Generated Images with Different Techniques