This dataset is composed of side-channel information (e.g., temperatures, voltages, utilization rates) from computing systems executing benign and malicious code. The intent of the dataset is to allow artificial intelligence tools to be applied to malware detection using side-channel information.
This dataset is part of my PhD research on malware detection and classification using Deep Learning. It contains static analysis data: Top-1000 imported functions extracted from the 'pe_imports' elements of Cuckoo Sandbox reports. PE malware examples were downloaded from virusshare.com. PE goodware examples were downloaded from portableapps.com and from Windows 7 x86 directories.
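As a rough illustration of how such a top-N import feature set could be built, the sketch below counts imported function names across parsed report dictionaries and turns each report into a binary vector. The report layout used here (`static` → `pe_imports` → `imports` → `name`) and the helper names `imported_functions`, `top_imports`, and `to_vector` are assumptions for illustration, not the dataset author's actual code.

```python
from collections import Counter


def imported_functions(report):
    """Return the set of imported function names found in one report dict.

    Assumes (hypothetically) that imports live under
    report["static"]["pe_imports"][i]["imports"][j]["name"].
    """
    names = set()
    for entry in report.get("static", {}).get("pe_imports", []):
        for imp in entry.get("imports", []):
            name = imp.get("name")
            if name:
                names.add(name)
    return names


def top_imports(reports, n=1000):
    """Return the n function names imported by the most reports."""
    counts = Counter()
    for report in reports:
        counts.update(imported_functions(report))
    # Sort by descending count, then by name, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:n]]


def to_vector(report, vocabulary):
    """Encode one report as a 0/1 vector over the chosen vocabulary."""
    present = imported_functions(report)
    return [1 if name in present else 0 for name in vocabulary]
```

With `n=1000`, each sample becomes a 1000-element binary vector that can be fed to a classifier.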
Malware and benign Windows PE Cuckoo reports
This dataset is part of my PhD research on malware detection and classification using Deep Learning. It contains static analysis data (PE Section Headers of the .text, .code and CODE sections) extracted from the 'pe_sections' elements of Cuckoo Sandbox reports. PE malware examples were downloaded from virusshare.com. PE goodware examples were downloaded from portableapps.com and from Windows 7 x86 directories.
This dataset is part of our research on malware detection and classification using Deep Learning. It contains 42,797 malware API call sequences and 1,079 goodware API call sequences. Each API call sequence is composed of the first 100 non-repeated consecutive API calls associated with the parent process, extracted from the 'calls' elements of Cuckoo Sandbox reports.
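One plausible reading of "the first 100 non-repeated consecutive API calls" is: collapse every run of identical consecutive calls into a single call, then keep the first 100. The sketch below follows that reading. The function name `api_sequence` and the assumption that each call is a dictionary with an `api` key are illustrative and not confirmed by the dataset description.

```python
from itertools import groupby, islice


def api_sequence(calls, length=100):
    """Collapse runs of the same API name and keep the first `length` names.

    `calls` is assumed to be the list of call records of the parent
    process, each a dict with an "api" key (hypothetical layout).
    """
    names = (call["api"] for call in calls)
    # groupby yields one key per run of equal consecutive items.
    collapsed = (name for name, _ in groupby(names))
    return list(islice(collapsed, length))
```

For example, the calls `NtOpenFile, NtOpenFile, NtClose, NtOpenFile` would become `NtOpenFile, NtClose, NtOpenFile`: only immediate repetitions are dropped, so the same API may reappear later in the sequence.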
The summary page contains details that would otherwise be gathered by conducting static malware analysis. It highlights the file sizes, hashes, and more. The right side of the summary page shows a score that reflects how malicious the tool deems the file. The score ranges from zero, which means the document/file is benign or harmless, to ten, for highly malicious files.
Cuckoo Sandbox also highlights specific details of the analysis, such as when the file was analyzed, the time taken, and the type of routing used. The summary page also shows interesting malware signatures in the further details section. Signatures are color-coded blue, yellow, and red. A blue signature shows that the file is benign, yellow signatures indicate medium risk, and red signatures mean that Cuckoo Sandbox has identified malicious activity, such as keylogging or leaking an IP address. At the end of the summary page, Cuckoo Sandbox shows screenshots from the guest machine on which the malware ran. These screenshots are useful in analyzing ransomware, since most ransom messages are displayed on screen.
The network analysis page has multiple tabs that filter reports based on specific network traffic protocols. Malware analysts can filter network analysis reports to TCP, DNS, ICMP, IRC, UDP, and HTTP traffic generated by malware. The platform also allows analysts to download PCAP from the page.
State-of-the-art antivirus approaches recommend extracting features from suspicious files preventively, before execution [6]. The executable file is submitted to a reverse engineering process. The assembly code associated with the executable can then be studied to investigate the malicious intent of the suspicious file. Features from the assembly code are used as input attributes to the artificial neural network that acts as the classifier. Neural network-based antiviruses achieve an average performance of more than 90% in distinguishing benign and malware samples [6].
Then, the malicious behaviors originating from the suspect files serve as input attributes to neural networks. In this work, we employ an mELM (morphological ELM) neural network, an ELM with a hidden-layer kernel inspired by the erosion and dilation morphological operators from image processing. The paper claims that the morphological kernel can fit any decision boundary. Mathematical morphology studies the shape of objects in an image using the set-theoretic operations of intersection and union [8]. The morphological operations therefore naturally detect the shape of the objects present in the image [8]. By interpreting the decision boundary of the neural network as an n-dimensional image, where n is the number of extracted features, our morphological ELM can naturally detect and model the mapping to different types of n-dimensional regions in any machine learning problem. Our antivirus achieves an average performance of 91.58% in distinguishing between benign and malware Java applications, with an average training time of 52.36 seconds.
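To make the idea of a morphological hidden layer more concrete, the sketch below uses the textbook grayscale definitions of dilation (maximum of sums) and erosion (minimum of differences) as neuron activations. This is an assumption for illustration only: the exact kernel used by mELM is not given in the text above and may differ, and the function names are hypothetical.

```python
def dilation_neuron(x, w):
    """Grayscale dilation of input x by weights w: max_j (x_j + w_j)."""
    return max(xi + wi for xi, wi in zip(x, w))


def erosion_neuron(x, w):
    """Grayscale erosion of input x by weights w: min_j (x_j - w_j)."""
    return min(xi - wi for xi, wi in zip(x, w))


def hidden_layer(x, weights, neuron):
    """Apply one morphological neuron per weight vector in `weights`."""
    return [neuron(x, w) for w in weights]
```

In a standard ELM, the hidden-layer weights are drawn at random and only the output weights are fitted, typically by least squares over the hidden activations; that fitting step is omitted here.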
VirusTotal can return one of three outcomes for each file. In the first, the antivirus detects the maliciousness of the suspicious file. Within the proposed experimental environment, all submitted samples are malware documented by incident responders, so the antivirus scores a hit when it recognizes the investigated file as malicious. Malware detection shows that the antivirus offers robust support against digital invasions. In the second outcome, the antivirus certifies the file as benign. In the proposed study, a benign verdict is a false negative, as all the samples sent are malicious: the investigated file is malware, but the antivirus mistakenly attests that it is benign. In the third outcome, the antivirus gives no analysis of the suspect application. The omission shows that the investigated file was never evaluated by the antivirus and so could not be assessed in real time. The omission of a diagnosis points to the antivirus's limitations on large-scale services.
Malware identification ranged from 0 to 99.10%, depending on the antivirus. On average, the 86 antiviruses identified 34.95% of the examined malware, with a standard deviation of 40.92%. The high standard deviation shows that the detection of malicious files can vary sharply depending on the chosen antivirus. Protection against digital intrusions therefore depends on choosing a robust antivirus with an extensive and up-to-date blacklist. On average, antiviruses produced false negatives in 33.90% of the cases, with a standard deviation of 40.45%. Attesting malware as benign can lead to unrecoverable damage: a person or institution, for instance, may come to trust an application that is in fact malware. In addition, approximately 31.39% of antiviruses did not give an opinion on any of the 998 malicious samples. On average, antiviruses omitted a diagnosis in 31.15% of the cases, with a standard deviation of 45.61%. The omission of a diagnosis highlights the limitation of antiviruses in recognizing malware in real time.
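The per-antivirus rates and their averages can be computed along the following lines. This is a minimal sketch under stated assumptions: the verdict labels, the function names, and the input layout (one list of verdicts per antivirus, all samples being malware) are hypothetical, and the population standard deviation is used because the text does not say which variant the authors applied.

```python
from statistics import mean, pstdev

# Hypothetical labels for the three possible outcomes per sample.
DETECTED, BENIGN, OMITTED = "detected", "benign", "omitted"
OUTCOMES = (DETECTED, BENIGN, OMITTED)


def outcome_rates(verdicts):
    """Percentage of each outcome for one antivirus (non-empty list)."""
    total = len(verdicts)
    return {label: 100.0 * verdicts.count(label) / total for label in OUTCOMES}


def summarize(verdicts_by_antivirus):
    """Mean and standard deviation of each rate across antiviruses."""
    rates = [outcome_rates(v) for v in verdicts_by_antivirus.values()]
    summary = {}
    for label in OUTCOMES:
        values = [r[label] for r in rates]
        # pstdev is an assumption; swap in statistics.stdev if the
        # sample standard deviation is wanted instead.
        summary[label] = (mean(values), pstdev(values))
    return summary
```

Because every sample is malware, the `benign` rate is the false-negative rate, and the three rates of each antivirus sum to 100%.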
Kalash et al. (2018) proposed a deep learning approach to malware classification based on VGG-16 [20] and achieved an average accuracy of 98.52%. VGG-16 is a convolutional neural network that is 16 layers deep, with an image input size of 224 × 224. The pretrained network can classify images into 1000 object categories, such as keyboards, mice, pencils, and many animals. The last fully connected layer of VGG-16, which has 1000 outputs, is replaced by a fully connected softmax layer with two outputs (benign, malware). Herein, we replicate the antivirus made by Kalash et al. (2018). The input image is created from a surface plot of the features produced by our dynamic feature extraction.
The present paper presents REJAFADA (Retrieval of JAR Files Applied to Dynamic Analysis), a dataset that allows files with the JAR extension to be classified as benign or malware. REJAFADA consists of 998 malware JAR samples and 998 benign JAR samples. The dataset is therefore suitable for training artificial intelligence (AI) classifiers, since both classes (malware and benign) contain the same number of JAR files. The goal is that classifiers biased toward a particular class do not have their success rates inflated.
With respect to malware, REJAFADA extracted malicious JAR files from VirusShare, a repository of malware samples that gives security researchers, incident responders, forensic analysts, and the morbidly curious access to samples of live malicious code [24]. To catalog the 998 JAR malware samples, it was necessary to acquire and analyze, with a script written by the authors, approximately 3 million malware samples from the reports updated daily by VirusShare.
With respect to benign JAR files, the samples were collected from application repositories such as Java2s.com [25] and findar.com [26]. All of the benign files were audited by VirusTotal, so the benign JAR files in REJAFADA were attested as benign by the world's main commercial antiviruses. The results of the VirusTotal audit of the benign and malware JAR files are available for consultation at the REJAFADA virtual address [11]. The REJAFADA dataset was created so that the proposed methodology can be fully replicated by others in future work. REJAFADA is therefore freely available with all its samples, both benign and malware:
During feature mining, some behaviors audited by the sandbox are ignored. The mining criterion eliminates features that concern only a single JAR file, for example process IDs, process names, and MD5 and SHA hashes. After feature mining, the relevant behaviors of the JAR samples serve as machine learning input attributes; specifically, artificial neural networks are employed as classifiers. The goal is to group the JAR samples into two classes: benign and malware. The classification stage is explained in detail in Sect. 5.2. Classification results are described in Chapter 6.
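The elimination of sample-specific features can be sketched as a simple key filter followed by alignment into fixed-length vectors. The key names in `IDENTIFIER_KEYS`, the function names, and the choice to fill missing features with 0 are all hypothetical; the actual field names depend on the sandbox report format, which is not given here.

```python
# Hypothetical names of fields that identify a single sample rather
# than describe its behavior.
IDENTIFIER_KEYS = {"pid", "process_name", "md5", "sha1", "sha256"}


def mine_features(sample):
    """Drop identifier fields from one sample's feature dictionary."""
    return {k: v for k, v in sample.items() if k not in IDENTIFIER_KEYS}


def to_matrix(samples):
    """Align mined samples into rows over a shared, sorted feature list.

    Features missing from a sample are filled with 0 (an assumption).
    """
    mined = [mine_features(s) for s in samples]
    names = sorted({k for s in mined for k in s})
    rows = [[s.get(name, 0) for name in names] for s in mined]
    return names, rows
```

Each row then has one entry per retained feature, which is the shape a neural network input layer expects.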
Our antivirus employs artificial neural networks as classifiers. To choose the best neural network architecture, we evaluate different learning functions and initial configurations, including settings that require a larger number of calculations, such as increasing the number of neurons in the intermediate layer. The input layer has one neuron per feature extracted from monitoring the JAR file in a controlled environment. Therefore, the classifiers must have an input layer containing 6,824 neurons. The output layer has two neurons, corresponding to the benign and malware classes.