Pattern recognition and machine learning techniques for cyber security

Shaukat, Kamran

Title: Pattern recognition and machine learning techniques for cyber security
Creator: Shaukat, Kamran
Relation: University of Newcastle Research Higher Degree Thesis
Resource Type: thesis
Date: 2023
Description: Research Doctorate - Doctor of Philosophy (PhD)
Description: The world of cyber security is continually threatened by the evolution and increasing sophistication of malware variants, presenting an urgent need to develop more advanced, efficient, robust and trustworthy malware detection methods. Traditional techniques have proven inadequate, as they often struggle to adapt and respond to new and emerging threats. Traditional anti-malware approaches require domain experts, incur feature-extraction overhead, and are resource-intensive and time-consuming. This highlights the necessity for a dynamic and adaptable approach to malware detection that can respond to the shifting landscape of cyber threats. The advent of artificial intelligence (AI) and machine learning (ML) techniques has paved the way to fill this gap by providing automated, reliable and robust solutions for malware detection. Despite these advantages, several challenges, such as adversarial attacks, zero-day and evolving attacks, and imbalanced data, need to be addressed for machine learning to reach its full potential for malware detection. To address these pressing issues, this study proposes novel, robust, efficient malware detection solutions that integrate deep learning with machine learning. This study has performed a detailed systematic literature review of 312 articles that use machine learning techniques for cyber security. In addition, a comparative performance analysis of state-of-the-art machine learning techniques for cyber security is conducted. The literature review and comparative analysis demonstrated that learning-based techniques outperformed conventional solutions and highlighted the key challenges of learning-based solutions in this domain. These challenges establish a base for the research objectives of this study. First, this study has developed a novel malware detector against adversarial attacks. Learning-based malware detectors are vulnerable to adversarial attacks.
An adversarial attack manipulates a file so that the resulting file evades detection. This study has developed a robust malware detector using adversarial training to combat these adversarial attacks. In adversarial training, data samples are manipulated using an adversarial attack to generate the poisoned samples. The proposed detector is trained with these poisoned samples to harden it against evasion attacks (i.e., adversarial attacks at testing time). To consider the characteristics of multiple attacks, the proposed detector is trained with a mixture of multiple attacks. The performance of the proposed detector is compared with 10 state-of-the-art malware detectors. The results demonstrated that the robustness of the proposed detector is significantly enhanced compared to the state-of-the-art approaches. Furthermore, the generalisation of the proposed detector is validated on other benchmark datasets. The proposed detector outperformed the others, achieving the lowest evasion rate (a.k.a. adversarial loss, which is the percentage of the dataset that the respective adversarial attack causes to be misclassified) of 12% on the VirusShare dataset and 18% on the VXHeaven dataset (the lower the evasion rate, the more robust the detector is against an adversarial attack). Second, this study has proposed a framework to handle the imbalanced data issue. The method hinges on three key steps: initially, portable executable (PE) files are transformed into colour images, allowing for a visual representation of the data that is both intuitive and comprehensive. Following this transformation, deep features are extracted from these images using a fine-tuned deep learning model, capturing detailed information that can be used for subsequent analysis. The final step utilises a machine learning model to detect the malware, employing the extracted features to identify and classify potential malware.
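The first of the three steps above, turning a PE file into a colour image, can be illustrated with a minimal sketch. The abstract does not state how bytes are mapped to pixels, so the scheme below (three consecutive bytes per RGB pixel, fixed row width, zero padding) is an assumption, and the function name bytes_to_rgb_image and the toy sample are hypothetical.

```python
from math import ceil


def bytes_to_rgb_image(data: bytes, width: int = 4):
    """Map raw file bytes to rows of (R, G, B) pixels.

    Assumed scheme (not taken from the thesis): every three consecutive
    bytes form one pixel, each row holds `width` pixels, and the tail of
    the last row is zero-padded.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    row_bytes = width * 3
    # Always produce at least one row, even for empty input.
    n_rows = max(1, ceil(len(data) / row_bytes))
    padded = data.ljust(n_rows * row_bytes, b"\x00")
    image = []
    for r in range(n_rows):
        row = padded[r * row_bytes:(r + 1) * row_bytes]
        # Slicing bytes and converting to a tuple yields integer channels.
        image.append([tuple(row[i:i + 3]) for i in range(0, row_bytes, 3)])
    return image


# Toy stand-in for file contents: the "MZ" magic followed by ten bytes.
sample = b"MZ" + bytes(range(10))
image = bytes_to_rgb_image(sample, width=2)
```

In a full pipeline the resulting pixel grid would be resized to the input size of the feature extractor; that step is omitted here.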
This study asserts that applying deep learning for feature extraction, rather than end-to-end classification, greatly improves the system's efficiency. Extensive experiments are performed to analyse the effect of using data augmentation to balance the data. In addition to presenting the methodology and its unique advantages, this study also provides an in-depth comparative analysis of the proposed framework and existing malware detection methods using 15 deep learning and 12 machine learning models. The framework is tested rigorously, and the results are contrasted with the performance of established methods, illustrating significant improvements in detection accuracy. The optimal combination of RegNetY320 for feature extraction and support vector machines for final detection yielded a malware detection accuracy of 99.06%. Third, considering the exponential growth of malware variants, this study has proposed a framework to detect zero-day, polymorphic and evolving malware attacks. The proposed framework integrates deep learning and machine learning for malware detection. The framework further uses feature selection to reduce the computational cost and make the detector more lightweight and efficient. The proposed approach uses a one-class SVM that constructs an enclosing hypersphere around the benign features. Anything outside the boundary is declared an anomaly. The detection effectiveness of the proposed framework is compared with state-of-the-art deep learning models and existing work. The results demonstrated the superiority of the proposed approach in detecting zero-day and polymorphic attacks. The statistical test and ablation study further strengthened our claim and validated the significance of the proposed framework. This study also acknowledges the limitations inherent in the proposed frameworks.
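The boundary idea behind the zero-day detector (enclose the benign features, flag everything outside) can be illustrated with a deliberately simplified sketch. This is not a one-class SVM and not the thesis implementation: it is a centroid-and-radius stand-in that only shows the geometry, and the names fit_hypersphere, is_anomaly and the toy feature vectors are hypothetical.

```python
from math import dist


def fit_hypersphere(benign, quantile=0.95):
    """Return (centre, radius) of a sphere around benign feature vectors.

    Simplified stand-in for a one-class boundary: the centre is the mean
    of the benign vectors and the radius is the distance that covers
    roughly `quantile` of them.
    """
    if not benign:
        raise ValueError("need at least one benign sample")
    dims = len(benign[0])
    centre = tuple(sum(v[d] for v in benign) / len(benign) for d in range(dims))
    distances = sorted(dist(v, centre) for v in benign)
    # Clamp the index so quantile=1.0 selects the farthest benign sample.
    index = min(len(distances) - 1, int(quantile * len(distances)))
    return centre, distances[index]


def is_anomaly(sample, centre, radius):
    """Flag a sample that falls strictly outside the benign sphere."""
    return dist(sample, centre) > radius


# Toy benign features (assumed 2-D for readability).
benign_features = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
centre, radius = fit_hypersphere(benign_features, quantile=1.0)
```

Because only benign samples are needed to fit the boundary, a previously unseen malware family can still be flagged if its features fall outside the sphere; a real one-class SVM learns a more flexible boundary in a kernel-induced feature space.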
One of the primary challenges involves the high memory requirements for the temporary storage of features extracted using deep learning models. Furthermore, the impact of transformations on the PE file images and their potential implications for detection effectiveness remain uncertain. This study highlights these challenges as avenues for future research, suggesting the exploration of feature reduction and selection techniques, incremental learning, and specialised data balancing techniques. Despite the limitations, the significant improvements over existing methods and the potential avenues for future research make the proposed frameworks a promising step forward in the fight against malware. The effectiveness of the proposed framework in other contexts, such as Android and Internet of Things (IoT) applications, is also an area of interest for future studies. Finally, the framework's ability to adapt to evolving malware threats makes it a potent tool for real-world applications, particularly in the defence industry, where efficient, robust, and effective malware detection is paramount.
Subject: machine learning; deep learning; cyberspace; convolutional neural network; pre-trained models; cybersecurity; cyber security; transfer learning; malware detection; malware; imbalanced dataset; adversarial attacks; cyber defence
Identifier: http://hdl.handle.net/1959.13/1514233
Identifier: uon:56829
Language: eng
