Generating realistic cyber data for training and evaluating machine learning classifiers for network intrusion detection systems
Cyberspace operations, in conjunction with artificial intelligence and machine learning enhanced cyberspace infrastructure, make it possible to connect sensors directly to shooters independent of human control. These technologies serve as the pivot around which cyber data from the military’s Internet of Battlefield Things, for example, will be turned into actionable insight and knowledge and, ultimately, an information advantage for the military. As such, network intrusion detection systems must detect, evaluate, and respond to malicious cyber traffic at machine speed. Generative adversarial networks and variational autoencoders are fit as generative models with labeled cyber data from a real military enterprise network. These generative models are used to create realistic, synthetic cyber data. A combination of real and synthetic cyber data sets are then used to train several machine learning models for network intrusion detection. Purely synthetic data is shown to be statistically similar to the real data. There is no statistically significant difference in the performance of classifiers trained with real data versus a combination of real and synthetic data; however, classifiers trained with only synthetic data underperformed. To avoid a decrease in intrusion detection performance, classifiers must be trained with at least 15% real data.
Marc Chalé, Nathaniel D. Bastian, Generating realistic cyber data for training and evaluating machine learning classifiers for network intrusion detection systems, Expert Systems with Applications, Volume 207, 2022, 117936, ISSN 0957-4174, https://doi.org/10.1016/j.eswa.2022.117936.