Generating genetic engineering linked indicator datasets for machine learning classifier training in biosecurity
As methods for gene synthesis and genetic engineering have become more advanced and more accessible, the concern that viruses and bacteria will be designed with the express intention of harming humans has received increased attention. If such biological weapons are deployed, the security community needs tools to rapidly recognize the threat and identify the responsible parties. A key question, therefore, is whether a biological threat is man-made. Currently, experts can qualitatively assess whether specific genetic sequences are natural or man-made, but few objective criteria exist for characterizing the degree to which a sequence has been engineered. Progress has also recently been made on attributing an engineered gene sequence to its lab of origin using machine learning. However, the task of analyzing naturally occurring genetic sequences to automatically detect outliers that may have been genetically engineered has received comparatively little attention. This work proposes a method for generating a dataset of natural and engineered sequences that can be used as input for training machine learning classifiers to automatically detect human engineering in gene sequence data.
Christopher Painter and Nathaniel D. Bastian "Generating genetic engineering linked indicator datasets for machine learning classifier training in biosecurity", Proc. SPIE 11746, Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications III, 1174624 (12 April 2021); https://doi.org/10.1117/12.2587844