Jack W. Stokes, John C. Platt, Joseph Kravis, and Michael Shilman
This paper proposes using active learning combined with rare class discovery and uncertainty identification to statistically train a network traffic classifier. For ingress traffic, a classifier can be trained for a network intrusion detection or prevention system (IDS/IPS), while a classifier trained on egress traffic can detect malware on a corporate network. Active learning selects interesting traffic to be shown to a security expert for labeling. Unlike previous statistical misuse or anomaly-detection-based approaches to training an IDS, active learning substantially reduces the number of labels required from an expert to reach an acceptable level of accuracy and coverage. Our system defines interesting traffic in two ways, based on two goals for the system. First, the system is designed to discover new categories of traffic by showing the analyst examples that do not fit a pre-existing model of a known category of traffic. Second, the system is designed to accurately classify known categories of traffic by requesting labels for examples that it cannot classify with high certainty. Combining these two goals overcomes many problems associated with earlier anomaly-detection-based IDSs. Once trained, the system can be run as a fixed classifier with no further learning. Alternatively, it can continue to learn by labeling data on a particular network. In either case, the classifier is efficient enough to run in real time for an IPS. We tested the system on the KDD-Cup-99 Network Intrusion Detection dataset, where the algorithm identifies more rare classes with approximately half the number of labels required by previous active-learning-based systems. We have also used the algorithm to find previously unknown malware on a large corporate network from a set of firewall logs.
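The two selection goals described above can be illustrated with a minimal sketch. It is not the implementation evaluated in the paper: the function name `select_for_labeling`, the use of a top-two probability margin as the uncertainty measure, and the use of a per-sample log-likelihood as the measure of fit to a known category are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def select_for_labeling(
    probs: Sequence[Sequence[float]],
    log_likelihoods: Sequence[float],
    per_class: int = 1,
) -> List[int]:
    """Pick sample indices to show an analyst (hypothetical helper).

    probs[i] holds the class probabilities for sample i (two or more classes).
    log_likelihoods[i] is the log-likelihood of sample i under the model of
    its predicted class; low values mean the sample fits that class poorly.
    """
    # Group samples by their predicted (most probable) class.
    by_class: Dict[int, List[int]] = {}
    for i, p in enumerate(probs):
        predicted = max(range(len(p)), key=p.__getitem__)
        by_class.setdefault(predicted, []).append(i)

    def margin(i: int) -> float:
        # Assumed uncertainty measure: gap between the two largest probabilities.
        top = sorted(probs[i], reverse=True)
        return top[0] - top[1]

    chosen: List[int] = []
    for _, members in sorted(by_class.items()):
        # Goal 2: samples the classifier cannot classify with high certainty.
        uncertain = sorted(members, key=margin)[:per_class]
        # Goal 1: samples that fit the known category worst (candidate new classes).
        anomalous = sorted(members, key=lambda i: log_likelihoods[i])[:per_class]
        for i in uncertain + anomalous:
            if i not in chosen:
                chosen.append(i)
    return chosen


if __name__ == "__main__":
    example_probs = [[0.9, 0.1], [0.55, 0.45], [0.2, 0.8], [0.95, 0.05]]
    example_loglik = [-1.0, -2.0, -0.5, -9.0]
    print(select_for_labeling(example_probs, example_loglik))
```

In this toy input, sample 1 is selected because its prediction is nearly a coin flip, and sample 3 is selected because it is confidently classified yet fits its predicted category poorly, which is the kind of example that might belong to a previously unseen class.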