Automatically Extracting Fields from Unknown Network Protocols

  • Karthik Gopalratnam,
  • ,
  • John Dunagan,
  • Helen Wang

There are thousands of network protocols in active use on the Internet. System administrators often need to extract information from particular fields in such protocols without having sufficient information or time to parse the packets programmatically. We propose an active learning framework for performing this extraction on an unknown protocol. The user first presents the system with a small number of labeled instances. The system then automatically generates a large pool of features and negative examples, and uses a boosting approach for feature selection and classifier combination. It displays its results so that the user can correct them or add new examples, and the process iterates. In our preliminary experiments on DNS queries and responses, we achieve less than 0.1% generalization error using only a handful of labeled examples, and thus a minimum of user effort. This translates to perfect retrieval from 86% of unlabeled packets.
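To make the pipeline described above concrete, the following is a minimal sketch in Python, not the authors' implementation. Everything specific in it is an assumption made for illustration: the toy packet format produced by `make_packet` (a stand-in, not DNS), the feature templates in `features`, the rule that every unlabeled span in a labeled packet is a negative example, and the use of discrete AdaBoost over single-feature weak learners with class-balanced initial weights. The interactive correction loop is omitted.

```python
import math

def make_packet(ident, name, qtype):
    """Hypothetical toy format: 2-byte id, 1-byte length, name bytes, 1-byte type."""
    return bytes([ident >> 8, ident & 0xFF, len(name)]) + name + bytes([qtype])

def candidates(packet, max_len=8):
    """Every (start, length) span of up to max_len bytes is a candidate field."""
    return [(s, l) for s in range(len(packet))
            for l in range(1, max_len + 1) if s + l <= len(packet)]

def features(packet, span):
    """Assumed feature templates; returns the set of features that fire."""
    s, l = span
    prev = packet[s - 1] if s > 0 else -1
    fired = {("start_eq", s), ("len_eq", l), ("prev_byte", prev),
             ("tail_eq", len(packet) - (s + l))}
    if prev == l:  # the byte before the span looks like a length prefix
        fired.add(("prev_eq_len",))
    return fired

def build_examples(labeled):
    """The labeled span is positive; every other span in that packet is negative."""
    return [(features(p, c), 1 if c == span else -1)
            for p, span in labeled for c in candidates(p)]

def vote(feat, polarity, fired):
    """Weak learner: +/-1 depending on whether a single feature fired."""
    return polarity * (1 if feat in fired else -1)

def train(examples, rounds=3):
    """Discrete AdaBoost; each round selects one feature, so boosting
    doubles as feature selection."""
    pool = sorted(set().union(*(f for f, _ in examples)), key=repr)
    n_pos = sum(1 for _, y in examples if y == 1)
    n_neg = len(examples) - n_pos
    # Assumption: class-balanced initial weights, since negatives vastly outnumber positives.
    w = [0.5 / (n_pos if y == 1 else n_neg) for _, y in examples]
    model = []
    for _ in range(rounds):
        best = None
        for feat in pool:
            for pol in (1, -1):
                err = sum(wi for wi, (f, y) in zip(w, examples)
                          if vote(feat, pol, f) != y)
                if best is None or err < best[0]:
                    best = (err, feat, pol)
        err, feat, pol = best
        err = min(max(err, 1e-9), 1 - 1e-9)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, feat, pol))
        # Upweight the examples this weak learner got wrong, then renormalize.
        w = [wi * math.exp(-alpha * y * vote(feat, pol, f))
             for wi, (f, y) in zip(w, examples)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def score(model, fired):
    return sum(alpha * vote(feat, pol, fired) for alpha, feat, pol in model)

def extract(model, packet):
    """Return the bytes of the highest-scoring candidate span."""
    s, l = max(candidates(packet), key=lambda c: score(model, features(packet, c)))
    return packet[s:s + l]

# Three user-labeled packets: (packet, (start, length)) of the name field.
labeled = [
    (make_packet(0x0102, b"mail", 1), (3, 4)),
    (make_packet(0x0301, b"www", 1), (3, 3)),
    (make_packet(0x0205, b"example", 2), (3, 7)),
]
model = train(build_examples(labeled))
result = extract(model, make_packet(0x0403, b"ftp01", 1))
print(result)
```

In this sketch the three labeled packets yield over a hundred automatically generated negative spans, and the boosted classifier is used to rank the candidate spans of an unseen packet rather than to threshold them. In an interactive setting, a wrong extraction would be corrected by the user, appended to `labeled`, and the model retrained.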