Feature engineering combined with machine learning and rule-based methods for structured information

Yan Xu, Kai Hong, Junichi Tsujii, and Eric Chang

Abstract

Objective: A system that translates narrative text in the medical domain into a structured representation is in great demand. The system performs three sub-tasks: concept extraction, assertion classification, and relation identification.

Design: The overall system consists of five steps: (1) pre-processing sentences, (2) marking noun phrases (NPs) and adjective phrases (APs), (3) extracting concepts, using a dosage-unit dictionary to dynamically switch between two models based on Conditional Random Fields (CRF), (4) classifying assertions based on voting of five classifiers, and (5) identifying relations using normalized sentences with a set of effective discriminating features.
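
As an illustration of step (4), the sketch below shows one simple way to combine five classifier outputs by plurality voting. The abstract does not specify the voting scheme, so the function name `vote`, the tie-breaking rule (the label seen first wins), and the example labels are all assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter


def vote(predictions):
    """Return the label predicted by the most classifiers.

    Assumption: ties are broken in favour of the label that appears
    first in `predictions`; the paper's actual rule is not stated here.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)  # keeps first-seen order of labels
    top = max(counts.values())
    # Collect every label that reached the highest count, in first-seen order.
    tied = [label for label, n in counts.items() if n == top]
    return tied[0]


# Hypothetical outputs of five assertion classifiers for one concept.
example = ["present", "possible", "present", "absent", "present"]
winner = vote(example)
```

With the hypothetical inputs above, three of the five classifiers agree, so that label is returned. A weighted vote or a learned combiner would be drop-in alternatives to this plain count.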

Measurements: Macro-averaged and micro-averaged precision, recall and F-measure were used to evaluate results.
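
To make the difference between the two averaging schemes concrete, here is a minimal sketch that computes both from per-class counts. The helper names (`prf`, `micro_macro`) and the example counts are hypothetical. Macro-averaged F is taken here as the mean of per-class F-measures, which is one common convention and is assumed rather than taken from the paper.

```python
def prf(tp, fp, fn):
    """Precision, recall and F-measure (F1) from raw counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def micro_macro(counts):
    """counts maps class name -> (true pos, false pos, false neg).

    Micro-averaging pools the counts over all classes before computing
    the metrics; macro-averaging computes the metrics per class and then
    takes their unweighted mean (assumed convention).
    """
    per_class = [prf(*c) for c in counts.values()]
    n = len(per_class)
    macro = tuple(sum(m[i] for m in per_class) / n for i in range(3))
    pooled = [sum(c[i] for c in counts.values()) for i in range(3)]
    micro = prf(*pooled)
    return micro, macro


# Hypothetical counts for two concept classes.
counts = {"problem": (8, 2, 2), "treatment": (1, 1, 3)}
micro, macro = micro_macro(counts)
```

In this made-up example the rare, poorly recognised class pulls the macro-averaged scores down more than the micro-averaged ones, because macro-averaging weights every class equally regardless of its frequency.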

Results: The performance is competitive with the state-of-the-art systems with micro-averaged F-measure of 0.8489 for concept extraction, 0.9392 for assertion classification and 0.7326 for relation identification.

Conclusions: The system exploits an array of common features and achieves state-of-the-art performance. Prudent feature engineering sets the foundation of our systems. In concept extraction, we demonstrated that switching models, one of which is especially designed for telegraphic sentences, improved extraction of the treatment concept significantly. In assertion classification, a set of features derived from a rule-based classifier proved to be effective for classes such as conditional and possible, which would suffer from data scarcity in conventional machine-learning methods. In relation identification, we use a two-staged architecture, the second stage of which applies pairwise classifiers to possible candidate classes. This architecture significantly improves performance.

Details

Publication type: Article
Published in: Journal of the American Medical Informatics Association