C. Chelba and Alex Acero
July 2004
A novel technique for maximum “a posteriori”
(MAP) adaptation of maximum entropy (MaxEnt)
and maximum entropy Markov models (MEMM) is
presented.
The technique is applied to the problem of recovering
the correct capitalization of uniformly cased
text: a “background” capitalizer trained on 20M words
of Wall Street Journal (WSJ) text from 1987 is
adapted to two Broadcast News (BN) test sets
from 1996, one containing ABC Primetime Live text
and the other NPR Morning News/CNN Morning Edition
text.
The “in-domain” performance of the WSJ capitalizer
is 45% better than that of the 1-gram baseline
when evaluated on a test set drawn from WSJ
1994. On the mismatched “out-of-domain”
test data, the 1-gram baseline is outperformed
by 60%; the improvement brought by the
adaptation technique using a very small amount of
matched BN data (25-70k words) is about 20-25%
relative. Overall, an automatic capitalization error rate
of 1.4% is achieved on BN data.
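
For readers unfamiliar with the comparison point, the following is a minimal sketch of what a 1-gram capitalization baseline might look like. It assumes the baseline simply restores each word to the surface form it most often took in the training text; the abstract does not spell this out, and the function names and toy data are hypothetical.

```python
from collections import Counter, defaultdict

def train_unigram_capitalizer(cased_tokens):
    """Map each lowercased word to its most frequent cased form (assumed baseline)."""
    counts = defaultdict(Counter)
    for tok in cased_tokens:
        counts[tok.lower()][tok] += 1
    return {low: forms.most_common(1)[0][0] for low, forms in counts.items()}

def capitalize(model, lowercased_tokens):
    """Restore case word by word; unseen words are left unchanged."""
    return [model.get(tok, tok) for tok in lowercased_tokens]

# Toy training text (illustrative only, not data from the paper).
train = "the Wall Street Journal said the street was quiet on Wall Street".split()
model = train_unigram_capitalizer(train)
restored = capitalize(model, "the wall street journal".split())
print(restored)
```

Such a baseline ignores context entirely, so "street" is capitalized everywhere once "Street" dominates the counts; a context-sensitive tagger such as the MEMM capitalizer described above is meant to address that limitation.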
In Proc. of EMNLP
| Type | Inproceedings |