SVD-based Universal DNN Modeling for Multiple Scenarios

Speech recognition scenarios (also called tasks) differ from one another in acoustic transducers, acoustic environments, and speaking styles. Building one acoustic model per task is common practice in industry. However, this limits the sharing of training data across scenarios and thus may not yield the highest possible accuracy. Based on the deep neural network (DNN) technique, we propose building a universal acoustic model for all scenarios by using all the data together. This brings two advantages: 1) it leverages more data sources to improve recognition accuracy, and 2) it substantially reduces service deployment and maintenance costs. We achieve this by extending the singular value decomposition (SVD) structure of DNNs. The data from all scenarios are first used to train a single SVD-DNN model. Then a series of scenario-dependent linear square matrices are added on top of each SVD layer and updated with only the data of the corresponding scenario. At recognition time, a flag identifies the scenario and guides the recognizer to use the matching scenario-dependent matrices together with the scenario-independent matrices of the universal DNN for acoustic score evaluation. In our experiments on Microsoft Winphone/Skype/Xbox data sets, the universal DNN model outperforms traditionally trained isolated models, with up to 15.5% relative word error rate reduction.
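The layer structure described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the class name `UniversalSVDLayer`, the helper `svd_factorize`, the layer sizes, and the scenario keys are hypothetical, and the sketch assumes that each scenario-dependent square matrix acts on the low-rank bottleneck output of the SVD layer and starts as the identity. Nonlinearities and training are omitted.

```python
import numpy as np

def svd_factorize(W, k):
    """Approximate W (m x n) by A (m x k) @ B (k x n) using a rank-k truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :k] * s[:k]   # fold the top-k singular values into the left factor
    B = Vt[:k, :]
    return A, B

class UniversalSVDLayer:
    """One linear layer: shared low-rank factors plus one k x k matrix per scenario."""

    def __init__(self, W, k, scenarios):
        # Scenario-independent factors, shared by all scenarios.
        self.A, self.B = svd_factorize(W, k)
        # Scenario-dependent square matrices (assumption: initialized to identity,
        # so that the layer initially behaves like the shared SVD layer).
        self.S = {name: np.eye(k) for name in scenarios}

    def forward(self, x, scenario):
        # The scenario flag selects which square matrix is applied.
        bottleneck = self.B @ x
        adapted = self.S[scenario] @ bottleneck
        return self.A @ adapted

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 6))          # hypothetical full weight matrix
layer = UniversalSVDLayer(W, k=3, scenarios=["winphone", "skype", "xbox"])
x = rng.standard_normal(6)               # hypothetical input vector
y_skype = layer.forward(x, "skype")
```

In this sketch, adapting to one scenario would mean updating only `layer.S[scenario]` with that scenario's data while `A` and `B` stay fixed. Each scenario adds only k x k parameters per layer, which is small when k is much smaller than the layer width.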