Voice Conversion with Neural Network

Sequence Error (SE) Minimization Training of Neural Network for Voice Conversion

Neural network (NN) based voice conversion, which employs a nonlinear function to map features from a source speaker to a target speaker, has been shown to outperform the GMM-based voice conversion approach. However, NN-based voice conversion still has a limitation to overcome: the NN is trained on a frame error (FE) minimization criterion, with its weights adjusted to minimize the squared error over the whole source-target stereo training data set. In this paper, we borrow the idea of sentence-level optimization from minimum generation error (MGE) training in HMM-based TTS synthesis, and change frame error (FE) minimization to sequence error (SE) minimization in NN training for voice conversion. The conversion error over a training sentence from a source speaker to a target speaker is minimized via a gradient descent-based back propagation (BP) procedure. Experimental results show that speech converted by an NN that is first trained with frame error minimization and then refined with sequence error minimization sounds subjectively better than speech converted by an NN trained with frame error minimization only. Scores on both naturalness and similarity to the target speaker are improved.
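The two-stage training described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes time-aligned source/target frames, a single tanh hidden layer, and a sequence error defined simply as the sum of squared errors over all frames of one sentence. Any trajectory-generation step of the kind used in MGE training is omitted, and all function names, layer sizes, and learning rates are hypothetical.

```python
import numpy as np

def init_params(dim_in, dim_hidden, dim_out, rng):
    # Small random weights; the layer sizes are illustrative assumptions.
    return {
        "W1": rng.normal(0.0, 0.1, (dim_in, dim_hidden)),
        "b1": np.zeros(dim_hidden),
        "W2": rng.normal(0.0, 0.1, (dim_hidden, dim_out)),
        "b2": np.zeros(dim_out),
    }

def forward(params, x):
    # x has shape (frames, dim_in); returns hidden activations and outputs.
    h = np.tanh(x @ params["W1"] + params["b1"])
    return h, h @ params["W2"] + params["b2"]

def gradients(params, x, y):
    # Back propagation for 0.5 * (sum of squared errors over the rows of x).
    h, y_hat = forward(params, x)
    err = y_hat - y
    d_pre = (err @ params["W2"].T) * (1.0 - h ** 2)
    return {
        "W1": x.T @ d_pre, "b1": d_pre.sum(axis=0),
        "W2": h.T @ err, "b2": err.sum(axis=0),
    }

def sgd_step(params, grads, lr, scale):
    # In-place gradient descent update.
    for key in params:
        params[key] -= lr * scale * grads[key]

def train_frame_error(params, sentences, lr, epochs, batch, rng):
    # Stage 1 (FE): pool all frames, ignore sentence boundaries,
    # and update on shuffled minibatches of frames.
    X = np.vstack([src for src, _ in sentences])
    Y = np.vstack([tgt for _, tgt in sentences])
    for _ in range(epochs):
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch):
            idx = order[start:start + batch]
            sgd_step(params, gradients(params, X[idx], Y[idx]), lr, 1.0 / len(idx))

def refine_sequence_error(params, sentences, lr, epochs):
    # Stage 2 (SE, as assumed here): one update per training sentence,
    # using the error accumulated over all of that sentence's frames.
    for _ in range(epochs):
        for src, tgt in sentences:
            sgd_step(params, gradients(params, src, tgt), lr, 1.0 / len(src))

def mean_sequence_error(params, sentences):
    # Average, over sentences, of the per-sentence sum of squared errors.
    return np.mean([np.sum((forward(params, src)[1] - tgt) ** 2)
                    for src, tgt in sentences])
```

Here `sentences` is a list of `(source_frames, target_frames)` array pairs of equal length per sentence. Under these simplifying assumptions the two stages differ only in how frames are grouped for each update, which is the point of the sketch: the refinement stage treats a whole sentence, rather than an isolated frame, as the unit whose conversion error is minimized.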

Some samples (click to play)

Source    Target    FE            SE
BDL       SLT       BDL to SLT    BDL to SLT
BDL       SLT       BDL to SLT    BDL to SLT
SLT       BDL       SLT to BDL    SLT to BDL
SLT       BDL       SLT to BDL    SLT to BDL
SLT       CLB       SLT to CLB    SLT to CLB
SLT       CLB       SLT to CLB    SLT to CLB
CLB       SLT       CLB to SLT    CLB to SLT
CLB       SLT       CLB to SLT    CLB to SLT
RMS       BDL       RMS to BDL    RMS to BDL
RMS       BDL       RMS to BDL    RMS to BDL
BDL       RMS       BDL to RMS    BDL to RMS
BDL       RMS       BDL to RMS    BDL to RMS