PARALLEL DICTIONARY LEARNING USING A JOINT DENSITY RESTRICTED BOLTZMANN MACHINE FOR SPARSE-REPRESENTATION-BASED VOICE CONVERSION
In voice conversion, sparse-representation-based methods have recently attracted attention because they are relatively robust to over-fitting and over-smoothing. In these approaches, voice conversion is achieved by estimating a sparse vector that determines which dictionaries of the target speaker should be used; this vector is calculated by matching the input vector against the dictionaries of the source speaker. Sparse-representation-based voice conversion methods can be broadly divided into two approaches: (1) those that use raw acoustic features from the training data as parallel dictionaries, and (2) those that train parallel dictionaries from the training data. Our approach belongs to the latter; we systematically estimate the parallel dictionaries using a restricted Boltzmann machine, a fundamental building block commonly used in deep learning. Through voice conversion experiments, we confirmed the high performance of our method by comparing it with a conventional Gaussian mixture model (GMM)-based approach and with a non-negative matrix factorization (NMF)-based approach, which is also based on sparse representation.
voice conversion, restricted Boltzmann machine, sparse representation, parallel dictionary learning.
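The conversion step described in the abstract (estimate a sparse vector against the source dictionary, then apply it to the target dictionary) can be illustrated with a minimal sketch. This is not the proposed RBM-based dictionary learning: it assumes the parallel dictionaries are already given, fills them with random non-negative placeholders, and estimates the sparse vector with an NMF-style multiplicative update. All names (`estimate_activations`, `convert`, `D_src`, `D_tgt`, `sparsity`) and all sizes are hypothetical.

```python
import numpy as np


def estimate_activations(x, D_src, n_iter=200, sparsity=0.1, eps=1e-12):
    """Estimate a non-negative sparse vector h such that x ~= D_src @ h.

    Assumption: x and D_src are non-negative. The multiplicative update
    below reduces 0.5 * ||x - D_src @ h||^2 + sparsity * sum(h).
    """
    n_atoms = D_src.shape[1]
    h = np.full(n_atoms, 1.0 / n_atoms)  # uniform, strictly positive start
    numer = D_src.T @ x                  # constant across iterations
    for _ in range(n_iter):
        denom = D_src.T @ (D_src @ h) + sparsity + eps
        h = h * numer / denom
    return h


def convert(x, D_src, D_tgt, **kwargs):
    """Map a source feature vector to the target speaker.

    The vector h is estimated against the source dictionary only and is
    then reused unchanged with the target dictionary.
    """
    h = estimate_activations(x, D_src, **kwargs)
    return D_tgt @ h, h


# Toy data (assumption: random placeholders, not real acoustic features).
rng = np.random.default_rng(0)
n_dim, n_atoms = 24, 50
D_src = rng.random((n_dim, n_atoms))   # source-speaker dictionary
D_tgt = rng.random((n_dim, n_atoms))   # paired target-speaker dictionary

h_true = np.zeros(n_atoms)
h_true[[3, 17, 41]] = [1.0, 0.5, 2.0]  # only three active columns
x = D_src @ h_true                     # synthetic source frame

y, h = convert(x, D_src, D_tgt)
```

The sketch relies on column `k` of `D_src` and column `k` of `D_tgt` being paired, which is what makes the dictionaries parallel; how those pairs are obtained is the part that the two families of methods in the abstract handle differently.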