This paper describes a novel method for concurrently estimating the fundamental frequency (F0) and the vowel phoneme of a singing voice in polyphonic music. Although this paper focuses on F0 and vowel phonemes, the method is designed so that other attributes of a singing voice, such as the singer's name and gender, can be estimated at the same time, and it can therefore be regarded as a new framework for recognizing a singing voice in sound mixtures. Rather than segregating the singing voice, the method stochastically models the mixture of the singing voice and the accompaniment sounds as it is. To obtain a reliable spectral envelope of the singing voice, it estimates the envelope from the harmonic structures of multiple sounds with various F0s. In an evaluation of concurrent F0 and vowel phoneme estimation on 10 popular-music songs by 6 singers, the proposed method improved F0 estimation performance by 3.7 points on average and phoneme estimation performance by 6.2 points on average.
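The benefit of using sounds with various F0s can be illustrated with a small sketch. A single harmonic sound only reveals its spectral envelope at integer multiples of its F0, so pooling the harmonics of several sounds with different F0s samples the envelope more densely. The code below is an illustration under stated assumptions, not the estimator used in the paper: `toy_envelope`, the 4000 Hz limit, and the F0 values are all hypothetical.

```python
import numpy as np

def harmonic_frequencies(f0_hz, max_freq_hz):
    """Frequencies of the harmonics k * f0 that fall at or below max_freq_hz."""
    n = int(np.floor(max_freq_hz / f0_hz))
    return f0_hz * np.arange(1, n + 1)

def pooled_envelope_samples(f0_list_hz, envelope, max_freq_hz=4000.0):
    """Pool (frequency, amplitude) samples from sounds with different F0s.

    `envelope` is a hypothetical callable mapping frequency in Hz to amplitude;
    it stands in for the unknown spectral envelope of the voice.
    """
    freqs = np.concatenate(
        [harmonic_frequencies(f0, max_freq_hz) for f0 in f0_list_hz]
    )
    freqs = np.unique(freqs)  # sorted, with any shared harmonics merged
    return freqs, envelope(freqs)

def toy_envelope(freq_hz):
    # Hypothetical single-formant envelope centred at 700 Hz (illustration only).
    return np.exp(-0.5 * ((freq_hz - 700.0) / 300.0) ** 2)

# One sound at 200 Hz versus three sounds at 200, 230 and 270 Hz.
single_f, single_a = pooled_envelope_samples([200.0], toy_envelope)
pooled_f, pooled_a = pooled_envelope_samples([200.0, 230.0, 270.0], toy_envelope)
print(len(single_f), len(pooled_f))
```

With these assumed values, the pooled set contains more sample points than the single-F0 set, which is the intuition behind estimating one envelope from many harmonic structures.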
This paper describes a method for automatically synchronizing polyphonic musical audio signals that include accompaniment with their corresponding lyrics. Viterbi alignment techniques that estimate the temporal correspondence between a clean speech signal and its transcription already exist, but they could not be applied to vocals in commercial CD recordings because of the adverse effect of accompaniment sounds played together with the vocals. To solve this problem, we propose three methods: a method for segregating vocals from the sound mixture by extracting and resynthesizing the harmonic structure of the vocal melody, a method for detecting vocal sections using a hidden Markov model (HMM) that transitions back and forth between vocal and non-vocal states, and a method for applying Viterbi alignment by adapting the acoustic (phone) model of a speech recognizer to the segregated vocal signals. Evaluation experiments on 10 Japanese popular-music songs showed that the method can synchronize music and lyrics with satisfactory accuracy for 8 of the 10 songs.
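The vocal-section detection step can be sketched as Viterbi decoding over a two-state HMM. The code below is a minimal sketch, not the authors' implementation: the per-frame likelihoods, the transition probabilities, and the function name `viterbi_path` are assumptions chosen for illustration, whereas in practice the likelihoods would come from trained vocal and non-vocal acoustic models.

```python
import numpy as np

def viterbi_path(log_lik, log_trans, log_init):
    """Most likely state sequence of an HMM, computed in the log domain.

    log_lik:   (T, S) per-frame log-likelihood of each state
    log_trans: (S, S) log transition probabilities, row = from, column = to
    log_init:  (S,)   log initial state probabilities
    """
    T, S = log_lik.shape
    delta = np.empty((T, S))            # best log score ending in each state
    back = np.zeros((T, S), dtype=int)  # best predecessor of each state
    delta[0] = log_init + log_lik[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans  # indexed (from, to)
        back[t] = np.argmax(scores, axis=0)
        delta[t] = scores[back[t], np.arange(S)] + log_lik[t]
    # Trace the best path backwards from the best final state.
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmax(delta[-1]))
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path

# State 0 = non-vocal, state 1 = vocal. All numbers below are assumed toy values.
p_vocal = np.array([0.1, 0.1, 0.1, 0.9, 0.9, 0.4, 0.9, 0.9, 0.1, 0.1])
log_lik = np.log(np.stack([1.0 - p_vocal, p_vocal], axis=1))
log_trans = np.log(np.array([[0.9, 0.1],
                             [0.1, 0.9]]))  # "sticky" states discourage rapid switching
log_init = np.log(np.array([0.5, 0.5]))

path = viterbi_path(log_lik, log_trans, log_init)
print(path.tolist())  # [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]
```

In this toy input, frame 5 looks slightly more like non-vocal on its own, but the transition penalty keeps it inside the vocal section. This smoothing is the reason for using an HMM rather than classifying each frame independently.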