Unknown Word Extraction from Corpora Using n-gram Statistics (nグラム統計によるコーパスからの未知語抽出)

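Authors: Shinsuke Mori (森信介), Makoto Nagao (長尾眞)
Publisher: Information Processing Society of Japan (一般社団法人情報処理学会)
Journal: IPSJ SIG Technical Reports, Natural Language Processing (NL)
Volume, issue, pages, and publication date: vol.1995, no.69, pp.7-12, 1995-07-20
Cited by: 14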

In natural language processing, dictionaries are indispensable as a source of information about the grammatical functions and meanings of words, and considerable effort goes into expanding their vocabulary to reduce the number of unregistered words. Because new words and technical terms keep appearing, building a dictionary demands a great deal of labor, and unknown words remain unavoidable at every stage of analysis, which makes them a major problem. To address this, we propose a method that uses n-gram statistics to extract words from a corpus and, at the same time, estimate the part of speech (POS) to which each word belongs. The method rests on the assumption that words belonging to the same POS have similar distributions of the strings that precede and follow them. Experiments confirmed that the method is effective for estimating the POS of unknown words and for building dictionaries.
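The distributional assumption in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not state the context length or the similarity measure, so the single-character contexts, the cosine similarity, and every function name below (`context_counts`, `pos_profile`, `pos_similarity`) are assumptions chosen only for illustration.

```python
from collections import Counter
import math


def context_counts(corpus, target):
    """Count the characters immediately before and after each occurrence of target.

    Assumption: contexts are single characters; "^" and "$" mark the corpus edges.
    """
    before, after = Counter(), Counter()
    start = corpus.find(target)
    while start != -1:
        end = start + len(target)
        before[corpus[start - 1] if start > 0 else "^"] += 1
        after[corpus[end] if end < len(corpus) else "$"] += 1
        start = corpus.find(target, start + 1)
    return before, after


def pos_profile(corpus, words):
    """Pool the context counts of all known words of one part of speech."""
    before, after = Counter(), Counter()
    for word in words:
        b, a = context_counts(corpus, word)
        before.update(b)
        after.update(a)
    return before, after


def cosine(p, q):
    """Cosine similarity between two count vectors (assumed measure)."""
    dot = sum(p[k] * q[k] for k in p if k in q)
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(
        sum(v * v for v in q.values())
    )
    return dot / norm if norm else 0.0


def pos_similarity(corpus, candidate, pos_words):
    """Average the similarity of the preceding and following context distributions."""
    cand_before, cand_after = context_counts(corpus, candidate)
    prof_before, prof_after = pos_profile(corpus, pos_words)
    return (cosine(cand_before, prof_before) + cosine(cand_after, prof_after)) / 2


# Toy corpus (hypothetical data): the candidate "狐" should look more like a noun.
corpus = "猫が走る。犬が走る。狐が跳ぶ。"
noun_sim = pos_similarity(corpus, "狐", ["猫", "犬"])
verb_sim = pos_similarity(corpus, "狐", ["走る", "跳ぶ"])
print(noun_sim > verb_sim)
```

In this toy corpus the candidate is followed by "が" just like the known nouns, so its similarity to the noun profile exceeds its similarity to the verb profile. A realistic setting would need a large corpus, longer n-gram contexts, and a way to enumerate candidate strings, none of which is specified in the abstract.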
