Authors
Masato Hagiwara, Yasuhiro Ogawa, Katsuhiko Toyama
Publisher
The Japanese Society for Artificial Intelligence
Journal
Transactions of the Japanese Society for Artificial Intelligence (ISSN:13460714)
Vol./No./Pages/Date
vol.26, no.3, pp.440-450, 2011 (Released:2011-04-01)
References
32
Cited by
1

Extraction of named entity classes and their relationships from large corpora often involves morphological analysis of target sentences and tends to suffer from out-of-vocabulary words. In this paper we propose a semantic category extraction algorithm called Monaka and its graph-based extension g-Monaka, both of which use character n-gram based patterns as context to directly extract semantically related instances from unsegmented Japanese text. These algorithms also use "bidirectional adjacent constraints," which state that reliable instances should be placed between reliable left and right context patterns, in order to improve segmentation. The Monaka algorithm uses iterative induction of instances and patterns, similarly to the bootstrapping algorithm Espresso. The g-Monaka algorithm further formalizes the adjacency relation of character n-grams as a directed graph and applies the von Neumann kernel and the Laplacian kernel to reduce the negative effect of semantic drift, i.e., the phenomenon of semantically unrelated general instances being extracted. Experiments show that g-Monaka substantially outperforms conventional methods, including distributional similarity, the bootstrapping-based Espresso, and its graph-based extension g-Espresso, in terms of F-measure on the named-entity category task over unsegmented Japanese newspaper articles.
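As a rough illustration of the graph-kernel step the abstract mentions (a simplified sketch, not the paper's formulation — the construction and weighting of the character n-gram adjacency graph are abstracted away here), the von Neumann kernel of an adjacency matrix A sums walks of every length with geometrically decaying weights, K = Σ_{n≥1} γ^{n-1} A^n, and can be approximated by truncating the series:

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def von_neumann_kernel(adj, gamma, n_terms=200):
    """Truncated von Neumann kernel K = sum_{n>=1} gamma^(n-1) * A^n.

    The series converges when gamma is smaller than the reciprocal of
    the spectral radius of A; n_terms controls the truncation.
    """
    n = len(adj)
    k = [[0.0] * n for _ in range(n)]
    power = adj        # current A^n
    coef = 1.0         # current gamma^(n-1)
    for _ in range(n_terms):
        for i in range(n):
            for j in range(n):
                k[i][j] += coef * power[i][j]
        power = matmul(power, adj)
        coef *= gamma
    return k
```

On a two-node graph with A = [[0, 1], [1, 0]] and γ = 0.5 this converges to [[2/3, 4/3], [4/3, 2/3]], matching the closed form A(I − γA)^{-1}. The regularized Laplacian kernel can be expanded the same way, with −βL (L the graph Laplacian) in place of γA; varying γ or β trades off the influence of long walks, which is how such kernels moderate drift toward overly general instances.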
Authors
Masato Hagiwara, Yasuhiro Ogawa, Katsuhiko Toyama
Publisher
Information Processing Society of Japan (IPSJ)
Journal
IPSJ SIG Technical Report, Natural Language Processing (NL) (ISSN:09196072)
Vol./No./Pages/Date
vol.2005, no.22, pp.71-78, 2005-03-11
References
16
Cited by
2

A common way to obtain synonym relationships from large corpora is to utilize features such as co-occurrence and word context. However, methods based on the direct use of surface information about words suffer from noise and sparseness. This paper describes how to use PLSI, a latent semantic model based on probability theory and information theory, to infer the latent meaning of words and obtain synonym relationships between nouns. Experiments show that PLSI achieves the best performance compared to conventional methods such as tf-idf and LSI, demonstrating its effectiveness for automatic thesaurus construction. Various fundamental techniques for applying PLSI to automatic synonym acquisition are also discussed.
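As a minimal sketch of the PLSI model the abstract refers to (a toy EM fit under standard assumptions, not the authors' implementation; refinements such as tempered EM are omitted), the mixture P(d, w) = Σ_z P(z) P(d|z) P(w|z) can be estimated from co-occurrence counts as follows:

```python
import random

def plsi_em(counts, n_topics, n_iter=50, seed=0):
    """Fit PLSI, P(d,w) = sum_z P(z) P(d|z) P(w|z), by EM.

    counts: dict mapping (doc, word) -> co-occurrence count.
    Returns (pz, pd_z, pw_z); rows of pd_z / pw_z follow the
    sorted order of the docs / words seen in counts.
    """
    rng = random.Random(seed)
    docs = sorted({d for d, _ in counts})
    words = sorted({w for _, w in counts})
    di = {d: i for i, d in enumerate(docs)}
    wi = {w: j for j, w in enumerate(words)}

    def norm(v):
        s = sum(v)
        return [x / s for x in v]

    # Random (strictly positive) initialization, normalized.
    pz = norm([rng.random() + 0.1 for _ in range(n_topics)])
    pd_z = [norm([rng.random() + 0.1 for _ in docs]) for _ in range(n_topics)]
    pw_z = [norm([rng.random() + 0.1 for _ in words]) for _ in range(n_topics)]

    for _ in range(n_iter):
        new_pz = [0.0] * n_topics
        new_pd = [[0.0] * len(docs) for _ in range(n_topics)]
        new_pw = [[0.0] * len(words) for _ in range(n_topics)]
        for (d, w), c in counts.items():
            i, j = di[d], wi[w]
            # E-step: responsibilities P(z|d,w) up to normalization.
            post = [pz[z] * pd_z[z][i] * pw_z[z][j] for z in range(n_topics)]
            s = sum(post) or 1e-12
            # M-step: accumulate expected counts.
            for z in range(n_topics):
                r = c * post[z] / s
                new_pz[z] += r
                new_pd[z][i] += r
                new_pw[z][j] += r
        pz = norm(new_pz)
        pd_z = [norm(row) for row in new_pd]
        pw_z = [norm(row) for row in new_pw]
    return pz, pd_z, pw_z
```

Noun similarity can then be measured in the latent space, e.g. by comparing the topic posteriors P(z|w) ∝ P(z) P(w|z) of two nouns, rather than their sparse surface co-occurrence vectors.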