國府 久嗣 山崎 治子 野坂 政司
日本感性工学会論文誌 (ISSN:18840833)
vol.12, no.4, pp.511-518, 2013 (Released:2013-12-11)

Extracting keywords from target text data is essential for analyses that characterize the substance of message content. Among the available alternatives, we chose a stopword filter because it is simple yet effective. The filter we present consists of non-content words and low-content words. Non-content words are mainly function words and were removed using part-of-speech (POS) tag information. Among the remaining words, those with a high occurrence rate are keyword candidates; however, they usually include some low-content words, such as delexical verbs. This article presents a stopword list of low-content words, obtained through intuitive manual procedures carried out on 40 text files from the CASTEL/J database, and validates it from the viewpoint of general versatility.
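The two-stage filter described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the tag set in `CONTENT_POS`, the function name `extract_keywords`, and the sample tokens and stopword list are all assumptions made for the example, and the input is assumed to be already POS-tagged.

```python
from collections import Counter

# Hypothetical set of POS tags treated as content-bearing.
# The tag set actually used in the paper is not given in the abstract.
CONTENT_POS = {"NOUN", "VERB", "ADJ"}


def extract_keywords(tagged_tokens, low_content_stopwords, top_n=5):
    """Return the top_n most frequent words that survive both filter stages."""
    # Stage 1: remove non-content words (function words) by POS tag.
    content = [word for word, pos in tagged_tokens if pos in CONTENT_POS]
    # Stage 2: remove low-content words (e.g. delexical verbs) by stopword list.
    kept = [word for word in content if word not in low_content_stopwords]
    # Remaining high-frequency words are the keyword candidates.
    return Counter(kept).most_common(top_n)


# Illustrative input only; not data from the paper.
tagged = [
    ("the", "DET"), ("filter", "NOUN"), ("make", "VERB"), ("a", "DET"),
    ("list", "NOUN"), ("of", "ADP"), ("filter", "NOUN"), ("make", "VERB"),
    ("list", "NOUN"), ("keyword", "NOUN"), ("filter", "NOUN"),
]
stopwords = {"make"}  # hypothetical low-content (delexical) verb

print(extract_keywords(tagged, stopwords, top_n=2))
```

Here "make" passes the POS stage because it is a verb, and is removed only by the stopword list. This is the case the low-content list is meant to cover.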
國府 久嗣 園田 勝英
情報処理学会研究報告自然言語処理(NL) (ISSN:09196072)
vol.2007, no.35, pp.15-20, 2007-03-28

This paper examines a method of message analysis that focuses on collocations among the lexical items in a Japanese text and visualizes their frequencies, and suggests that such visualization is useful for interpreting the message(s) of the text. Multidimensional scaling (MDS) is the main statistical technique used. Span is currently considered the central parameter of the notion of "collocation", in that two elements are said to be in collocation when they co-occur within a certain specified span. We examine how the analysis results change when the span value and the way span is determined are varied. We show that when span is defined with non-lexical, i.e. functional, elements excluded, the span length does not significantly affect the final configurations obtained through visualization. We also note that this property is desirable, and supports our initial suggestion, because the message of a text that we are trying to capture is considered to be one of its constant properties.
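The two span definitions compared in this abstract can be sketched as below. This is a rough illustration under stated assumptions, not the authors' procedure: the function name `collocation_counts`, the `lexical_pos` tag set, and the sample tokens are hypothetical, and the MDS step that would take these counts as input is not shown.

```python
from collections import Counter


def collocation_counts(tagged_tokens, span, lexical_only=True,
                       lexical_pos=frozenset({"NOUN", "VERB", "ADJ"})):
    """Count unordered pairs of lexical items that co-occur within `span` positions."""
    if lexical_only:
        # Span counted over lexical items only: drop functional elements first.
        seq = [(word, True) for word, pos in tagged_tokens if pos in lexical_pos]
    else:
        # Span counted over all tokens: functional elements occupy positions
        # in the window but are never counted as members of a pair.
        seq = [(word, pos in lexical_pos) for word, pos in tagged_tokens]

    counts = Counter()
    for i, (w1, is_lexical1) in enumerate(seq):
        if not is_lexical1:
            continue
        # Look ahead at the next `span` positions.
        for w2, is_lexical2 in seq[i + 1 : i + 1 + span]:
            if is_lexical2 and w2 != w1:
                counts[tuple(sorted((w1, w2)))] += 1
    return counts


# Illustrative input only; not data from the paper.
tokens = [
    ("cat", "NOUN"), ("of", "ADP"), ("the", "DET"),
    ("house", "NOUN"), ("eats", "VERB"), ("fish", "NOUN"),
]

print(collocation_counts(tokens, span=1))
print(collocation_counts(tokens, span=1, lexical_only=False))
```

With functional elements excluded, "cat" and "house" are adjacent and count as a collocation at span 1. With them included, the same pair needs a span of at least 3, so whether it is counted depends on the span setting.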