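Authors
劉 雪琴 金 明哲
Publisher
情報知識学会
Journal
情報知識学会誌 (ISSN:09171436)
Volume, issue, pages / publication date
vol.27, no.3, pp.245-260, 2017-09-28 (Released:2017-11-24)
Number of references
27

In recent years, a growing body of research has examined changes in an author's emotions, thought, and mental state through changes in the stylistic features extracted from text. This paper uses a quantitative approach to determine when the writing style of 宇野浩二 (Uno Koji), an author who suffered a serious brain illness, changed. Uno Koji was a well-known Japanese writer who developed a mental illness in 1927 and stopped writing for about six years. His style is said to have changed markedly when he returned to the literary world in 1933. However, 「日曜日」, published before his hospitalization, shows features similar to those of his post-illness works, which suggests that his style may already have begun to change before his medical leave. We therefore took the works published immediately before his hospitalization as the target and analyzed them with discriminant analysis. The results show that Uno Koji's style had already begun to change before he was hospitalized.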
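Authors
孫 昊 金 明哲
Publisher
情報知識学会
Journal
情報知識学会誌 (ISSN:09171436)
Volume, issue, pages / publication date
vol.28, no.1, pp.3-14, 2018-02-27 (Released:2018-04-13)
Number of references
35

The question of ghostwriting in the girls' novels of 川端康成 (Kawabata Yasunari) has long been raised, and 『花日記』 in particular is strongly suspected of having been ghostwritten by 中里恒子 (Nakazato Tsuneko). This study applies stylometric methods to offer a new resolution to the ghostwriting question for this novel. Using character-and-symbol bigrams, morpheme-tag bigrams, and phrase (bunsetsu) patterns extracted from the texts as features, we carried out classification with AdaBoost, high-dimensional discriminant analysis (HDDA), logistic model trees (LMT), support vector machines (SVM), and random forests (RF). The analysis led to the conclusion that 『花日記』 was written jointly by Kawabata Yasunari and Nakazato Tsuneko.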
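As a rough illustration of the first feature type named in this abstract, the sketch below computes relative frequencies of character-and-symbol bigrams. The function name `char_bigram_relative_freq` and the sample sentence are assumptions made for illustration; the paper's actual preprocessing is not described in the abstract.

```python
from collections import Counter
from typing import Dict

def char_bigram_relative_freq(text: str) -> Dict[str, float]:
    """Return the relative frequency of each adjacent character pair in text."""
    # Slide a two-character window over the text. Punctuation is kept on
    # purpose, since the abstract counts symbols as well as characters.
    bigrams = [text[i:i + 2] for i in range(len(text) - 1)]
    counts = Counter(bigrams)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {bigram: n / total for bigram, n in counts.items()}

# Made-up sentence, not taken from either author's work.
sample = "花は咲く,花は散る."
freqs = char_bigram_relative_freq(sample)
```

Each text then becomes one row of a feature table whose columns are bigrams, which is the kind of input a classifier such as a random forest would take.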
Authors
金 明哲
Publisher
The Behaviormetric Society of Japan
Journal
行動計量学 (ISSN:03855481)
Volume, issue, pages / publication date
vol.36, no.2, pp.89-103, 2009 (Released:2010-06-29)
Number of references
27
Number of citations
1

As groundwork for research on dating literary works, this study attempted to estimate when the works of Ryunosuke Akutagawa were written. In the experiment, two types of data sets were created from part-of-speech-tagged text, and three methods were compared: linear regression, support vector regression, and random forest regression. The time of writing was estimated with fairly high accuracy. The mean absolute estimation error and the standard deviation were approximately 1.4 years. In descending order of accuracy, the methods ranked random forest regression, support vector regression, and linear regression.
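The accuracy figure in this abstract is a mean absolute error in years. A minimal sketch of that calculation follows; the years used are hypothetical and are not data from the study.

```python
from typing import Sequence

def mean_absolute_error(true_years: Sequence[float],
                        predicted_years: Sequence[float]) -> float:
    """Average of |true - predicted| over all works."""
    if len(true_years) != len(predicted_years) or len(true_years) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(t - p) for t, p in zip(true_years, predicted_years))
    return total / len(true_years)

# Hypothetical writing years and model estimates (illustration only).
true_years = [1915, 1920, 1925]
predicted_years = [1916, 1918, 1925]
mae = mean_absolute_error(true_years, predicted_years)
```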
Authors
財津 亘 金 明哲
Publisher
日本法科学技術学会
Journal
日本法科学技術学会誌 (ISSN:18801323)
Volume, issue, pages / publication date
pp.678, (Released:2014-10-31)
vol.20, no.1, pp.1-14, 2015 (Released:2015-02-10)
Number of references
28
Number of citations
1

This study investigated the effectiveness of text mining for identifying the author of an illegal document. The suspected writing was a claim of responsibility written by a 14-year-old boy, which stated that he had committed the “Kobe child murders” in 1997. It was compared with control writings, including confessions and an essay known to have been written by the same boy, and with irrelevant materials, including various essays written by five junior high school students and claims of responsibility from four past criminal cases. First, the writings in each document were digitized and converted to text files. Then, the relative frequencies of character bigrams and part-of-speech tag bigrams, the sentence lengths of each document, and the rates of use of Kanji, Hiragana, and Katakana were calculated. Sammon multidimensional scaling and hierarchical cluster analysis placed the texts of the suspected writing in or close to the groups of control texts and apart from the groups of irrelevant texts. In a separate analysis, the suspected writing was replaced with a document written by a different offender and the same procedure was repeated. In that case, the texts of the suspected writing were placed apart from both the control and the irrelevant texts. These results indicate that text mining is effective for identifying an author when examining forensic documents.
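Two of the features listed in this abstract, script usage rates and sentence length, can be sketched with the standard library alone. The Unicode ranges, the choice of denominator, and the sentence-ending marks below are assumptions for illustration, not the authors' stated procedure.

```python
import re
from typing import Dict, List

def script_rates(text: str) -> Dict[str, float]:
    """Share of Kanji, Hiragana, and Katakana among all characters in text."""
    counts = {"kanji": 0, "hiragana": 0, "katakana": 0}
    for ch in text:
        code = ord(ch)
        if 0x4E00 <= code <= 0x9FFF:      # CJK Unified Ideographs (basic block)
            counts["kanji"] += 1
        elif 0x3041 <= code <= 0x309F:    # Hiragana block
            counts["hiragana"] += 1
        elif 0x30A0 <= code <= 0x30FF:    # Katakana block (includes the long-vowel mark)
            counts["katakana"] += 1
    total = len(text)
    # Assumption: the denominator is every character, punctuation included.
    return {k: (v / total if total else 0.0) for k, v in counts.items()}

def sentence_lengths(text: str) -> List[int]:
    """Length in characters of each sentence, split on common end marks."""
    parts = re.split(r"[。.!?]", text)
    return [len(p) for p in parts if p]

# Made-up sentence for illustration.
sample = "私はカメラを買った。"
rates = script_rates(sample)
lengths = sentence_lengths(sample)
```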
Authors
財津 亘 金 明哲
Publisher
日本行動計量学会
Journal
行動計量学 (ISSN:03855481)
Volume, issue, pages / publication date
vol.45, no.1, pp.39-47, 2018 (Released:2018-11-03)
Number of references
23

This study examined the accuracy of author identification by text mining. We conducted 16 analyses (four writing-style features × four multivariate analyses) on texts by 100 bloggers, each approximately 1,000 characters long. Specifically, we applied (1) principal component analysis, (2) correspondence analysis, (3) multidimensional scaling, and (4) hierarchical cluster analysis to each writing-style feature: (1) rate of usage of non-independent words, (2) bigrams of parts of speech, (3) bigrams of postpositional particles, and (4) positioning of commas. We obtained high accuracy: 100% sensitivity and 95.1% specificity. Furthermore, age and gender had no effect on the accuracy of author identification.
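For readers unfamiliar with the two accuracy measures reported here, the sketch below shows the standard definitions. It assumes that a "positive" means the suspected and control texts share an author; the counts are hypothetical and are not the study's data.

```python
from typing import Tuple

def sensitivity_specificity(tp: int, fn: int, tn: int, fp: int) -> Tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    sensitivity = tp / (tp + fn)  # same-author cases correctly judged "same"
    specificity = tn / (tn + fp)  # different-author cases correctly judged "different"
    return sensitivity, specificity

# Hypothetical confusion counts (illustration only).
sens, spec = sensitivity_specificity(tp=9, fn=1, tn=18, fp=2)
```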
Authors
金 明哲
Publisher
The Behaviormetric Society of Japan
Journal
行動計量学 (ISSN:03855481)
Volume, issue, pages / publication date
vol.40, no.1, pp.17-28, 2013-03-28 (Released:2013-09-28)
Number of references
14
Number of citations
2

This paper proposes a method for authorship identification based on phrase patterns that occur in Japanese, and validates it empirically on literary works, student compositions, and journals. The results showed that a writer's characteristics appear clearly in phrase patterns. Using Random Forests, the rate of correct identification between two arbitrary authors was 99% for literary works and student compositions and 92% for journals. To show the effectiveness of the proposed method, phrase patterns were compared with POS trigrams. No obvious difference in the rate of correct identification was found between phrase patterns C and POS trigrams. However, combining the phrase patterns C data with morphological data yielded a higher rate of correct identification than combining the POS trigram data with morphological data. On this basis, we analyzed the authorship doubts surrounding Kawabata Yasunari's works against the works of Mishima Yukio, HMakoto and Sawana Hisao. The phrase pattern analysis suggested that there was no doubt about the authorship of Kawabata's works.
Authors
金 明哲
Publisher
The Behaviormetric Society of Japan
Journal
行動計量学 (ISSN:03855481)
Volume, issue, pages / publication date
vol.41, no.1, pp.35-46, 2014 (Released:2015-03-10)
Number of references
25
Number of citations
1

Text classification results often vary with the details of the data analysis, including the feature data, the classification method, and the parameter settings adopted. The author of an anonymous text can generally be identified by extracting a set of distinctive features from the text and then using those features to find the most likely author. Numerous efforts have been made to develop more robust feature extraction techniques and classification algorithms, but an important issue is how to select the feature datasets and the classification method. To address this issue, we propose an integrated classification algorithm that extracts multiple feature datasets from differing viewpoints and aspects of a text and applies multiple strong classifiers to those datasets. The proposed method achieved 100% accuracy in identifying the authors of literary works and student essays, and identified the author of all but 1 of 60 diaries written by 6 different people. It achieved accuracy equal to or better than that of any single strong classifier applied to an individual feature dataset. Furthermore, the accuracy in identifying the authors of student essays increased by roughly two percentage points.
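The abstract does not say how the outputs of the individual classifiers are combined. One common choice, shown here purely as an assumption, is a majority vote over the labels predicted by every (feature dataset, classifier) pair. The function name and the example labels are hypothetical.

```python
from collections import Counter
from typing import Sequence

def majority_vote(predictions: Sequence[str]) -> str:
    """Return the author label predicted most often.

    predictions holds one label per (feature dataset, classifier) pair.
    Ties are broken alphabetically, which is an arbitrary choice.
    """
    if not predictions:
        raise ValueError("no predictions to combine")
    counts = Counter(predictions)
    top = max(counts.values())
    winners = sorted(label for label, n in counts.items() if n == top)
    return winners[0]

# Hypothetical outputs from 3 feature datasets x 2 classifiers.
votes = ["A", "A", "B", "A", "B", "A"]
decision = majority_vote(votes)
```

Voting across several feature views is one way an ensemble can match or exceed its best single member, which is the behavior the abstract reports.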
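Authors
前田 侑亮 金 明哲
Publisher
情報知識学会
Journal
情報知識学会誌 (ISSN:09171436)
Volume, issue, pages / publication date
pp.2018_027, (Released:2018-10-19)
Number of references
22

The Kansai metropolitan area is known as the "private railway kingdom," and the five major Kansai private railways (Kintetsu, Keihan, Nankai, Hankyu, and Hanshin) have competed to develop the areas along their lines, playing a part in shaping the region's urban development. This study quantitatively analyzes the areas along these five railways from the standpoint of cultural value, with the aim of clarifying the characteristics of each corridor. For the analysis, we built a frequency matrix recording how many times each kind of cultural facility appears in each station catchment area, and applied the topic model LDA, which can focus on the information carried by the count data itself. The analysis revealed six latent characteristics in the areas along the five railways. Organizing these characteristics and classifying each company's main lines yielded three groups: "lines with historic corridors where community-based commercial districts stand out," "lines that link city centers with the suburbs between them and have well-developed living environments," and "lines that run through city centers and educational districts and serve largely as commuter lines for work and school."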
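The input to the topic model in this abstract is a count matrix of station catchment areas by facility types. The sketch below builds such a matrix from (station, facility) observations; the station and facility names are hypothetical, and the LDA step itself is not shown.

```python
from collections import Counter
from typing import List, Sequence, Tuple

def build_frequency_matrix(
    records: Sequence[Tuple[str, str]],
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Count how often each facility type appears in each station catchment.

    Returns the row labels (stations), column labels (facility types),
    and the count matrix as a list of rows.
    """
    stations = sorted({station for station, _ in records})
    facilities = sorted({facility for _, facility in records})
    counts = Counter(records)
    matrix = [[counts[(s, f)] for f in facilities] for s in stations]
    return stations, facilities, matrix

# Hypothetical observations (illustration only).
records = [
    ("Station A", "museum"),
    ("Station A", "library"),
    ("Station A", "museum"),
    ("Station B", "temple"),
]
stations, facilities, matrix = build_frequency_matrix(records)
```

In topic-model terms, each station catchment plays the role of a document and each facility type the role of a word.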
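Authors
財津 亘 金 明哲
Publisher
情報知識学会
Journal
情報知識学会誌 (ISSN:09171436)
Volume, issue, pages / publication date
vol.28, no.3, pp.253-258, 2018-09-30 (Released:2018-10-19)
Number of references
5

This paper proposes stylometric analysis based on multivariate data analysis as one method of establishing the identity of the perpetrator in cybercrimes such as postings on electronic bulletin boards, and attempts author identification for the so-called "PC remote control case" (「パソコン遠隔操作事件」), in which establishing the perpetrator's identity appears to have been difficult in practice. The analysis covered questioned texts (texts from the nine incidents of the PC remote control case), control texts (five texts related to the so-called "Nomaneko case" (「のまねこ事件」), which Mr. K, the true perpetrator of the PC remote control case, had previously committed and confessed to), and irrelevant texts (blog texts by ten men in their thirties, the same gender and age group as Mr. K, and texts from four other cases). Focusing on (1) the rate of usage of non-independent words, (2) part-of-speech trigrams, (3) particle bigrams, and (4) character bigrams, we performed hierarchical cluster analysis. The results suggested that the series of texts from the PC remote control case and the series of texts from the Nomaneko case may have been written by the same person.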
Authors
財津 亘 金 明哲
Publisher
日本法科学技術学会
Journal
日本法科学技術学会誌
Volume, issue, pages / publication date
2017

Author identification through text mining aims to judge whether the author of a suspected text is the same as the author of the control texts. This study examined the validity of a scoring method for author identification. In one unit of analysis, we conducted 18 analyses (six writing-style features × three multivariate analyses) across one suspected text of a blogger, one control text of a blogger, and irrelevant texts of four bloggers. The writing-style features were (1) rate of usage of non-independent words, (2) bigrams of parts of speech, (3) bigrams of postpositional particles, (4) positioning of commas, (5) rate of usage of Kanji, Hiragana, etc., and (6) sentence length. We applied (1) principal component analysis, (2) correspondence analysis, and (3) multidimensional scaling. Scores were obtained from the arrangement of texts in two dimensions. A score of 0 was given when the convex hull polygon (CHP) of the control texts overlapped that of the irrelevant texts. When the two CHPs did not overlap, a score of +2 was given when the suspected text fell inside the CHP of the control texts, +1 when it fell outside that CHP but near a control text, and −1 when it was near an irrelevant text. We totaled the scores in one unit of analysis (18 results) and analyzed the total scores of the 240 units of analysis for 10 bloggers under the following design: 2 (author combination of suspected and control texts: same, different) × 4 (number of characters: 250, 500, 1000, 1500) × 3 (number of control and irrelevant texts: 3, 6, 9). The results indicated that the scoring method was able to identify the authors. The effect of the number of characters on AUC was statistically significant, whereas that of the number of texts was not. Furthermore, the rate of usage of non-independent words and parts of speech were quite useful for identifying authors.
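The scoring rule in this abstract can be written as a small decision function. The sketch below takes the two geometric judgments (whether the convex hulls overlap, and whether the suspected text lies inside the control hull) as inputs rather than computing them, and it interprets "near" as "nearest by Euclidean distance". That interpretation, the tie-breaking, and all names are assumptions.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]

def score_one_analysis(
    hulls_overlap: bool,
    suspect_in_control_hull: bool,
    suspect: Point,
    control_points: Sequence[Point],
    irrelevant_points: Sequence[Point],
) -> int:
    """Score one two-dimensional arrangement of texts."""
    if hulls_overlap:
        return 0   # control and irrelevant hulls overlap
    if suspect_in_control_hull:
        return 2   # suspected text lies inside the control hull
    # Assumption: "near" means the single closest text decides the score.
    d_control = min(math.dist(suspect, p) for p in control_points)
    d_irrelevant = min(math.dist(suspect, p) for p in irrelevant_points)
    # Assumption: an exact tie counts as nearer to an irrelevant text.
    return 1 if d_control < d_irrelevant else -1

# Hypothetical coordinates (illustration only).
near_control = score_one_analysis(False, False, (0.0, 0.0), [(1.0, 0.0)], [(3.0, 0.0)])
near_irrelevant = score_one_analysis(False, False, (0.0, 0.0), [(4.0, 0.0)], [(3.0, 0.0)])
# A unit of analysis sums such scores over its 18 analyses; a short list stands in here.
total = sum([2, 1, 0, -1, near_control, near_irrelevant])
```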
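Authors
金 明哲 田中 栄一 丁 光躍
Journal
全国大会講演論文集
Volume, issue, pages / publication date
vol.40th meeting (第40回), no.Artificial Intelligence and Cognitive Science (人工知能及び認知科学), pp.480-481, 1990-03-14

In recent years, research on the computer processing of Chinese has advanced. When Chinese is entered into a computer in pinyin, or when Chinese speech is recognized by machine, there is no doubt that the linguistic information of Chinese must be used effectively. To understand the properties of Chinese, we therefore surveyed the 6,321 high-frequency Chinese words of reference 2), examining the frequency of initials (声母) and finals (韻母), the distribution of tones, word length in characters, the distribution of words by number of initials, and the number of near-distance words among words with the same number of characters.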
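One of the surveyed statistics, the tone distribution, is simple to sketch if the syllables are given in numbered pinyin. The input convention (a trailing digit 1 to 4 for the tone, 5 for the neutral tone) and the sample syllables are assumptions; the abstract does not describe the data format.

```python
from collections import Counter
from typing import Dict, Iterable

def tone_distribution(syllables: Iterable[str]) -> Dict[int, int]:
    """Count tones in numbered-pinyin syllables such as 'zhong1'."""
    counts: Counter = Counter()
    for syllable in syllables:
        # Only syllables that end in a tone digit are counted.
        if syllable and syllable[-1] in "12345":
            counts[int(syllable[-1])] += 1
    return dict(counts)

# Hypothetical syllables (illustration only).
sample = ["zhong1", "guo2", "ren2", "ni3", "hao3", "de5"]
tones = tone_distribution(sample)
```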
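Authors
孫 昊 李 鍾賛 金 明哲
Journal
研究報告人文科学とコンピュータ(CH) (ISSN:21888957)
Volume, issue, pages / publication date
vol.2015-CH-107, no.8, pp.1-4, 2015-08-02

Many ghostwriting questions surround 川端康成 (Kawabata Yasunari), Japan's first winner of the Nobel Prize in Literature, and one of them concerns 「花日記」. 「花日記」 is included in volume 20 of the 1981 新潮社 (Shinchosha) edition of Kawabata's complete works, but there is a theory that it was ghostwritten by 中里恒子 (Nakazato Tsuneko), a housewife-writer who was studying under Kawabata at the time. This study examined the ghostwriting question with an integrated classification algorithm, using as features the character-and-symbol bigrams, tag bigrams, and phrase (bunsetsu) patterns extracted from the texts.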