- Author
- 安岡 孝一
- Publisher
- 京都大學人文科學研究所
- Journal
- 東方學報 (ISSN:03042448)
- Volume, pages, date of issue
- vol.83, pp.349-360, 2008-09-25
This is a report of the proceedings of the research seminar "Constructing Kanji (漢字) Informatics", which was held from 2004 to 2008 and coordinated by Yasuoka Koichi. The seminar started out by considering a hierarchical model for representing digital text, consisting of four layers: an image layer, a text layer, a syntax layer, and a semantic layer. To better understand the relationship between the image and text layers, we spent some time analyzing the rules for vertical layout of complex text in Japanese and other East Asian languages, including the handling of pronunciation guides (so-called 'ruby'). The next step was to invert the direction and try to identify characters on the image representation of a text, in the same way an optical character recognition program proceeds. This turned out not to be so easy, especially with stone rubbings that exhibit an irregular layout of the characters, but it worked reasonably well for characters arranged in a regular grid. Moving on to the syntactic and semantic layers, the final topic of the seminar was to consider methods for adding punctuation marks (dots) to a Chinese text that has no punctuation at all. After trying a number of different statistical approaches, such as looking at characters that appear before or after punctuation dots in already punctuated texts, 2-grams, or even rhyme patterns, it became evident that a purely statistical approach would not give the desired results and that it was necessary to also take grammatical relations into account. The most promising approach in this respect seemed to be to use texts with reading marks for kanbun, which provide some basic grammatical annotation. It was therefore decided to devote a follow-up seminar to the development of a corpus of annotated kanbun texts that could be used as training and test material for morphological and syntactic parsers.
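
One way to picture the four-layer model is as a set of linked data structures, one per layer. The sketch below is purely illustrative: the class names, fields, and the choice of a simple gloss for the semantic layer are assumptions for this example, not definitions taken from the report.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class GlyphImage:
    """Image layer: a character as it appears on the page image."""
    page: int
    bbox: tuple  # (x, y, width, height) in page coordinates

@dataclass
class TextChar:
    """Text layer: the identified character as a code point."""
    codepoint: str
    source: Optional[GlyphImage] = None  # link back to the image layer

@dataclass
class SyntaxNode:
    """Syntax layer: grammatical structure built over the text layer."""
    label: str
    chars: List[TextChar] = field(default_factory=list)
    children: List["SyntaxNode"] = field(default_factory=list)

@dataclass
class SemanticAnnotation:
    """Semantic layer: a meaning (here simply a gloss) attached to a syntactic unit."""
    node: SyntaxNode
    gloss: str
```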
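
For the case where characters sit on a regular grid, identifying them on the page image can start from a simple segmentation step such as the following sketch. It assumes the page is already available as a 2-D pixel array and that the grid dimensions are known; both are assumptions made here for illustration.

```python
import numpy as np

def grid_cells(page: np.ndarray, rows: int, cols: int):
    """Cut a page image (2-D array of pixel values) into a regular rows x cols
    grid and yield one sub-image per character cell, following traditional
    layout: columns read top to bottom, right to left."""
    height, width = page.shape
    cell_h, cell_w = height // rows, width // cols
    for col in range(cols - 1, -1, -1):      # rightmost column first
        for row in range(rows):              # top to bottom within a column
            yield page[row * cell_h:(row + 1) * cell_h,
                       col * cell_w:(col + 1) * cell_w]
```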
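
The statistical punctuation experiments can be illustrated by a minimal character-pair model of the kind described above: count how often a dot appears between two neighbouring characters in already punctuated texts, then insert dots where that ratio is high. The dot character, the threshold, and the function names are assumptions for this sketch, not the seminar's actual implementation.

```python
from collections import Counter

DOT = "。"  # punctuation dot assumed for this example

def train(punctuated_texts):
    """Count, for each pair of neighbouring content characters, how often a
    dot stands between them versus not, in already punctuated texts."""
    dotted, undotted = Counter(), Counter()
    for text in punctuated_texts:
        chars = list(text)
        i = 0
        while i < len(chars) - 1:
            if chars[i] == DOT:
                i += 1
            elif chars[i + 1] == DOT and i + 2 < len(chars):
                dotted[(chars[i], chars[i + 2])] += 1
                i += 2
            elif chars[i + 1] != DOT:
                undotted[(chars[i], chars[i + 1])] += 1
                i += 1
            else:
                i += 1
    return dotted, undotted

def punctuate(text, dotted, undotted, threshold=0.5):
    """Insert a dot between two characters whenever the observed ratio of
    dotted to total occurrences of that pair exceeds the threshold."""
    out = [text[0]] if text else []
    for prev, nxt in zip(text, text[1:]):
        d, u = dotted[(prev, nxt)], undotted[(prev, nxt)]
        if d + u > 0 and d / (d + u) > threshold:
            out.append(DOT)
        out.append(nxt)
    return "".join(out)

# Toy usage: train on one punctuated sentence pair, then punctuate its bare form.
dotted, undotted = train(["學而時習之不亦說乎。有朋自遠方來不亦樂乎。"])
print(punctuate("學而時習之不亦說乎有朋自遠方來不亦樂乎", dotted, undotted))
```

As the report notes, pair statistics of this kind only get so far; they ignore exactly the grammatical relations that the follow-up seminar's kanbun corpus is meant to supply.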