Article list: 情報処理学会研究報告自然言語処理(NL) (IPSJ SIG Technical Reports, Natural Language Processing) (journal)

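A Proposal for a Workshop on Summarization and Visualization of Trend Information (動向情報の要約と可視化に関するワークショップの提案)

Authors: 加藤恒昭, 松下光範, 平尾努
Publisher: 一般社団法人情報処理学会
Journal: 情報処理学会研究報告自然言語処理(NL) (ISSN: 0919-6072)
Volume, issue, pages, date: vol.2004, no.108, pp.89-94, 2004-11-05
Cited by: 13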

Trend information, such as changes in a product's price or a cabinet's approval rating, is obtained by synthesizing and organizing time-series information. Communicating trend information effectively requires not only text but also visual media such as charts, used in coordination. This paper proposes a research theme in which trend information scattered across multiple documents is gathered, summarized, and presented as text and/or charts, and outlines a processing framework for it. We also describe a corpus that should be useful for this research, and propose a workshop in which researchers interested in summarizing and visualizing trend information share this corpus as common material.

https://ci.nii.ac.jp/naid/110002949384

OA Kana-Kanji Conversion Using a Stochastic Model (確率的モデルによる仮名漢字変換)

Authors: 森信介, 土屋雅稔, 山地治, 長尾真
Journal: 情報処理学会研究報告自然言語処理(NL)
Volume, issue, pages, date: vol.1998, no.48(1998-NL-125), pp.93-99, 1998-05-28

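This paper proposes kana-kanji conversion based on a stochastic model. Unlike conventional kana-kanji conversion based on rules and their weights, this method outputs the most probable kana-kanji mixed sentence for the input. To verify its effectiveness, we ran conversion experiments on a corpus containing katakana strings paired with kana-kanji mixed sentences and measured conversion accuracy. Accuracy is reported as recall and precision, both based on the number of characters in the longest common subsequence of the top conversion candidate and the correct sentence. The proposed method achieved a recall of 95.07% and a precision of 93.94%. These significantly exceed the recall (91.12%) and precision (91.17%) of Wnn6, a commercial kana-kanji converter, on the same test corpus, demonstrating the effectiveness of kana-kanji conversion based on a stochastic model.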
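
The LCS-based accuracy measure mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: it assumes recall divides the LCS length by the length of the correct sentence and precision divides it by the length of the top candidate, and the function names `lcs_length` and `lcs_recall_precision` are hypothetical.

```python
from typing import Tuple


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_recall_precision(reference: str, candidate: str) -> Tuple[float, float]:
    """Character-level recall and precision from the LCS length.

    Assumption: recall = LCS / len(reference), precision = LCS / len(candidate).
    """
    lcs = lcs_length(reference, candidate)
    recall = lcs / len(reference) if reference else 0.0
    precision = lcs / len(candidate) if candidate else 0.0
    return recall, precision


if __name__ == "__main__":
    # The two strings share only the subsequence "変換" (2 of 4 characters).
    print(lcs_recall_precision("漢字変換", "感じ変換"))
```

Because both scores are computed per character, a candidate that is correct except for one mis-converted word is penalized only in proportion to that word's length.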

http://id.nii.ac.jp/1001/00048876/

Functions of JMACS, a Japanese Screen Editor (日本語スクリーン・エディタJMACSの機能)

Authors: 斉藤康己, 野村浩郷
Journal: 情報処理学会研究報告自然言語処理(NL)
Volume, issue, pages, date: vol.1982, no.6, pp.1-6, 1982-05-21

https://ci.nii.ac.jp/naid/170000045354

OA Correcting Input Errors in Kana-Kanji Conversion (かな漢字変換における誤入力の訂正)

Authors: 山本喜大, 久保田淳市, 庄田幸恵, 白井豊
Journal: 情報処理学会研究報告自然言語処理(NL)
Volume, issue, pages, date: vol.1997, no.109(1997-NL-122), pp.105-111, 1997-11-20

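To improve the overall efficiency of Japanese text input, we prototyped a function that corrects input errors within kana-kanji conversion. The converter corrects bunsetsu (phrases) that contain exactly one substitution, insertion, or deletion error. We assumed two interfaces and evaluated their performance: one that presents multiple correction candidates in response to the user's correction request and lets the user choose (selective correction), and one that corrects errors automatically as part of the conversion operation (automatic correction). In the experiments, the recall of selective correction was 39% for the top candidate and 59% within the top six candidates. The recall of automatic correction was 4%, with an error rate of 3%.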

http://id.nii.ac.jp/1001/00048922/

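Automatic Extraction of Accident and Incident Names from Newspaper Articles (新聞記事中の事故・事件名の自動抽出)

Authors: 野畑周, 佐田いち子, 井佐原均
Publisher: 一般社団法人情報処理学会
Journal: 情報処理学会研究報告自然言語処理(NL) (ISSN: 0919-6072)
Volume, issue, pages, date: vol.2005, no.50, pp.125-130, 2005-05-27
Cited by: 1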

When we read newspaper articles to learn about a specific event, expressions denoting that event appear somewhere in each article, but their surface strings are often not unique. Expressions such as person and organization names tend to be fixed regardless of the text in which they appear, and the accuracy of named entity extraction systems that pick them out automatically has improved through recent research. Using such recognized named entities to extract the more flexible expressions that denote events, with a general-purpose method, is one direction for extending named entity extraction for information extraction. It is also useful in fields such as automatic summarization and machine translation, for capturing topical links between documents and for widening the range of expressions aligned between two languages. In this paper, among expressions that refer to specific events, we target incident and accident names, and present an extraction method and its evaluation.

https://ci.nii.ac.jp/naid/110002949470

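OA The Japanese Treebank Hinoki: A Corpus for Language Understanding (日本語ツリーバンク「檜」:言語理解のためのコーパス)

Authors: Bond Francis, 藤田早苗, 橋本力, 笠原要, 成山重子, Nichols Eric, 大谷朗, 田中貴秋, 天野成昭
Journal: 情報処理学会研究報告自然言語処理(NL)
Volume, issue, pages, date: vol.2004, no.1(2003-NL-159), pp.83-90, 2004-01-13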

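This paper introduces the treebank Hinoki (檜), built as part of the construction of a basic lexical knowledge base. Hinoki consists of dictionary definition sentences parsed with JaCY, a Japanese grammar written in HPSG, and is annotated with both detailed syntactic and semantic information. We describe the purpose of building Hinoki and its theoretical foundations. As one example of its usefulness, we also report the results of a preliminary knowledge acquisition experiment.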

http://id.nii.ac.jp/1001/00048210/

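Estimating Bilingual Term Correspondences from Japanese and English News Articles: Evaluating the Correlation between Term Frequency and Estimation Performance (日英報道記事からの訳語対応推定:ターム頻度と訳語対応推定性能の相関の評価)

Authors: 日野浩平, 宇津呂武仁, 中川聖一
Publisher: 一般社団法人情報処理学会
Journal: 情報処理学会研究報告自然言語処理(NL) (ISSN: 0919-6072)
Volume, issue, pages, date: vol.2004, no.73, pp.57-63, 2004-07-15
Cited by: 1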

In recent years, the websites of Japanese newspaper companies and similar outlets have carried news articles written in English as well as Japanese, and the English articles include reports with almost the same content as Japanese articles from the same period. This study collects Japanese and English documents from these news pages and takes the approach of acquiring, automatically or semi-automatically, translation knowledge across a wide variety of domains, covering domain-specific proper nouns (named entities), events, and collocational expressions. To acquire translation knowledge, we first retrieve pairs of Japanese and English articles whose content is nearly identical or closely related. We then estimate bilingual term correspondences using a correlation measure based on the co-occurrence of translation candidates across the related article pairs. In this paper, we use this measure to investigate whether estimation performance changes with the distribution of English term frequencies, and evaluate the correlation. We show experimentally that the more frequent the target English term, the more reliably its bilingual term correspondence can be estimated.

https://ci.nii.ac.jp/naid/110002911728

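OA Semantic Interpretation of Mathematical Expressions, with Its Grammar and Metalanguage (数式の意味解釈とその文法及びメタ言語)

Authors: 趙燕結, 櫻井鉄也, 杉浦洋, 鳥居達生
Journal: 情報処理学会研究報告自然言語処理(NL)
Volume, issue, pages, date: vol.1997, no.29(1996-NL-118), pp.73-78, 1997-03-21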

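Drawing on techniques from natural language processing and programming language processing, we build a formalized grammar of mathematical expressions, a metalanguage that can generate that grammar, and interpreters for both, in order to avoid ambiguity in mathematical expressions and to interpret context-dependent expressions.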

http://id.nii.ac.jp/1001/00048999/

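A Question Answering System Using Similarity-Based Inference (類似度に基づく推論を用いた質問応答システム)

Authors: 村田真樹, 内山将夫, 井佐原均
Publisher: 一般社団法人情報処理学会
Journal: 情報処理学会研究報告自然言語処理(NL) (ISSN: 0919-6072)
Volume, issue, pages, date: vol.2000, no.11, pp.181-188, 2000-01-27
References: 19
Cited by: 38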

Research on question answering systems is treated as an important problem at venues such as TREC8 and AAAI. In this study, we built a system that extracts answers fully automatically by matching a question against knowledge data written in natural language on the basis of similarity. To verify its effectiveness, we tested the system on sample data taken from the TREC8 home page and from Eiken (英検) exam questions, and obtained good results.

https://ci.nii.ac.jp/naid/110002935192

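Extracting Condition and Effect Parts from Legal Sentences in the Roppo Zensho Using Surface Clues (表層的手がかりによる六法全書法律文での要件部・効果部の抽出手法)

Authors: 角田達彦, 清水仁, 長尾眞
Publisher: 一般社団法人情報処理学会
Journal: 情報処理学会研究報告自然言語処理(NL)
Volume, issue, pages, date: vol.1997, no.4, pp.129-136, 1997-01-20
Cited by: 2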

This paper proposes a method that uses surface clues to analyze the global structure of legal sentences in the Roppo Zensho (六法全書) and to estimate the meaning of their conditions. The method divides the constituents of a sentence into subjects, conditions, and effects, checks whether they form contrastive structures, and uses the result to determine what each subject or condition depends on. It then identifies what each condition refers to from the condition's functional expressions. At the same time, it extracts conditions embedded in adnominal modifiers of the subject and in the effect part. As a result, 170 of 181 sentences (93%) in a training corpus of statutory provisions and 224 of 275 sentences (81%) in a test corpus were analyzed correctly. We also show that the topic-marking particle 「は」 (ha) and the presence or absence of commas are key to generating and recognizing contrastive structure, which in turn decides the dependency target.

https://ci.nii.ac.jp/naid/110002934641

OA Dictionary Generation for a Conversion-Error Checker (変換ミスチェッカーのための辞書生成)

Authors: 脇田早紀子, 金子宏
Journal: 情報処理学会研究報告自然言語処理(NL)
Volume, issue, pages, date: vol.1996, no.7(1995-NL-111), pp.27-32, 1996-01-19

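There has long been demand for Japanese proofreading support systems to flag kana-kanji conversion errors without omission. The system we had been building detected conversion errors using a notation for proofreading knowledge called patterns together with a dictionary of misused words, but because it was designed to avoid unnecessary warnings, it detected only about 50% of conversion errors. To warn without omission one could flag every word that has a homophone, but the number of warnings would be too large to be practical. We therefore built a mechanism that uses surrounding words, word sequences, parts of speech, and part-of-speech sequences as clues to exclude conversions that appear correct and so hold down the number of warnings. This presentation describes an attempt to generate that dictionary automatically from an accumulated archive of past documents.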

http://id.nii.ac.jp/1001/00049138/

OA Evaluation of a Probabilistic Concept Space Based on PLSA (PLSAによる確率的概念空間の評価)

Authors: 持橋大地, 松本裕治
Journal: 情報処理学会研究報告自然言語処理(NL)
Volume, issue, pages, date: vol.2003, no.4(2002-NL-153), pp.41-47, 2003-01-20

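This report presents an evaluation criterion for representations of the semantic concepts of words in a space, one that does not depend on the properties of that space, and shows that probabilistic representations are superior to conventional vector space representations. We also attempt to use AIC to determine the optimal number of dimensions of the concept space, which is an issue for computational cost.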

http://id.nii.ac.jp/1001/00048327/

OA Learning Word Importance According to Individual Preferences (個人の選好に応じた単語の重要度の学習)

Authors: 持橋大地, 加来田裕和
Journal: 情報処理学会研究報告自然言語処理(NL)
Volume, issue, pages, date: vol.1999, no.2(1998-NL-129), pp.35-39, 1999-01-20

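Weighting words is a basic and important task in applications such as message processing. tf・idf has conventionally been used for this purpose, but because it ignores context, it can miss important words. This study focuses on the distribution of a word's surroundings (周辺分布) as a criterion of word importance and proposes an index that combines it with frequency. The method can assign weights even where the text is not divided into documents, and yields weights that adapt to the training data. Applying it to judging the importance of e-mail suggested that content-based priority judgment and filtering are feasible.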

http://id.nii.ac.jp/1001/00048783/

OA Meaning as Association (連想としての意味)

Authors: 持橋大地, 松本裕治
Journal: 情報処理学会研究報告自然言語処理(NL)
Volume, issue, pages, date: vol.1999, no.95(1999-NL-134), pp.155-162, 1999-11-25

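This paper represents the meaning of a word as a probability distribution over associative relations between words, and describes its formalization and the acquisition of association probabilities. As an index of a word's semantic weight, we propose associative information (連想情報量), computed from the information content of the word's co-occurrence probability distribution, and compute association probabilities by combining it with co-occurrence probabilities. Association is carried out on a Markov process, and meaning is defined as its state probability distribution. Performing association as state transitions makes it possible to express semantic relations between words that do not co-occur directly. We also treat context as a scale transformation of meaning viewed as a probability vector, and propose a nonlinear update formula that does not assume a fixed number of preceding words, showing that it can express both the reinforcement of context and dependence on word order. Acquiring meaning from real text and modeling context enables not only semantic similarity and context analysis but also a variety of practical semantic processing in areas such as information retrieval.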

http://id.nii.ac.jp/1001/00048718/

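Authorship Attribution Using n-gram Distributions of Texts by Eight Modern Japanese Novelists (近代日本小説家8人による文章のn-gram分布を用いた著者判別)

Authors: 松浦司, 金田康正
Publisher: 一般社団法人情報処理学会
Journal: 情報処理学会研究報告自然言語処理(NL) (ISSN: 0919-6072)
Volume, issue, pages, date: vol.2000, no.53, pp.1-8, 2000-06-01
Cited by: 9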

This paper proposes a method for estimating the author of a text using the distribution of n-grams in the text as the author's feature. Author estimation is based on the dissimilarity between the probability distribution functions of n-gram occurrences in texts. We computed dissimilarity with the proposed function dissim as well as with Tankard's method, divergence, and cross entropy, and compared the authorship attribution accuracy of the four functions. We report author estimation experiments on 92 works by 8 modern Japanese authors, using 1-gram through 10-gram distributions as features. The method needs no additional information about the texts and no preprocessing such as morphological analysis. Because it does not exploit properties of any particular language or text type, it can be expected to apply as is to many languages and texts.
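
One of the four dissimilarity functions named in this abstract, cross entropy between character n-gram distributions, can be sketched as below. This is only an illustration under stated assumptions: the abstract does not define the proposed `dissim` function, so it is not reproduced here; the add-one smoothing, the shared vocabulary, and the names `char_ngrams`, `ngram_distribution`, `cross_entropy`, and `attribute` are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """All overlapping character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_distribution(text, n, vocab, alpha=1.0):
    """Smoothed n-gram probabilities over a shared vocabulary.

    Assumption: additive (add-alpha) smoothing, so no probability is zero.
    """
    counts = Counter(char_ngrams(text, n))
    total = sum(counts[g] for g in vocab) + alpha * len(vocab)
    return {g: (counts[g] + alpha) / total for g in vocab}


def cross_entropy(p, q):
    """Cross entropy H(p, q) in bits; p and q share the same keys."""
    return -sum(p[g] * math.log2(q[g]) for g in p)


def attribute(unknown, candidates, n=2):
    """Pick the candidate author whose text gives the lowest cross entropy."""
    vocab = set(char_ngrams(unknown, n))
    for text in candidates.values():
        vocab.update(char_ngrams(text, n))
    p = ngram_distribution(unknown, n, vocab)
    scores = {
        author: cross_entropy(p, ngram_distribution(text, n, vocab))
        for author, text in candidates.items()
    }
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    best, scores = attribute(
        "abababab", {"A": "abababababab", "B": "xyzxyzxyzxyz"}
    )
    print(best, scores)
```

Working on raw characters mirrors the abstract's point that no morphological analysis is needed; only the value of `n` and the smoothing constant have to be chosen.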

https://ci.nii.ac.jp/naid/110002935193

Paraphrasing Simple Sentences by Case Alternation (格変換による単文の言い換え)

Authors: 近藤恵子, 佐藤理史, 奥村学
Publisher: 一般社団法人情報処理学会
Journal: 情報処理学会研究報告自然言語処理(NL) (ISSN: 0919-6072)
Volume, issue, pages, date: vol.2000, no.11, pp.119-126, 2000-01-27
Cited by: 3

This paper proposes a method for mechanically paraphrasing simple sentences by case alternation. To this end we created 42 case-alternation rules and three dictionaries used to obtain the information needed for paraphrasing: a verb dictionary, a dictionary of intransitive-transitive verb correspondences, and a dictionary of animate/inanimate nouns. A case-alternation rule consists of a case mapping, a predicate mapping, a condition on noun phrases, and a condition on verbs. The noun-phrase condition restricts rule application according to whether a noun phrase in the input sentence is animate or inanimate. The verb condition restricts rule application according to the type of the input verb, the type of causative form, whether passivization is possible, and the cases. The dictionaries are used to obtain the verb to convert to and to check the conditions. We built a paraphrasing system that implements these rules and dictionaries; it produces paraphrases by applying case-alternation rules repeatedly. We ran experiments with the system and confirmed the effectiveness of the method.

https://ci.nii.ac.jp/naid/110002935184

Information Retrieval Using Non-negative Matrix Factorization (Non-negative Matrix Factorizationを用いた情報検索)

Authors: 柘植覚, 獅々堀正幹, 北研二
Publisher: 一般社団法人情報処理学会
Journal: 情報処理学会研究報告自然言語処理(NL) (ISSN: 0919-6072)
Volume, issue, pages, date: vol.2001, no.20, pp.1-6, 2001-03-05
Cited by: 4

The Vector Space Model (VSM) is a representative retrieval model in information retrieval, characterized by representing the documents to be searched and the queries as multidimensional vectors. Because these vectors are generally sparse and high-dimensional, problems arise such as limits imposed by computer memory and longer retrieval times. Moreover, as dimensionality increases, unnecessary words in documents act as noise and lower retrieval accuracy. This paper proposes a dimensionality reduction method for the vector space model based on Non-negative Matrix Factorization (NMF). NMF decomposes a non-negative matrix into the product of two non-negative matrices, which can be regarded as a basis matrix and a matrix of coordinates with respect to that basis. Making the rank of the basis matrix smaller than that of the original matrix achieves dimensionality reduction. Unlike Principal Component Analysis (PCA) and Singular Value Decomposition (SVD), whose decomposed matrices contain both positive and negative values, NMF factorizes under a non-negativity constraint, so the original matrix is expressed as a linear combination that uses only addition and no subtraction. This reflects our intuition of building a whole from its parts. In addition, since NMF can be carried out with simple iterative operations alone, it has advantages over other dimensionality reduction methods in computational cost and storage for large matrices. In retrieval experiments on the MEDLINE collection, NMF achieved higher retrieval performance than the ordinary vector space model.
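
The factorization described in this abstract can be illustrated with a small pure-Python sketch. The abstract does not say which iterative update the authors used; the code below assumes the widely used multiplicative update rules of Lee and Seung for the squared-error objective, and the function names (`matmul`, `transpose`, `frobenius_error`, `nmf`), the random initialization, and the toy matrix are hypothetical.

```python
import random


def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rwh in zip(v, wh) for x, y in zip(rv, rwh)) ** 0.5


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factor non-negative V (terms x documents) into W (basis) and H (coordinates).

    Assumption: multiplicative updates; eps only guards against division by zero.
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n_rows)]
    h = [[rng.random() + 0.1 for _ in range(n_cols)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        numer, denom = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * numer[i][j] / (denom[i][j] + eps) for j in range(n_cols)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        numer, denom = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * numer[i][j] / (denom[i][j] + eps) for j in range(rank)]
             for i in range(n_rows)]
    return w, h


if __name__ == "__main__":
    # Toy term-by-document matrix with two obvious "topics".
    v = [[2, 1, 0, 0], [4, 2, 0, 0], [0, 0, 3, 1], [0, 0, 6, 2]]
    w, h = nmf(v, rank=2)
    print(frobenius_error(v, w, h))
```

Each column of `h` is the reduced representation of one document; a retrieval system along the lines of the abstract would compare queries and documents in that lower-dimensional space rather than in the original term space.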

https://ci.nii.ac.jp/naid/110002934289

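An Important-Sentence Extraction System Using Theme-Focus Links (主題・焦点リンクを用いた重要文抽出システム)

Authors: 横山晶一, 菅野崇, 西原典孝
Publisher: 一般社団法人情報処理学会
Journal: 情報処理学会研究報告自然言語処理(NL) (ISSN: 0919-6072)
Volume, issue, pages, date: vol.2003, no.76, pp.1-6, 2003-07-25
Cited by: 1 1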

Extracting information from large numbers of texts and extracting important sentences for text summarization are tasks for which automation is desired and for which a variety of methods have been proposed. We previously built a keyword extraction system using themes and focuses and confirmed that the extracted keywords are effective. This study describes an important-sentence extraction system that uses the keywords extracted by that method. Several earlier studies use extracted keywords to weight sentences. Our system differs in that the proportion of weighted sentences within the text to be summarized is specified as the base summarization rate, and important sentences are extracted by complementing those sentences with sentences reached through theme-focus links. We present important-sentence extraction results and show the effectiveness of the method with a small-scale evaluation.

https://ci.nii.ac.jp/naid/110002911649

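Converting Tabular Data in HTML and Its Application to Display on Mobile Terminals (HTMLの表形式データの変換と携帯端末表示への応用)

Authors: 塚本修一, 増田英孝, 中川裕志
Publisher: 一般社団法人情報処理学会
Journal: 情報処理学会研究報告自然言語処理(NL) (ISSN: 0919-6072)
Volume, issue, pages, date: vol.2002, no.87, pp.35-42, 2002-09-17
Cited by: 1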

To recognize the structure of tabular data in HTML and convert it for later use, we implemented a system that recognizes the boundary between a table's item names and item data. Tables organize data and make it easier to read. However, when an HTML table is displayed on a small, low-resolution screen such as that of a mobile terminal, the item names disappear from view when the user scrolls. Ruled lines also restrict the display area, and wrapping in the middle of words lowers readability. As a basic technique for outputting table data in the form a user requests, we therefore propose an algorithm that recognizes the structure of HTML tables. The proposed method relies on the similarity between adjacent rows or columns: where similarity is low, it recognizes a break in content between those rows or columns, which is likely the boundary between item-name rows (or columns) and item-data rows (or columns). Applying the algorithm to table data on real Web pages gave a recognition rate of about 80%.
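
The idea of locating the item-name boundary from the similarity of adjacent rows can be sketched as follows. The abstract does not specify the similarity measure, so this example assumes a deliberately simple one: each cell is reduced to a coarse type (number, text, or empty) and two rows are compared by the fraction of columns whose types match. The names `cell_type`, `row_similarity`, `header_boundary`, and the `threshold` parameter are hypothetical.

```python
def cell_type(cell):
    """Coarse type of a table cell: 'empty', 'number', or 'text'."""
    s = cell.strip()
    if not s:
        return "empty"
    try:
        float(s.replace(",", ""))
        return "number"
    except ValueError:
        return "text"


def row_similarity(row_a, row_b):
    """Fraction of aligned columns whose coarse cell types agree."""
    pairs = list(zip(row_a, row_b))
    if not pairs:
        return 1.0
    same = sum(cell_type(a) == cell_type(b) for a, b in pairs)
    return same / len(pairs)


def header_boundary(rows, threshold=0.5):
    """Return i such that rows[:i] are treated as item-name (header) rows.

    The boundary is placed at the least similar adjacent pair of rows.
    Returns 0 when even that pair is at least `threshold` similar,
    i.e. when no clear break is found.
    """
    sims = [row_similarity(rows[i], rows[i + 1]) for i in range(len(rows) - 1)]
    if not sims:
        return 0
    lowest = min(range(len(sims)), key=sims.__getitem__)
    if sims[lowest] >= threshold:
        return 0
    return lowest + 1


if __name__ == "__main__":
    table = [
        ["Item", "Price", "Stock"],
        ["Pen", "120", "30"],
        ["Notebook", "250", "12"],
    ]
    print(header_boundary(table))
```

The same functions can be applied to columns by transposing the table first, for example with `list(zip(*table))`, which matches the abstract's note that the boundary may run between rows or between columns.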

https://ci.nii.ac.jp/naid/110002934364

Statistical Syntactic Parsing That Does Not Rely on Commas (読点に頼らない統計的構文解析)

Author: 金山博
Publisher: 一般社団法人情報処理学会
Journal: 情報処理学会研究報告自然言語処理(NL) (ISSN: 0919-6072)
Volume, issue, pages, date: vol.2005, no.117, pp.61-66, 2005-11-21

In statistical syntactic parsing of Japanese, lexical differences among content words are not sufficiently reflected in the statistical model, mainly because of data sparseness, and this causes parsing errors in dependencies between bunsetsus whose resolution requires lexical information. Based on the assumption that existing statistical parsers rely too heavily on commas, this paper builds a statistical model trained with commas ignored, aiming to improve the attachment of particle phrases that depend on predicates. The proposed method increases the usefulness of features that distinguish among content words and makes the parser more robust to sentences with unnatural comma placement.

https://ci.nii.ac.jp/naid/110002973359