- Authors
- OHTA Manabu, TAKASU Atsuhiro, ADACHI Jun
- Publisher
- The Institute of Electronics, Information and Communication Engineers (IEICE)
- Journal
- IEICE Transactions on Information and Systems (ISSN: 0916-8532)
- Volume, issue, pages, and publication date
- vol.86, no.9, pp.1835-1844, 2003-09-01
Misrecognition by optical character readers (OCR) is a serious problem when searching OCR-scanned documents in databases such as digital libraries. To avoid the cost of correcting recognition errors manually, this paper proposes fuzzy retrieval methods for English text that contains such errors. The proposed methods generate multiple search terms for each input query term using probabilistic automata that reflect both error-occurrence probabilities and character-connection probabilities. In test-set retrieval experiments, one of the proposed methods improves the recall rate from 95.96% to 98.15% at the cost of a decrease in precision from 100.0% to 96.01% with 20 expanded search terms.
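
The idea of expanding one query term into several likely OCR renderings can be illustrated with a minimal sketch. This is not the authors' probabilistic automaton: it uses only a per-character confusion table, omits character-connection probabilities entirely, and every name and probability value below (`CONFUSIONS`, `expand_term`, the table entries) is a hypothetical placeholder chosen for illustration.

```python
# Hypothetical confusion table: for each true character, the strings an OCR
# engine might output and an assumed probability for each. The values are
# illustrative only and are not taken from the paper.
CONFUSIONS = {
    "m": [("m", 0.90), ("rn", 0.10)],
    "l": [("l", 0.85), ("1", 0.10), ("I", 0.05)],
    "e": [("e", 0.95), ("c", 0.05)],
}


def expand_term(term, k, confusions=CONFUSIONS):
    """Return up to k (search_term, probability) pairs, most probable first.

    Each character is treated independently; characters missing from the
    table are assumed to be recognized correctly with probability 1.0.
    """
    beam = [(1.0, "")]  # (probability so far, output string so far)
    for ch in term:
        options = confusions.get(ch, [(ch, 1.0)])
        extended = [
            (prob * p_err, prefix + out)
            for prob, prefix in beam
            for out, p_err in options
        ]
        # Keep only the k most probable partial renderings.
        extended.sort(key=lambda cand: cand[0], reverse=True)
        beam = extended[:k]
    return [(text, prob) for prob, text in beam]


if __name__ == "__main__":
    for text, prob in expand_term("model", 3):
        print(f"{text}\t{prob:.5f}")
```

Under these assumed probabilities, the query term `model` expands to `model`, `mode1`, and `rnodel`. Submitting all expanded terms to the search engine as alternatives is what lets a query match documents whose text was misrecognized, which is the source of the recall gain and the precision loss reported in the abstract.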