- Authors
- 河本 哲, 秋光 淳生, 浅井 紀久夫
- Publisher
- The Japanese Society for Artificial Intelligence
- Journal
- Transactions of the Japanese Society for Artificial Intelligence (ISSN:13460714)
- Volume, issue, pages, and publication date
- vol.38, no.3, pp.D-M51_1-14, 2023-05-01 (Released:2023-05-01)
- Number of references
- 31
In Internet advertising, text is added to make an ad more appealing to viewers. However, some advertising documents contain inappropriate expressions. Wording that exaggerates the efficacy of a product, or that presents a product as recommended by a medical professional, may violate the Pharmaceutical Affairs Law and the Act against Unjustifiable Premiums and Misleading Representations. A system that can detect problematic advertisements effectively and quickly is therefore required. Some advertisements cannot be classified properly from word statistics alone, so information beyond word statistics must be embedded in the document vector. The advertising documents targeted in this study have characteristics such as “biases in the word positions of specific words” and “periodic occurrence of specific words.” Words that appear frequently in problematic documents (especially cosmetics advertisements) are strongly biased in their positions, resulting in a complex multimodal distribution of positions of occurrence. Embedding word order information and word period information in document vectors is therefore considered very effective for identifying problematic advertising documents.

In recent years, the effectiveness of the BERT model has been recognized in various natural language processing tasks. However, faster models are required for application to Internet advertising. To achieve both inference speed and discrimination performance, we propose a document feature based on the discrete Fourier transform (DFT) of word vectors weighted by an index previously proposed in a study that attempted to categorize Chinese Internet advertisements.
In addition, we employed complex-valued support vector machines as the discriminative model, which can handle complex numbers and generalize well even with small amounts of data.

Although the discrimination performance of the proposed model is somewhat inferior to that of ALBERT and BERT, it is higher than that of DistilBERT, XGBoost, and LightGBM. The inference speed of the proposed model is somewhat slower than that of XGBoost and LightGBM and needs improvement, but it is faster than that of DistilBERT. These results indicate that the proposed model is promising for use on the Internet. We also found that when the index proposed in the previous study on Chinese advertisements was applied to Japanese advertisements, it emphasized the word vectors of specific nouns and verbs.
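
The idea of a DFT-based document feature can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not define the weighting index, so `weights` below is a hypothetical stand-in, and the function name `dft_document_feature`, the length normalization, and the choice to keep only the first `n_freq` coefficients are all assumptions made for illustration.

```python
import numpy as np

def dft_document_feature(word_vectors, weights, n_freq=4):
    """Illustrative sketch only (not the paper's exact method).

    word_vectors: (n_words, dim) array, one row per word in document order.
    weights:      (n_words,) per-word weights; a hypothetical stand-in for
                  the index used in the paper, which the abstract does not define.
    n_freq:       number of leading DFT coefficients to keep (assumed).
    Returns a complex vector of length n_freq * dim.
    """
    word_vectors = np.asarray(word_vectors, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # Scale each word vector by its weight.
    weighted = word_vectors * weights[:, None]

    # DFT along the word-position axis, so each embedding dimension is
    # treated as a signal over positions in the document.
    spectrum = np.fft.fft(weighted, axis=0)

    # Assumption: divide by document length so that documents of
    # different lengths give comparable magnitudes.
    spectrum = spectrum / weighted.shape[0]

    # Keep the first n_freq coefficients; zero-pad short documents.
    kept = np.zeros((n_freq, word_vectors.shape[1]), dtype=complex)
    m = min(n_freq, spectrum.shape[0])
    kept[:m] = spectrum[:m]
    return kept.ravel()

# Toy document: a 1-dimensional "word vector" that is active at every
# second position, mimicking a periodically occurring word.
toy_vectors = np.array([[1.0], [0.0]] * 4)   # 8 positions
toy_weights = np.ones(8)                      # uniform weights (assumed)
feature = dft_document_feature(toy_vectors, toy_weights, n_freq=5)
print(np.abs(feature))
```

In this toy case the magnitude is concentrated at frequency indices 0 and 4, which is how a periodic pattern in word positions becomes visible in the feature. The resulting vector is complex-valued, which is why a classifier that accepts complex inputs, such as the complex-valued support vector machine named in the abstract, is a natural fit.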