Authors
Christin Rakers, Daniel Reker, J.B. Brown
Publisher
Division of Chemical Information and Computer Sciences, The Chemical Society of Japan
Journal
Journal of Computer Aided Chemistry (ISSN:13458647)
Volume, pages, and publication date
vol.18, pp.124-142, 2017 (Released:2017-08-01)
Number of references
72
Number of citing works
14

The identification of new compound-protein interactions has long been a fundamental quest in medicinal chemistry. With increasing amounts of biochemical data, advanced machine learning techniques such as active learning have proven beneficial for building high-performance prediction models from subsets of such complex data. In a recently published paper, chemogenomic active learning was applied to the interaction spaces of kinases and G protein-coupled receptors, featuring over 150,000 compound-protein interactions. Prediction models were actively trained using random forest classification with 500 decision trees per experiment. In a new direction for chemogenomic active learning, we address the question of how forest size influences model evolution and performance. The original chemogenomic active learning study found that highly predictive models could be constructed from a small fraction of the available data. We find here that model complexity, as measured by forest size, can in addition be reduced to one-fourth or one-fifth of the previously investigated forest size (that is, to 125 or 100 trees instead of 500) while still maintaining reliable prediction performance. Thus, chemogenomic active learning can yield predictive models with reduced complexity based on only a fraction of the data available for model construction.
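
To make the idea of comparing forest sizes within an active learning loop concrete, the following is a minimal sketch and not the authors' implementation. It makes several assumptions that do not come from the paper: the data are synthetic two-dimensional points rather than compound-protein descriptors, each "tree" is a one-split decision stump trained on a bootstrap sample (a simplified stand-in for a full random forest), the next example is picked where the ensemble vote is closest to an even split, and the forest sizes (5 and 25), pool size, and number of rounds are arbitrary small values chosen so the example runs quickly. All function names are hypothetical.

```python
import random

def fit_stump(X, y, rng):
    """Fit a one-split 'tree' on a bootstrap sample, using one random feature."""
    n = len(X)
    sample = [rng.randrange(n) for _ in range(n)]   # bootstrap sample of row indices
    feat = rng.randrange(len(X[0]))                 # random feature choice
    best = None
    for i in sample:
        thr = X[i][feat]
        for sign in (1, -1):                        # try both split directions
            errors = sum((sign * (X[j][feat] - thr) > 0) != bool(y[j]) for j in sample)
            if best is None or errors < best[0]:
                best = (errors, feat, thr, sign)
    return best[1:]                                 # (feature, threshold, sign)

def fit_forest(X, y, n_trees, rng):
    return [fit_stump(X, y, rng) for _ in range(n_trees)]

def vote_fraction(forest, x):
    """Fraction of stumps voting for the positive class."""
    votes = sum(sign * (x[feat] - thr) > 0 for feat, thr, sign in forest)
    return votes / len(forest)

def accuracy(forest, X, y):
    hits = sum((vote_fraction(forest, x) > 0.5) == bool(label) for x, label in zip(X, y))
    return hits / len(X)

def active_learning(pool_X, pool_y, n_trees, n_rounds, seed=0):
    """Grow the labeled set one example per round, then train a final forest."""
    rng = random.Random(seed)
    labeled = rng.sample(range(len(pool_X)), 4)     # small random starting set
    for _ in range(n_rounds):
        X = [pool_X[i] for i in labeled]
        y = [pool_y[i] for i in labeled]
        forest = fit_forest(X, y, n_trees, rng)
        known = set(labeled)
        unlabeled = [i for i in range(len(pool_X)) if i not in known]
        # Assumed selection rule: pick the example the ensemble is least sure about.
        pick = min(unlabeled, key=lambda i: abs(vote_fraction(forest, pool_X[i]) - 0.5))
        labeled.append(pick)
    X = [pool_X[i] for i in labeled]
    y = [pool_y[i] for i in labeled]
    return fit_forest(X, y, n_trees, rng), labeled

# Synthetic pool: label is 1 when the two coordinates sum to more than 1.
data_rng = random.Random(42)
pool_X = [(data_rng.random(), data_rng.random()) for _ in range(200)]
pool_y = [int(a + b > 1.0) for a, b in pool_X]

results = {}
for n_trees in (5, 25):
    forest, labeled = active_learning(pool_X, pool_y, n_trees, n_rounds=20)
    # Accuracy is measured on the whole pool, a simplification for this sketch.
    results[n_trees] = accuracy(forest, pool_X, pool_y)
    print(n_trees, "trees,", len(labeled), "labeled examples, accuracy", round(results[n_trees], 3))
```

Running the loop for each forest size on the same pool gives a side-by-side view of how ensemble size affects the model obtained from the same small labeling budget, which is the kind of comparison the abstract describes at much larger scale.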