Authors
Aaron J. Stokes, Hideo Matsuda, Akihiro Hashimoto
Journal
情報処理学会論文誌データベース(TOD) (ISSN:18827799)
Volume, issue, pages, and publication date
vol.40, no.SIG06(TOD3), pp.66-78, 1999-08-15

Complete DNA sequences (complete genomes) for an increasing number of organisms are becoming available each year for use in biological research. However, genome project groups use their own formats (or schemas) for representing the genome data accumulated by the projects. This heterogeneity of schemas prevents researchers from exchanging and comparing data across genomes. In this paper, we present a new method for exchanging and querying information on complete genomes. Since genomes and the genetic information encoded on them have a hierarchical structure, they can be represented as a kind of structured document. We propose a document language called GXML for representing complete genomes. The document language, which is based on XML, can be used to exchange many kinds of genomic data and offers a high degree of extensibility. We also define a query language called GQL to operate on the genome documents. Using this language, one can easily associate genes among different genomes and perform other biological analyses. We developed a prototype system based on the language. Using the system, we executed several test queries. The results were consistent with those published in the biological literature. The processor and memory requirements of the prototype system were acceptable.
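
As a rough illustration of treating a genome as a hierarchical structured document, the following Python sketch parses two small XML documents and pairs genes across genomes by a shared annotation. The element and attribute names (`genome`, `gene`, `product`), the toy data, and the helper functions are hypothetical; they are not taken from GXML or GQL, whose actual syntax is not given in this abstract.

```python
import xml.etree.ElementTree as ET

# Toy documents with made-up tags and attributes (NOT the real GXML schema).
DOC_A = """
<genome organism="organism A">
  <gene id="a1" product="DNA polymerase III"/>
  <gene id="a2" product="ATP synthase subunit alpha"/>
</genome>
"""

DOC_B = """
<genome organism="organism B">
  <gene id="b1" product="ATP synthase subunit alpha"/>
  <gene id="b2" product="flagellin"/>
</genome>
"""


def genes_by_product(xml_text):
    """Map each gene's product annotation to its gene id."""
    root = ET.fromstring(xml_text.strip())
    return {g.get("product"): g.get("id") for g in root.iter("gene")}


def associate_genes(xml_a, xml_b):
    """Pair genes from two genome documents that share a product annotation."""
    a = genes_by_product(xml_a)
    b = genes_by_product(xml_b)
    return sorted((a[p], b[p]) for p in a.keys() & b.keys())


if __name__ == "__main__":
    print(associate_genes(DOC_A, DOC_B))
```

Matching on an identical annotation string is only a stand-in for the gene association the abstract mentions; a real system would presumably rely on sequence similarity or curated identifiers.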
Authors
Tomoshige Ohno, Shigeto Seno, Yoichi Takenaka, Hideo Matsuda
Journal
研究報告バイオ情報学(BIO)
Volume, issue, pages, and publication date
vol.2012, no.13, pp.1-7, 2012-06-21

Alternative splicing plays an important role in eukaryotic gene expression by producing diverse proteins from a single gene. Predicting how genes are transcribed is of great biological interest. To this end, massively parallel whole transcriptome sequencing, often referred to as RNA-Seq, is becoming widely used and is revolutionizing the cataloging of isoforms using a vast number of short mRNA fragments called reads. Conventional RNA-Seq analysis methods typically align reads onto a reference genome (mapping) in order to determine, from an RNA-Seq dataset, which isoforms each gene yields and how much of each isoform is expressed. However, a considerable number of reads cannot be mapped uniquely. These so-called multireads, which map to multiple locations because of short read length and similar sequences, increase the uncertainty as to how genes are transcribed. This causes inaccurate gene expression estimates and leads to incorrect isoform prediction. To cope with this problem, we propose a method for isoform prediction by iterative mapping. The positions from which multireads originate can be estimated from expression levels, whereas quantification of isoform-level expression requires accurate mapping. These procedures are mutually dependent, and therefore remapping reads is essential. By iterating this cycle, our method estimates gene expression levels more precisely and hence improves predictions of alternative splicing. Our method simultaneously estimates isoform-level expression by computing how many reads originate from each candidate isoform, using an EM algorithm within each gene. To validate the effectiveness of the proposed method, we compared its performance with conventional methods using an RNA-Seq dataset derived from a human brain.
The proposed method had a precision of 66.7% and outperformed conventional methods in terms of the isoform detection rate.
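
The mutual dependence between read assignment and expression estimation can be sketched with a generic EM loop. The Python code below is a simplified illustration, not the authors' implementation: it ignores isoform length and the remapping step, and the function name, input format, and toy reads are all assumptions made for this example.

```python
def em_isoform_abundance(read_compat, n_isoforms, n_iter=100):
    """Estimate relative isoform abundances from ambiguous read assignments.

    read_compat: list of lists; each inner list holds the indices of the
    isoforms a read is compatible with (a multiread lists more than one).
    Simplifying assumption: isoform lengths are ignored.
    """
    theta = [1.0 / n_isoforms] * n_isoforms  # start from uniform abundances
    for _ in range(n_iter):
        counts = [0.0] * n_isoforms
        # E-step: split each read among its compatible isoforms in
        # proportion to the current abundance estimates.
        for isoforms in read_compat:
            total = sum(theta[i] for i in isoforms)
            if total == 0.0:
                continue  # read has no isoform with nonzero abundance
            for i in isoforms:
                counts[i] += theta[i] / total
        # M-step: abundances become the normalized expected read counts.
        assigned = sum(counts)
        theta = [c / assigned for c in counts]
    return theta


if __name__ == "__main__":
    # Toy data: three reads unique to isoform 0, one unique to isoform 1,
    # and two multireads compatible with both.
    reads = [[0], [0], [0], [1], [0, 1], [0, 1]]
    print(em_isoform_abundance(reads, n_isoforms=2))
```

In this toy case the estimates converge to roughly 0.75 and 0.25, so the multireads are attributed mostly to the isoform that the unique reads already support.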