Identification of "pathologs" (disease-related genes) from the RIKEN mouse cDNA dataset using human curation plus FACTS, a new biological information extraction system

Diego G. Silva, Christian Schönbach, Vladimir Brusic, Luis A. Socha, Takeshi Nagashima, Nikolai Petrovsky

Research output: Contribution to journal (Article)

6 Citations (Scopus)

Abstract

Background. A major goal in the post-genomic era is to identify and characterise disease susceptibility genes and to apply this knowledge to disease prevention and treatment. Rodents and humans have remarkably similar genomes and share closely related biochemical, physiological and pathological pathways. In this work we used the latest information on the mouse transcriptome, as revealed by the RIKEN FANTOM2 project, to identify novel human disease-related candidate genes. We define a new term, "patholog", to mean a homolog of a human disease-related gene encoding a product (transcript, anti-sense or protein) potentially relevant to disease. Rather than focusing only on Mendelian inheritance, we applied the analysis to all potential pathologs regardless of their inheritance pattern.

Results. Bioinformatic analysis and human curation of 60,770 RIKEN full-length mouse cDNA clones produced 2,578 sequences that showed similarity (70-85% identity) to known human disease genes. Using a newly developed biological information extraction and annotation tool (FACTS) in parallel with human expert analysis of 17,051 MEDLINE scientific abstracts, we identified 182 novel potential pathologs. Of these, 36 were identified by computational tools only, 49 by human expert analysis only and 97 by both methods. These pathologs were related to neoplastic (53%), hereditary (24%), immunological (5%), cardiovascular (4%) or other (14%) disorders.

Conclusions. Large-scale genome projects continue to produce a vast amount of data with potential application to the study of human disease. For this potential to be realised we need intelligent strategies for data categorisation and the ability to link sequence data with relevant literature. This paper demonstrates the power of combining human expert annotation with FACTS, a newly developed bioinformatics tool, to identify novel pathologs from within large-scale mouse transcript datasets.
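The counts in the Results can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only and is not part of the paper: the variable names, the `share` helper and the derived percentages are assumptions built on the three reported counts (36, 49, 97) and the five reported disease-category percentages.

```python
# Counts reported in the abstract's Results section.
facts_only = 36   # identified by computational tools (FACTS) only
human_only = 49   # identified by human expert analysis only
both = 97         # identified by both methods

# The three disjoint groups should add up to the reported 182 pathologs.
total = facts_only + human_only + both


def share(count, whole):
    """Return count as a percentage of whole, rounded to one decimal place."""
    return round(100 * count / whole, 1)


# Everything each method flagged, including the overlap.
facts_total = facts_only + both
human_total = human_only + both

summary = {
    "total": total,
    "facts_share": share(facts_total, total),  # FACTS hits as % of all 182
    "human_share": share(human_total, total),  # curator hits as % of all 182
    "both_share": share(both, total),          # agreement as % of all 182
}

# Reported disease-category percentages; they should sum to 100.
categories = {
    "neoplastic": 53,
    "hereditary": 24,
    "immunological": 5,
    "cardiovascular": 4,
    "other": 14,
}
category_sum = sum(categories.values())

print(summary)
print("category percentages sum to", category_sum)
```

Under these assumptions the three groups sum to 182, FACTS alone recovers roughly 73% of the final set, human curation alone roughly 80%, and about 53% of the pathologs were found by both, which is consistent with the abstract's point that the two approaches complement each other.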

Original language: English
Article number: 28
Journal: BMC Genomics
Volume: 5
DOI: 10.1186/1471-2164-5-28
Publication status: Published - Apr 29 2004
Externally published: Yes

Fingerprint

Information Storage and Retrieval
Information Systems
Complementary DNA
Genes
Computational Biology
Genome
Datasets
Inheritance Patterns
Disease Susceptibility
Transcriptome
MEDLINE
Blood Vessels
Rodentia
Clone Cells

Keywords

  • Bioinformatics
  • Cancer
  • Disease gene
  • FANTOM database
  • Genomics
  • Hereditary disease
  • Human
  • Transcripts

ASJC Scopus subject areas

  • Medicine(all)

Cite this

Identification of "pathologs" (disease-related genes) from the RIKEN mouse cDNA dataset using human curation plus FACTS, a new biological information extraction system. / Silva, Diego G.; Schönbach, Christian; Brusic, Vladimir; Socha, Luis A.; Nagashima, Takeshi; Petrovsky, Nikolai.

In: BMC Genomics, Vol. 5, 28, 29.04.2004.


@article{489c5958ad0c467cabee3e29e37f1b56,
title = "Identification of {"}pathologs{"} (disease-related genes) from the RIKEN mouse cDNA dataset using human curation plus FACTS, a new biological information extraction system",
abstract = "Background. A major goal in the post-genomic era is to identify and characterise disease susceptibility genes and to apply this knowledge to disease prevention and treatment. Rodents and humans have remarkably similar genomes and share closely related biochemical, physiological and pathological pathways. In this work we utilised the latest information on the mouse transcriptome as revealed by the RIKEN FANTOM2 project to identify novel human disease-related candidate genes. We define a new term {"}patholog{"} to mean a homolog of a human disease-related gene encoding a product (transcript, anti-sense or protein) potentially relevant to disease. Rather than just focus on Mendelian inheritance, we applied the analysis to all potential pathologs regardless of their inheritance pattern. Results. Bioinformatic analysis and human curation of 60,770 RIKEN full-length mouse cDNA clones produced 2,578 sequences that showed similarity (70-85{\%} identity) to known human-disease genes. Using a newly developed biological information extraction and annotation tool (FACTS) in parallel with human expert analysis of 17,051 MEDLINE scientific abstracts we identified 182 novel potential pathologs. Of these, 36 were identified by computational tools only, 49 by human expert analysis only and 97 by both methods. These pathologs were related to neoplastic (53{\%}), hereditary (24{\%}), immunological (5{\%}), cardio-vascular (4{\%}), or other (14{\%}), disorders. Conclusions. Large scale genome projects continue to produce a vast amount of data with potential application to the study of human disease. For this potential to be realised we need intelligent strategies for data categorisation and the ability to link sequence data with relevant literature. This paper demonstrates the power of combining human expert annotation with FACTS, a newly developed bioinformatics tool, to identify novel pathologs from within large-scale mouse transcript datasets.",
keywords = "Bioinformatics, Cancer, Disease gene, FANTOM database, Genomics, Hereditary disease, Human, Transcripts",
author = "Silva, {Diego G.} and Christian Sch{\"o}nbach and Vladimir Brusic and Socha, {Luis A.} and Takeshi Nagashima and Nikolai Petrovsky",
year = "2004",
month = "4",
day = "29",
doi = "10.1186/1471-2164-5-28",
language = "English",
volume = "5",
journal = "BMC Genomics",
issn = "1471-2164",
publisher = "BioMed Central",
pages = "28",
}

TY - JOUR
T1 - Identification of "pathologs" (disease-related genes) from the RIKEN mouse cDNA dataset using human curation plus FACTS, a new biological information extraction system
AU - Silva, Diego G.
AU - Schönbach, Christian
AU - Brusic, Vladimir
AU - Socha, Luis A.
AU - Nagashima, Takeshi
AU - Petrovsky, Nikolai
PY - 2004/4/29
Y1 - 2004/4/29

N2 - Background. A major goal in the post-genomic era is to identify and characterise disease susceptibility genes and to apply this knowledge to disease prevention and treatment. Rodents and humans have remarkably similar genomes and share closely related biochemical, physiological and pathological pathways. In this work we utilised the latest information on the mouse transcriptome as revealed by the RIKEN FANTOM2 project to identify novel human disease-related candidate genes. We define a new term "patholog" to mean a homolog of a human disease-related gene encoding a product (transcript, anti-sense or protein) potentially relevant to disease. Rather than just focus on Mendelian inheritance, we applied the analysis to all potential pathologs regardless of their inheritance pattern. Results. Bioinformatic analysis and human curation of 60,770 RIKEN full-length mouse cDNA clones produced 2,578 sequences that showed similarity (70-85% identity) to known human-disease genes. Using a newly developed biological information extraction and annotation tool (FACTS) in parallel with human expert analysis of 17,051 MEDLINE scientific abstracts we identified 182 novel potential pathologs. Of these, 36 were identified by computational tools only, 49 by human expert analysis only and 97 by both methods. These pathologs were related to neoplastic (53%), hereditary (24%), immunological (5%), cardio-vascular (4%), or other (14%), disorders. Conclusions. Large scale genome projects continue to produce a vast amount of data with potential application to the study of human disease. For this potential to be realised we need intelligent strategies for data categorisation and the ability to link sequence data with relevant literature. This paper demonstrates the power of combining human expert annotation with FACTS, a newly developed bioinformatics tool, to identify novel pathologs from within large-scale mouse transcript datasets.

AB - Background. A major goal in the post-genomic era is to identify and characterise disease susceptibility genes and to apply this knowledge to disease prevention and treatment. Rodents and humans have remarkably similar genomes and share closely related biochemical, physiological and pathological pathways. In this work we utilised the latest information on the mouse transcriptome as revealed by the RIKEN FANTOM2 project to identify novel human disease-related candidate genes. We define a new term "patholog" to mean a homolog of a human disease-related gene encoding a product (transcript, anti-sense or protein) potentially relevant to disease. Rather than just focus on Mendelian inheritance, we applied the analysis to all potential pathologs regardless of their inheritance pattern. Results. Bioinformatic analysis and human curation of 60,770 RIKEN full-length mouse cDNA clones produced 2,578 sequences that showed similarity (70-85% identity) to known human-disease genes. Using a newly developed biological information extraction and annotation tool (FACTS) in parallel with human expert analysis of 17,051 MEDLINE scientific abstracts we identified 182 novel potential pathologs. Of these, 36 were identified by computational tools only, 49 by human expert analysis only and 97 by both methods. These pathologs were related to neoplastic (53%), hereditary (24%), immunological (5%), cardio-vascular (4%), or other (14%), disorders. Conclusions. Large scale genome projects continue to produce a vast amount of data with potential application to the study of human disease. For this potential to be realised we need intelligent strategies for data categorisation and the ability to link sequence data with relevant literature. This paper demonstrates the power of combining human expert annotation with FACTS, a newly developed bioinformatics tool, to identify novel pathologs from within large-scale mouse transcript datasets.

KW - Bioinformatics
KW - Cancer
KW - Disease gene
KW - FANTOM database
KW - Genomics
KW - Hereditary disease
KW - Human
KW - Transcripts
UR - http://www.scopus.com/inward/record.url?scp=2442719005&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=2442719005&partnerID=8YFLogxK
U2 - 10.1186/1471-2164-5-28
DO - 10.1186/1471-2164-5-28
M3 - Article
VL - 5
JO - BMC Genomics
JF - BMC Genomics
SN - 1471-2164
M1 - 28
ER -