BioDArt - Catalogue of biological data artifact examples

Anitha Veeramani, Kavitha Gopalakrishnan, Vladimir Brusic, Judice L Y Koh

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

2 Citations (Scopus)

Abstract

Information in biological data repositories continues to grow exponentially as the number of genomic and proteomic sequencing projects increases. As with any database, these repositories are subject to data quality issues related to correctness, uniformity, completeness, and redundancy, among others. Data cleaning is a prerequisite for preventing low-quality data from interfering with the accuracy of data mining and analysis. It involves the detection and resolution of data artifacts (errors, discrepancies, redundancies, ambiguities, and incompleteness). Understanding the causes of data artifacts and systematically classifying them are critical to eliminating them from molecular sequence databases. This paper highlights eight data artifacts found among public molecular databases. Examples of records from major molecular sequence databases that contain these artifacts are collected in the BioDArt catalogue (http://antigen.i2r.a-star.edu.sg/BioDArt).

Original language: English
Title of host publication: ICBPE 2006 - Proceedings of the 2006 International Conference on Biomedical and Pharmaceutical Engineering
Pages: 324-329
Number of pages: 6
DOIs: https://doi.org/10.1109/ICBPE.2006.348608
Publication status: Published - 2006
Externally published: Yes
Event: ICBPE 2006 - 2006 International Conference on Biomedical and Pharmaceutical Engineering - Singapore, Singapore
Duration: Dec 11 2006 to Dec 14 2006

Other

Other: ICBPE 2006 - 2006 International Conference on Biomedical and Pharmaceutical Engineering
Country: Singapore
City: Singapore
Period: 12/11/06 to 12/14/06

Fingerprint

Chemical Databases
Artifacts
Redundancy
Data Mining
Antigens
Proteomics
Stars
Cleaning
Databases
Data Accuracy

Keywords

  • Data artifacts
  • Data cleaning
  • Data quality

ASJC Scopus subject areas

  • Biomedical Engineering
  • Pharmacology (medical)
  • Pharmacology, Toxicology and Pharmaceutics (all)

Cite this

Veeramani, A., Gopalakrishnan, K., Brusic, V., & Koh, J. L. Y. (2006). BioDArt - Catalogue of biological data artifact examples. In ICBPE 2006 - Proceedings of the 2006 International Conference on Biomedical and Pharmaceutical Engineering (pp. 324-329). [4155917] https://doi.org/10.1109/ICBPE.2006.348608

BioDArt - Catalogue of biological data artifact examples. / Veeramani, Anitha; Gopalakrishnan, Kavitha; Brusic, Vladimir; Koh, Judice L Y.

ICBPE 2006 - Proceedings of the 2006 International Conference on Biomedical and Pharmaceutical Engineering. 2006. p. 324-329 4155917.

Veeramani, A, Gopalakrishnan, K, Brusic, V & Koh, JLY 2006, BioDArt - Catalogue of biological data artifact examples. in ICBPE 2006 - Proceedings of the 2006 International Conference on Biomedical and Pharmaceutical Engineering., 4155917, pp. 324-329, ICBPE 2006 - 2006 International Conference on Biomedical and Pharmaceutical Engineering, Singapore, Singapore, 12/11/06. https://doi.org/10.1109/ICBPE.2006.348608
Veeramani A, Gopalakrishnan K, Brusic V, Koh JLY. BioDArt - Catalogue of biological data artifact examples. In ICBPE 2006 - Proceedings of the 2006 International Conference on Biomedical and Pharmaceutical Engineering. 2006. p. 324-329. 4155917 https://doi.org/10.1109/ICBPE.2006.348608
Veeramani, Anitha ; Gopalakrishnan, Kavitha ; Brusic, Vladimir ; Koh, Judice L Y. / BioDArt - Catalogue of biological data artifact examples. ICBPE 2006 - Proceedings of the 2006 International Conference on Biomedical and Pharmaceutical Engineering. 2006. pp. 324-329
@inproceedings{460d15f66c704befa67dca1b79a7f4fa,
title = "BioDArt - Catalogue of biological data artifact examples",
abstract = "Information in biological data repositories continues to grow exponentially due to the increasing genomic and proteomic sequencing projects. As with any database, these data repositories are subjected to data quality issues related to correctness, uniformity, completeness, redundancy, among others. Data cleaning is a prerequisite to prevent the interference of low quality data with the accuracy of data mining and analysis. This in turn involves the detection and resolution of data artifacts (errors, discrepancies, redundancies, ambiguities, and incompleteness). Understanding the causes of data artifacts and systematically classifying them are critical towards their elimination in molecular sequence databases. This paper highlights eight data artifacts found among public molecular databases. Examples of major molecular sequence database records containing these artifacts are collected into the BioDArt catalogue (http://antigen.i2r.a-star.edu.sg/BioDArt).",
keywords = "Data artifacts, Data cleaning, Data quality",
author = "Anitha Veeramani and Kavitha Gopalakrishnan and Vladimir Brusic and Koh, {Judice L Y}",
year = "2006",
doi = "10.1109/ICBPE.2006.348608",
language = "English",
isbn = "8190426249",
pages = "324--329",
booktitle = "ICBPE 2006 - Proceedings of the 2006 International Conference on Biomedical and Pharmaceutical Engineering",

}

TY - GEN

T1 - BioDArt - Catalogue of biological data artifact examples

AU - Veeramani, Anitha

AU - Gopalakrishnan, Kavitha

AU - Brusic, Vladimir

AU - Koh, Judice L Y

PY - 2006

Y1 - 2006

AB - Information in biological data repositories continues to grow exponentially due to the increasing genomic and proteomic sequencing projects. As with any database, these data repositories are subjected to data quality issues related to correctness, uniformity, completeness, redundancy, among others. Data cleaning is a prerequisite to prevent the interference of low quality data with the accuracy of data mining and analysis. This in turn involves the detection and resolution of data artifacts (errors, discrepancies, redundancies, ambiguities, and incompleteness). Understanding the causes of data artifacts and systematically classifying them are critical towards their elimination in molecular sequence databases. This paper highlights eight data artifacts found among public molecular databases. Examples of major molecular sequence database records containing these artifacts are collected into the BioDArt catalogue (http://antigen.i2r.a-star.edu.sg/BioDArt).

KW - Data artifacts

KW - Data cleaning

KW - Data quality

UR - http://www.scopus.com/inward/record.url?scp=46249105561&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=46249105561&partnerID=8YFLogxK

U2 - 10.1109/ICBPE.2006.348608

DO - 10.1109/ICBPE.2006.348608

M3 - Conference contribution

SN - 8190426249

SN - 9788190426244

SP - 324

EP - 329

BT - ICBPE 2006 - Proceedings of the 2006 International Conference on Biomedical and Pharmaceutical Engineering

ER -