The illusion of distribution-free small-sample classification in genomics

Edward R. Dougherty, Amin Zollanvari, Ulisses M. Braga-Neto

Research output: Contribution to journal › Article

38 Citations (Scopus)

Abstract

Classification has emerged as a major area of investigation in bioinformatics owing to the desire to discriminate phenotypes, in particular, disease conditions, using high-throughput genomic data. While many classification rules have been posed, there is a paucity of error estimation rules and an even greater paucity of theory concerning error estimation accuracy. This is problematic because the worth of a classifier depends mainly on its error rate. It is commonplace in bioinformatics papers to have a classification rule applied to a small labeled data set and the error of the resulting classifier estimated on the same data set, most often via cross-validation, without any assumptions being made on the underlying feature-label distribution. Concomitant with a lack of distributional assumptions is the absence of any statement regarding the accuracy of the error estimate. Without such a measure of accuracy, the most common one being the root-mean-square (RMS), the error estimate is essentially meaningless and the worth of the entire paper is questionable. The concomitance of an absence of distributional assumptions and of a measure of error estimation accuracy is assured in small-sample settings because even when distribution-free bounds exist (and that is rare), the sample sizes required under the bounds are so large as to make them useless for small samples. Thus, distributional bounds are necessary and the distributional assumptions need to be stated. Owing to the epistemological dependence of classifiers on the accuracy of their estimated errors, scientifically meaningful distribution-free classification in high-throughput, small-sample biology is an illusion.
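
The RMS mentioned in the abstract can only be computed once a feature-label distribution is assumed. The following is a minimal illustrative sketch, not taken from the paper: it assumes two equally likely Gaussian classes with identity covariance, a nearest-mean classification rule, and leave-one-out cross-validation as the error estimator. All function names and parameter values (`rms_of_loo`, `n_per_class`, `delta`, and so on) are hypothetical choices made for the example.

```python
import math
import numpy as np


def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def fit_nearest_mean(X, y):
    """Nearest-mean rule: predict class 1 when w @ x + b > 0."""
    m0 = X[y == 0].mean(axis=0)
    m1 = X[y == 1].mean(axis=0)
    w = m1 - m0
    b = -w @ (m0 + m1) / 2.0
    return w, b


def true_error(w, b, mu0, mu1):
    """Exact error of the linear classifier (w, b) under the ASSUMED model:
    classes N(mu0, I) and N(mu1, I) with equal prior probability."""
    norm = np.linalg.norm(w)
    e0 = phi((w @ mu0 + b) / norm)    # class-0 points labeled 1
    e1 = phi(-(w @ mu1 + b) / norm)   # class-1 points labeled 0
    return 0.5 * (e0 + e1)


def loo_error(X, y):
    """Leave-one-out cross-validation estimate of the error."""
    n = len(y)
    mistakes = 0
    for i in range(n):
        keep = np.arange(n) != i
        w, b = fit_nearest_mean(X[keep], y[keep])
        pred = int(X[i] @ w + b > 0)
        mistakes += int(pred != y[i])
    return mistakes / n


def rms_of_loo(n_per_class=10, dim=2, delta=1.0, reps=2000, seed=0):
    """Monte Carlo estimate of the RMS deviation between the leave-one-out
    estimate and the true error, under the assumed Gaussian model."""
    rng = np.random.default_rng(seed)
    mu0 = np.zeros(dim)
    mu1 = np.zeros(dim)
    mu1[0] = delta  # class means differ by delta along the first axis
    y = np.repeat([0, 1], n_per_class)
    squared_deviations = []
    for _ in range(reps):
        noise = rng.standard_normal((2 * n_per_class, dim))
        X = noise + np.where(y[:, None] == 1, mu1, mu0)
        w, b = fit_nearest_mean(X, y)
        deviation = loo_error(X, y) - true_error(w, b, mu0, mu1)
        squared_deviations.append(deviation ** 2)
    return math.sqrt(np.mean(squared_deviations))


if __name__ == "__main__":
    for n in (5, 10, 50):
        print(n, rms_of_loo(n_per_class=n, reps=500))
```

The true error in this sketch is available in closed form only because the distribution is specified; with real data and no model, the deviation inside `rms_of_loo` cannot be evaluated, which is the situation the abstract describes.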

Original language: English
Pages (from-to): 333-341
Number of pages: 9
Journal: Current Genomics
Volume: 12
Issue number: 5
Publication status: Published - Aug 2011
Externally published: Yes

Fingerprint

Genomics
Computational Biology
Sample Size
Phenotype
Datasets

Keywords

  • Classification
  • Epistemology
  • Error estimation
  • Genomics
  • Validation

ASJC Scopus subject areas

  • Genetics
  • Genetics (clinical)

Cite this

The illusion of distribution-free small-sample classification in genomics. / Dougherty, Edward R.; Zollanvari, Amin; Braga-Neto, Ulisses M.

In: Current Genomics, Vol. 12, No. 5, 08.2011, p. 333-341.

Dougherty, ER, Zollanvari, A & Braga-Neto, UM 2011, 'The illusion of distribution-free small-sample classification in genomics', Current Genomics, vol. 12, no. 5, pp. 333-341.
Dougherty, Edward R. ; Zollanvari, Amin ; Braga-Neto, Ulisses M. / The illusion of distribution-free small-sample classification in genomics. In: Current Genomics. 2011 ; Vol. 12, No. 5. pp. 333-341.
@article{3fad70556abf4bea9a5b6c197d9d5298,
title = "The illusion of distribution-free small-sample classification in genomics",
keywords = "Classification, Epistemology, Error estimation, Genomics, Validation",
author = "Dougherty, {Edward R.} and Amin Zollanvari and Braga-Neto, {Ulisses M.}",
year = "2011",
month = "8",
language = "English",
volume = "12",
pages = "333--341",
journal = "Current Genomics",
issn = "1389-2029",
publisher = "Bentham Science Publishers",
number = "5",

}

TY - JOUR

T1 - The illusion of distribution-free small-sample classification in genomics

AU - Dougherty, Edward R.

AU - Zollanvari, Amin

AU - Braga-Neto, Ulisses M.

PY - 2011/8

Y1 - 2011/8

KW - Classification

KW - Epistemology

KW - Error estimation

KW - Genomics

KW - Validation

UR - http://www.scopus.com/inward/record.url?scp=79961007365&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=79961007365&partnerID=8YFLogxK

M3 - Article

VL - 12

SP - 333

EP - 341

JO - Current Genomics

JF - Current Genomics

SN - 1389-2029

IS - 5

ER -