Cross-validation under separate sampling: Strong bias and how to correct it

Ulisses M. Braga-Neto, Amin Zollanvari, Edward R. Dougherty

Research output: Contribution to journal › Article

12 Citations (Scopus)

Abstract

Motivation: It is commonly assumed in pattern recognition that cross-validation error estimation is 'almost unbiased' as long as the number of folds is not too small. While this is true for random sampling, it is not true with separate sampling, where the populations are independently sampled, which is a common situation in bioinformatics.

Results: We demonstrate, via analytical and numerical methods, that classical cross-validation can have strong bias under separate sampling, depending on the difference between the sampling ratios and the true population probabilities. We propose a new separate-sampling cross-validation error estimator, and prove that it satisfies an 'almost unbiased' theorem similar to that of random-sampling cross-validation. We present two case studies with previously published data, which show that the results can change drastically if the correct form of cross-validation is used.
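The dependence on "the difference between the sampling ratios and the true population probabilities" can be illustrated with a small sketch. It is not the estimator defined in the article; it only shows one ingredient, namely that pooling held-out mistakes implicitly weights each class by its share of the sample, whereas an estimate can instead weight class-specific error rates by class priors that are assumed to be known. The function names, the mistake counts, and the prior values below are all hypothetical.

```python
def pooled_error(mistakes, counts):
    """Classical pooled estimate: all held-out mistakes over all held-out points.

    This implicitly weights each class by its sampling ratio n_i / n.
    """
    return sum(mistakes) / sum(counts)


def prior_weighted_error(mistakes, counts, priors):
    """Weight each class-specific held-out error rate by an assumed class prior."""
    if abs(sum(priors) - 1.0) > 1e-9:
        raise ValueError("priors must sum to 1")
    return sum(c * m / n for c, m, n in zip(priors, mistakes, counts))


# Hypothetical study: 50 controls and 50 cases are sampled separately
# (sampling ratio 0.5), but the assumed true prevalence of cases is 0.1.
mistakes = [5, 20]    # held-out mistakes for class 0 (controls), class 1 (cases)
counts = [50, 50]     # held-out points per class
priors = [0.9, 0.1]   # assumed true class probabilities (not estimable from the sample)

classical = pooled_error(mistakes, counts)
weighted = prior_weighted_error(mistakes, counts, priors)
print(f"pooled: {classical:.2f}, prior-weighted: {weighted:.2f}")
```

With these made-up numbers the pooled estimate is 0.25 and the prior-weighted estimate is 0.13, so the gap grows as the sampling ratio moves away from the assumed prior. Under separate sampling the priors cannot be estimated from the class counts and have to come from outside knowledge.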

Original language: English
Pages (from-to): 3349-3355
Number of pages: 7
Journal: Bioinformatics
Volume: 30
Issue number: 23
DOIs: 10.1093/bioinformatics/btu527
Publication status: Published - 2014
Externally published: Yes

Fingerprint

Selection Bias
Cross-validation
Sampling
Computational Biology
Random Sampling
Population
Error Estimator
Error Estimation
Analytical Methods
Pattern Recognition
Bioinformatics
Fold
Error analysis
Numerical Methods
Theorem
Demonstrate

ASJC Scopus subject areas

  • Biochemistry
  • Molecular Biology
  • Computational Theory and Mathematics
  • Computer Science Applications
  • Computational Mathematics
  • Statistics and Probability

Cite this

Cross-validation under separate sampling: Strong bias and how to correct it. / Braga-Neto, Ulisses M.; Zollanvari, Amin; Dougherty, Edward R.

In: Bioinformatics, Vol. 30, No. 23, 2014, p. 3349-3355.

Braga-Neto, Ulisses M.; Zollanvari, Amin; Dougherty, Edward R. / Cross-validation under separate sampling: Strong bias and how to correct it. In: Bioinformatics. 2014; Vol. 30, No. 23. pp. 3349-3355.
@article{45c87e868488492299a1a2fcbd62cb68,
title = "Cross-validation under separate sampling: Strong bias and how to correct it",
author = "Braga-Neto, {Ulisses M.} and Zollanvari, Amin and Dougherty, {Edward R.}",
year = "2014",
doi = "10.1093/bioinformatics/btu527",
language = "English",
volume = "30",
pages = "3349--3355",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",
number = "23",

}

TY - JOUR
T1 - Cross-validation under separate sampling
T2 - Strong bias and how to correct it
AU - Braga-Neto, Ulisses M.
AU - Zollanvari, Amin
AU - Dougherty, Edward R.
PY - 2014
Y1 - 2014
UR - http://www.scopus.com/inward/record.url?scp=84929116149&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84929116149&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/btu527
DO - 10.1093/bioinformatics/btu527
M3 - Article
VL - 30
SP - 3349
EP - 3355
JO - Bioinformatics
JF - Bioinformatics
SN - 1367-4803
IS - 23
ER -