Joint sampling distribution between actual and estimated classification errors for linear discriminant analysis

Amin Zollanvari, Ulisses M. Braga-Neto, Edward R. Dougherty

Research output: Contribution to journal (Article)

26 Citations (Scopus)

Abstract

Error estimation must be used to find the accuracy of a designed classifier, an issue that is critical in biomarker discovery for disease diagnosis and prognosis in genomics and proteomics. This paper presents, for what is believed to be the first time, the analytical formulation for the joint sampling distribution of the actual and estimated errors of a classification rule. The analysis presented here concerns the linear discriminant analysis (LDA) classification rule and the resubstitution and leave-one-out error estimators, under a general parametric Gaussian assumption. Exact results are provided in the univariate case, and a simple method is suggested to obtain an accurate approximation in the multivariate case. It is also shown how these results can be applied in the computation of conditional bounds and the regression of the actual error, given the observed error estimate. In contrast to asymptotic results, the analysis presented here is applicable to finite training data. In particular, it applies in the small-sample settings commonly found in genomics and proteomics applications. Numerical examples, which include parameters estimated from actual microarray data, illustrate the analysis throughout.
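
The quantities named in the abstract can be illustrated with a small Monte Carlo sketch. This is not the paper's analytical method: it only simulates the joint behaviour of the actual error and the resubstitution and leave-one-out estimates for univariate LDA. The equal class priors, common variance, fixed per-class sample sizes, parameter values, and all function names below are assumptions made for illustration.

```python
import math
import numpy as np

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def lda_label(x, m0, m1):
    """Univariate LDA with equal priors: assign the class with the nearer sample mean."""
    return np.where(np.abs(x - m0) <= np.abs(x - m1), 0, 1)

def true_error(m0, m1, mu0, mu1, sigma):
    """Actual error of the designed classifier under the (assumed) Gaussian model."""
    t = 0.5 * (m0 + m1)                 # decision threshold
    p0_below = phi((t - mu0) / sigma)   # P(X < t | class 0)
    p1_below = phi((t - mu1) / sigma)   # P(X < t | class 1)
    if m0 <= m1:                        # class 0 region lies below t
        return 0.5 * (1.0 - p0_below) + 0.5 * p1_below
    return 0.5 * p0_below + 0.5 * (1.0 - p1_below)

def resubstitution(x0, x1):
    """Fraction of training points misclassified by the classifier they trained."""
    m0, m1 = x0.mean(), x1.mean()
    wrong = np.sum(lda_label(x0, m0, m1) != 0) + np.sum(lda_label(x1, m0, m1) != 1)
    return wrong / (len(x0) + len(x1))

def leave_one_out(x0, x1):
    """Hold out each point, redesign on the rest, test on the held-out point."""
    wrong = 0
    for i in range(len(x0)):
        wrong += int(lda_label(x0[i], np.delete(x0, i).mean(), x1.mean()) != 0)
    for i in range(len(x1)):
        wrong += int(lda_label(x1[i], x0.mean(), np.delete(x1, i).mean()) != 1)
    return wrong / (len(x0) + len(x1))

def simulate(n0=10, n1=10, mu0=0.0, mu1=1.0, sigma=1.0, trials=2000, seed=0):
    """Return an array with columns (actual, resubstitution, leave-one-out)."""
    rng = np.random.default_rng(seed)
    out = np.empty((trials, 3))
    for k in range(trials):
        x0 = rng.normal(mu0, sigma, n0)
        x1 = rng.normal(mu1, sigma, n1)
        out[k] = (true_error(x0.mean(), x1.mean(), mu0, mu1, sigma),
                  resubstitution(x0, x1),
                  leave_one_out(x0, x1))
    return out

samples = simulate()
print("means (actual, resub, loo):", samples.mean(axis=0))
print("correlation matrix:")
print(np.corrcoef(samples, rowvar=False))
```

Each row of `samples` is one draw from the simulated joint distribution, so a scatter plot or a regression of the first column on either estimator column gives an empirical counterpart to the quantities the abstract describes.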

Original language: English
Article number: 5420275
Pages (from-to): 784-804
Number of pages: 21
Journal: IEEE Transactions on Information Theory
Volume: 56
Issue number: 2
DOI: 10.1109/TIT.2009.2037034
Publication status: Published - Feb 2010
Externally published: Yes

Fingerprint

Discriminant analysis
Sampling
Biomarkers
Microarrays
Error analysis
Classifiers
Disease
Regression
Genomics
Proteomics

Keywords

  • Classification
  • Cross-validation
  • Error estimation
  • Leave-one-out
  • Linear discriminant analysis
  • Resubstitution
  • Sampling distribution

ASJC Scopus subject areas

  • Information Systems
  • Computer Science Applications
  • Library and Information Sciences

Cite this

Joint sampling distribution between actual and estimated classification errors for linear discriminant analysis. / Zollanvari, Amin; Braga-Neto, Ulisses M.; Dougherty, Edward R.

In: IEEE Transactions on Information Theory, Vol. 56, No. 2, 5420275, 02.2010, p. 784-804.

@article{bd9ae9bacc154033b6530ef6e0bd0ff4,
title = "Joint sampling distribution between actual and estimated classification errors for linear discriminant analysis",
abstract = "Error estimation must be used to find the accuracy of a designed classifier, an issue that is critical in biomarker discovery for disease diagnosis and prognosis in genomics and proteomics. This paper presents, for what is believed to be the first time, the analytical formulation for the joint sampling distribution of the actual and estimated errors of a classification rule. The analysis presented here concerns the linear discriminant analysis (LDA) classification rule and the resubstitution and leave-one-out error estimators, under a general parametric Gaussian assumption. Exact results are provided in the univariate case, and a simple method is suggested to obtain an accurate approximation in the multivariate case. It is also shown how these results can be applied in the computation of condition bounds and the regression of the actual error, given the observed error estimate. In contrast to asymptotic results, the analysis presented here is applicable to finite training data. In particular, it applies in the small-sample settings commonly found in genomics and proteomics applications. Numerical examples, which include parameters estimated from actual microarray data, illustrate the analysis throughout.",
keywords = "Classification, Cross-validation, Error estimation, Leave-one-out, Linear discriminant analysis, Resubstitution, Sampling distribution",
author = "Amin Zollanvari and Braga-Neto, {Ulisses M.} and Dougherty, {Edward R.}",
year = "2010",
month = "2",
doi = "10.1109/TIT.2009.2037034",
language = "English",
volume = "56",
pages = "784--804",
journal = "IEEE Transactions on Information Theory",
issn = "0018-9448",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
number = "2",
}

TY - JOUR

T1 - Joint sampling distribution between actual and estimated classification errors for linear discriminant analysis

AU - Zollanvari, Amin

AU - Braga-Neto, Ulisses M.

AU - Dougherty, Edward R.

PY - 2010/2

Y1 - 2010/2

AB - Error estimation must be used to find the accuracy of a designed classifier, an issue that is critical in biomarker discovery for disease diagnosis and prognosis in genomics and proteomics. This paper presents, for what is believed to be the first time, the analytical formulation for the joint sampling distribution of the actual and estimated errors of a classification rule. The analysis presented here concerns the linear discriminant analysis (LDA) classification rule and the resubstitution and leave-one-out error estimators, under a general parametric Gaussian assumption. Exact results are provided in the univariate case, and a simple method is suggested to obtain an accurate approximation in the multivariate case. It is also shown how these results can be applied in the computation of condition bounds and the regression of the actual error, given the observed error estimate. In contrast to asymptotic results, the analysis presented here is applicable to finite training data. In particular, it applies in the small-sample settings commonly found in genomics and proteomics applications. Numerical examples, which include parameters estimated from actual microarray data, illustrate the analysis throughout.

KW - Classification

KW - Cross-validation

KW - Error estimation

KW - Leave-one-out

KW - Linear discriminant analysis

KW - Resubstitution

KW - Sampling distribution

UR - http://www.scopus.com/inward/record.url?scp=77649305686&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=77649305686&partnerID=8YFLogxK

U2 - 10.1109/TIT.2009.2037034

DO - 10.1109/TIT.2009.2037034

M3 - Article

AN - SCOPUS:77649305686

VL - 56

SP - 784

EP - 804

JO - IEEE Transactions on Information Theory

JF - IEEE Transactions on Information Theory

SN - 0018-9448

IS - 2

M1 - 5420275

ER -