Semi-supervised model-based clustering with positive and negative constraints

Volodymyr Melnykov, Igor Melnykov, Semhar Michael

Research output: Contribution to journal › Article › peer-review

8 Citations (Scopus)

Abstract

Cluster analysis is a popular technique in statistics and computer science whose objective is to group similar observations into relatively distinct groups, generally known as clusters. Semi-supervised clustering assumes that some additional information about group memberships is available. Under the most frequently considered scenario, labels are known for some portion of the data and unavailable for the remaining observations. In this paper, we discuss a general type of semi-supervised clustering defined by so-called positive and negative constraints. Under positive constraints, some data points are required to belong to the same cluster. Conversely, negative constraints specify that particular points must belong to different clusters. We outline a general framework for semi-supervised clustering with constraints that naturally incorporates the additional information into the EM algorithm traditionally used in mixture modeling and model-based clustering. The developed methodology is illustrated on synthetic and classification datasets. A dendrochronology application is considered and thoroughly discussed.
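To make the idea of constraints in the E-step concrete, the following is a minimal sketch and not the authors' implementation. It assumes one common formulation: points linked by positive constraints form a block that shares a single component label, and two blocks linked by a negative constraint are given zero probability of sharing a component. The function names, the one-dimensional unit-variance Gaussian components, and the toy data are all hypothetical and chosen only for illustration.

```python
import numpy as np

def block_posteriors(log_dens, log_pi, blocks):
    """Posterior over components for each positively constrained block.

    log_dens: (n, K) array of log f_k(x_i)
    log_pi:   (K,) array of log mixing proportions
    blocks:   list of index lists; each block is forced into one component
    Assumption: one mixing proportion is applied per block.
    """
    post = np.empty((len(blocks), log_pi.shape[0]))
    for b, idx in enumerate(blocks):
        # Log weight of component k: log pi_k + sum of member log densities.
        log_w = log_pi + log_dens[idx].sum(axis=0)
        log_w -= log_w.max()  # stabilise before exponentiating
        w = np.exp(log_w)
        post[b] = w / w.sum()
    return post

def negative_pair_posteriors(log_w_a, log_w_b):
    """Marginal posteriors for two blocks that must lie in different components.

    log_w_a, log_w_b: (K,) unnormalised log weights of each block.
    """
    joint = np.exp(log_w_a[:, None] + log_w_b[None, :]
                   - log_w_a.max() - log_w_b.max())
    np.fill_diagonal(joint, 0.0)  # forbid both blocks in the same component
    joint /= joint.sum()
    return joint.sum(axis=1), joint.sum(axis=0)

# Toy data (hypothetical): three points, two unit-variance Gaussian components.
x = np.array([0.2, 1.9, 3.8])
mu = np.array([0.0, 4.0])
log_dens = -0.5 * (x[:, None] - mu[None, :]) ** 2 - 0.5 * np.log(2 * np.pi)
log_pi = np.log(np.array([0.5, 0.5]))

# Positive constraint: points 0 and 1 must share a cluster; point 2 is alone.
blocks = [[0, 1], [2]]
post = block_posteriors(log_dens, log_pi, blocks)

# Negative constraint: the two blocks must fall in different clusters.
log_w = [log_pi + log_dens[idx].sum(axis=0) for idx in blocks]
post_a, post_b = negative_pair_posteriors(log_w[0], log_w[1])
```

In this sketch the ambiguous point at 1.9 inherits the assignment of its block rather than being classified on its own, which is the effect a positive constraint is meant to have. A full EM procedure would alternate this E-step with the usual parameter updates, and handling many interacting negative constraints requires more care than the pairwise case shown here.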

Original language: English
Pages (from-to): 327-349
Number of pages: 23
Journal: Advances in Data Analysis and Classification
Volume: 10
Issue number: 3
Publication status: Published - Sep 1 2016

Keywords

  • BIC
  • Finite mixture models
  • Model-based clustering
  • Positive and negative constraints
  • Semi-supervised clustering

ASJC Scopus subject areas

  • Statistics and Probability
  • Computer Science Applications
  • Applied Mathematics
