Citation
Burhanuddin, Nurul Afiqah (2024) Bayesian nonparametric clustering with Dirichlet process mixture model for mixed-type data. Doctoral thesis, Universiti Putra Malaysia.
Abstract
Mixture models are widely used for clustering
and density estimation. In particular, the Bayesian nonparametric mixture
model involving the Dirichlet process prior has recently enjoyed popularity in
clustering due to its flexibility, allowing the number of mixture components to
grow without bound. In this thesis, we present modifications of Bayesian
nonparametric methods for clustering mixed-type data, that is, data
comprising continuous, ordinal, and nominal variables.
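
The way a Dirichlet process prior lets the number of components grow can be illustrated with its Chinese restaurant process representation. The following is a minimal sketch, not code from the thesis; the function name `crp_cluster_sizes` and the concentration parameter name `alpha` are assumptions made for illustration.

```python
import random


def crp_cluster_sizes(n, alpha, seed=0):
    """Seat n observations by the Chinese restaurant process; return cluster sizes.

    `alpha` is the concentration parameter (hypothetical name); larger values
    make new clusters more likely.
    """
    rng = random.Random(seed)
    sizes = []  # sizes[k] = number of observations currently in cluster k
    for i in range(n):
        # Observation i joins cluster k with probability sizes[k] / (i + alpha)
        # and opens a new cluster with probability alpha / (i + alpha).
        u = rng.random() * (i + alpha)
        cumulative = 0.0
        for k, size in enumerate(sizes):
            cumulative += size
            if u < cumulative:
                sizes[k] += 1
                break
        else:
            # No existing cluster was selected, so a new one is opened.
            sizes.append(1)
    return sizes


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, len(crp_cluster_sizes(n, alpha=1.0)))
```

The number of clusters is not fixed in advance: it is a random quantity that tends to increase slowly as more observations are seated.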
Many studies have shown successful applications of the Dirichlet process mixture
(DPM) model for clustering continuous data. However, the recent DPM model for
clustering mixed-type data assumes a common covariance matrix across clusters,
which is too restrictive in practice. Accordingly, we develop a DPM model
for clustering mixed-type data that allows for cluster-specific covariance matrices.
To demonstrate the flexibility of our model, we compare it with the model with a
common covariance matrix. Through this comparison, our model shows superior
performance in terms of Normalized Mutual Information (NMI) on simulated
datasets with different cluster shapes and in two real-data applications. Our model
also recovers the true number of clusters in all cases, whereas
the model with a common covariance assumption tends to overcluster the
data.
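
For readers unfamiliar with NMI, the sketch below computes it from two label vectors. It is illustrative only: several normalizations are in use, and this version assumes the geometric mean of the two entropies, which may differ from the variant used in the thesis. The function names are hypothetical.

```python
from collections import Counter
from math import log, sqrt


def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def nmi(true_labels, pred_labels):
    """Mutual information divided by the geometric mean of the entropies."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, pred_labels))
    count_true = Counter(true_labels)
    count_pred = Counter(pred_labels)
    # Mutual information between the two partitions.
    mi = sum(
        (c / n) * log(c * n / (count_true[a] * count_pred[b]))
        for (a, b), c in joint.items()
    )
    h_true, h_pred = entropy(true_labels), entropy(pred_labels)
    if h_true == 0.0 or h_pred == 0.0:
        # Convention assumed here: two single-cluster partitions agree fully.
        return 1.0 if h_true == h_pred else 0.0
    return mi / sqrt(h_true * h_pred)
```

NMI depends only on how observations are grouped, not on the names given to the groups, so relabeling the clusters leaves the score unchanged.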
When dealing with multivariate data, not all variables contribute towards cluster
discrimination. To distinguish between relevant and irrelevant clustering variables,
the DPM model for mixed-type data is further extended by specifying
a hierarchical shrinkage prior on the component means. This can be thought of
as implicit variable selection in clustering. The hierarchical shrinkage prior
uses the normal-gamma prior for the continuous and ordinal variables
and the grouped normal-gamma prior for the nominal variables. The proposed
model is then compared with and without the shrinkage prior. The comparison
shows that the model with the shrinkage prior
achieves better clustering performance, with higher NMI values, especially on simulated
datasets with highly overlapping clusters and on real datasets. Throughout
the comparison, the model with the shrinkage prior also produces tighter clusters,
as measured by silhouette width. Furthermore, the proposed
model successfully distinguishes relevant variables from noisy ones, as reflected
by the higher NMI values observed when the model is fitted with only the
relevant variables.
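
The idea behind a normal-gamma shrinkage prior can be sketched by drawing from one. The abstract does not give the exact specification, so everything below is an assumption: each variable gets a gamma-distributed scaling factor shared across components, and each component mean is drawn from a normal distribution whose variance is multiplied by that factor. The grouped version for nominal variables is not shown, and all names and hyperparameters are hypothetical.

```python
import random
from math import sqrt


def draw_component_means(n_components, prior_means, prior_scales,
                         shape, rate, seed=0):
    """Draw component means under an assumed normal-gamma prior.

    For variable j: factor_j ~ Gamma(shape, rate), and each component mean
    mu_kj ~ Normal(prior_means[j], factor_j * prior_scales[j] ** 2).
    """
    rng = random.Random(seed)
    # One shrinkage factor per variable, shared by all components.
    # random.gammavariate takes (shape, scale), so scale = 1 / rate.
    factors = [rng.gammavariate(shape, 1.0 / rate) for _ in prior_means]
    means = [
        [
            rng.gauss(b, sqrt(lam) * s)
            for b, lam, s in zip(prior_means, factors, prior_scales)
        ]
        for _ in range(n_components)
    ]
    return factors, means


if __name__ == "__main__":
    factors, means = draw_component_means(
        n_components=3, prior_means=[0.0, 0.0], prior_scales=[1.0, 1.0],
        shape=0.5, rate=0.5,
    )
    print(factors)
    print(means)
```

A variable whose factor is close to zero has all of its component means pulled towards the common prior mean, so it contributes little to separating the clusters. This is the sense in which such a prior acts as implicit variable selection.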
The standard DPM model addresses unsupervised learning problems,
where the data are analyzed without any background knowledge. In some
applications, however, partial background knowledge such as labels is available.
To use this knowledge in the clustering process, we develop a constrained DPM
model that can incorporate labels as side information. These labels enter
our formulation through a product partition prior that assigns higher prior
probability to clusters whose observations have similar labels. The formulation is
further extended to handle multiple sources of side information. The empirical results on
several simulated and real datasets show that the clustering performance of our model,
measured by NMI, consistently improves as more labeled data become available.
Even in the presence of noisy labels, the proposed model rarely performs
worse than the standard unsupervised model, especially on continuous datasets.
In the experiments with multiple sources of side information, consistent increases in NMI are
also observed as more side information becomes available.
Additional Metadata
Item Type: Thesis (Doctoral)
Subject: Mixture (Mathematics)
Subject: Clustering (Statistics)
Subject: Bayesian statistical decision theory
Call Number: IPM 2024 6
Chairman Supervisor: Hani Syahida binti Zulkafli, PhD
Divisions: Institute for Mathematical Research
Keywords: Bayesian nonparametric, clustering, Dirichlet process, mixture model, model-based clustering
Depositing User: Ms. Rohana Alias
Date Deposited: 04 Aug 2025 07:34
Last Modified: 04 Aug 2025 07:34
URI: http://psasir.upm.edu.my/id/eprint/118420