UPM Institutional Repository

Bayesian nonparametric clustering with Dirichlet process mixture model for mixed-type data


Citation

Burhanuddin, Nurul Afiqah (2024) Bayesian nonparametric clustering with Dirichlet process mixture model for mixed-type data. Doctoral thesis, Universiti Putra Malaysia.

Abstract

Mixture models are widely used for clustering and density estimation. In particular, Bayesian nonparametric mixture models with a Dirichlet process prior have recently become popular for clustering because of their flexibility: the number of mixture components can grow without bound. This thesis presents modifications of Bayesian nonparametric methods for clustering mixed-type data, that is, data comprising continuous, ordinal, and nominal variables.

Many studies have reported successful applications of the Dirichlet process mixture (DPM) model to clustering continuous data. However, the recent DPM model for clustering mixed-type data assumes a common covariance matrix across clusters, which is too restrictive in practice. Accordingly, we develop a DPM model for clustering mixed-type data that allows cluster-specific covariance matrices. To demonstrate its flexibility, we compare it with the model that assumes a common covariance matrix. In this comparison, our model achieves higher Normalized Mutual Information (NMI) on simulated datasets with different cluster shapes and in two real data applications. It also recovers the true number of clusters in all cases, whereas the common-covariance model tends to overcluster the data.

In multivariate data, not all variables contribute to cluster discrimination. To distinguish relevant from irrelevant clustering variables, we further extend the DPM model for mixed-type data by placing a hierarchical shrinkage prior on the component means, which can be viewed as implicit variable selection in clustering. The hierarchical shrinkage prior is the normal-gamma prior for continuous and ordinal data and the grouped normal-gamma prior for nominal data. We then compare the proposed model with and without the shrinkage prior. The model with the shrinkage prior achieves better clustering performance, with higher NMI values, especially on simulated datasets with highly overlapping clusters and on real datasets. Throughout the comparison, it also produces tighter clusters as measured by silhouette width. Furthermore, the proposed model successfully distinguishes relevant variables from noisy ones, as reflected by the higher NMI values obtained when the model is fitted with only the relevant variables.

The standard DPM model addresses unsupervised learning problems, in which the data are analyzed without any background knowledge. To take such knowledge into account when it is available, we develop a constrained DPM model that can incorporate labels as side information. These labels enter our formulation through a product partition prior that gives higher prior preference to clusters of observations with similar labels. The formulation is further extended to handle multiple sources of side information. Empirical results on several simulated and real datasets show that the clustering performance of our model, measured by NMI, improves consistently as more labeled data become available. Even in the presence of noisy labels, the proposed model rarely performs worse than the standard unsupervised model, especially on continuous datasets. In the experiments with multiple side information, NMI also increases consistently as more side information becomes available.
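The abstract reports clustering quality as Normalized Mutual Information (NMI) between an estimated partition and the true labels. As an illustration only, the following is a minimal Python sketch of how such a score can be computed. The function names are hypothetical, and normalizing by the arithmetic mean of the two entropies is an assumption; the abstract does not state which NMI variant the thesis uses.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two label assignments of equal length."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
    return mi


def nmi(a, b):
    """NMI with arithmetic-mean normalization (an assumed variant)."""
    if len(a) != len(b):
        raise ValueError("label sequences must have the same length")
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0.0 and h_b == 0.0:
        # Both partitions put everything in one cluster: treat as identical.
        return 1.0
    return mutual_information(a, b) / ((h_a + h_b) / 2)


if __name__ == "__main__":
    truth = [0, 0, 1, 1]
    print(nmi(truth, [1, 1, 0, 0]))  # same partition, relabelled clusters
    print(nmi(truth, [0, 1, 0, 1]))  # partition independent of the truth
```

NMI depends only on the partition, not on the cluster labels themselves, which is why it suits comparing a DPM clustering (whose cluster labels are arbitrary) against known classes.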


Download File

118420 (IR).pdf (986kB)
Official URL or Download Paper: http://ethesis.upm.edu.my/id/eprint/18377

Additional Metadata

Item Type: Thesis (Doctoral)
Subject: Mixture (Mathematics)
Subject: Clustering (Statistics)
Subject: Bayesian statistical decision theory
Call Number: IPM 2024 6
Chairman Supervisor: Hani Syahida binti Zulkafli, PhD
Divisions: Institute for Mathematical Research
Keywords: Bayesian nonparametric, clustering, Dirichlet process, mixture model, model-based clustering
Depositing User: Ms. Rohana Alias
Date Deposited: 04 Aug 2025 07:34
Last Modified: 04 Aug 2025 07:34
URI: http://psasir.upm.edu.my/id/eprint/118420