Citation
Dalatu, Paul Inuwa (2018) Statistical data preprocessing methods in distance functions to enhance k-means clustering algorithm. Doctoral thesis, Universiti Putra Malaysia.
Abstract
Clustering is an unsupervised classification method whose main aim is partitioning,
such that objects in the same cluster are similar and objects belonging to different
clusters differ significantly with respect to their attributes. The K-Means algorithm is
one of the most common and fastest partitional clustering algorithms, although with
unnormalized datasets it may reach only a local optimum.
We introduced two new normalization approaches to enhance the K-Means algorithm.
They remedy the problem with the existing Min-Max (MM) and Decimal Scaling (DS)
techniques, which have an overflow weakness. The suggested approaches are called
the new approach to min-max (NAMM) and the new approach to decimal scaling (NADS).
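For orientation, the two existing techniques named above can be sketched as follows. This is a minimal illustration of the conventional textbook definitions of MM and DS only. The abstract does not define NAMM or NADS, so they are not shown, and the function names and the handling of a constant attribute are assumptions made for this example.

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Conventional min-max (MM) scaling of one attribute to [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant attribute has no spread, so map it to new_min.
        return [new_min for _ in values]
    span = hi - lo
    return [(v - lo) / span * (new_max - new_min) + new_min for v in values]


def decimal_scaling(values):
    """Conventional decimal scaling (DS): divide by 10**j, where j is the
    smallest non-negative integer that brings every absolute value below 1."""
    biggest = max(abs(v) for v in values)
    j = 0
    while biggest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


if __name__ == "__main__":
    print(min_max([10, 20, 30]))
    print(decimal_scaling([-991, 45, 120]))
```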
A Hybrid mean algorithm, which is based on spherical clusters, is also proposed to
remedy the most significant limitation of the K-Means and K-Midranges algorithms. It
is obtained by combining the mean from the K-Means algorithm with the minimum and
maximum from the K-Midranges algorithm and computing their average as the cluster
mean of the Hybrid mean.
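One possible reading of this combination is sketched below: for each attribute, the cluster center is the average of the cluster mean, minimum, and maximum. The abstract does not give the exact formula, so the equal weighting of the three quantities and the function name `hybrid_center` are assumptions of this example, not the thesis's definition.

```python
def hybrid_center(points):
    """Per-attribute cluster center under an assumed reading of the Hybrid mean:
    the average of the attribute's mean, minimum, and maximum."""
    dims = len(points[0])
    center = []
    for d in range(dims):
        column = [p[d] for p in points]
        mean = sum(column) / len(column)
        # Assumption: mean, minimum, and maximum are weighted equally.
        center.append((mean + min(column) + max(column)) / 3)
    return center


if __name__ == "__main__":
    cluster = [(0, 0), (3, 3), (6, 9)]
    print(hybrid_center(cluster))
```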
The problem of using the range function in the Heterogeneous Euclidean-Overlap Metric
(HEOM) is addressed by replacing the range with the interquartile range; the resulting
metric is called the Interquartile Range-Heterogeneous Metric (IQR-HEOM). Dividing by
the range in HEOM allows outliers to have a large effect on the contribution of
attributes. Hence, we proposed the interquartile range, which is more resistant to
outliers in data pre-processing. The results show that the IQR-HEOM method is more
efficient at rectifying the problem caused by using the range in HEOM.
The Standardized Euclidean distance, which uses the standard deviation to down-weight
maximum points of the ith features on the distance clusters, has been criticized in the
literature by many researchers because the method is prone to outliers and has a 0%
breakdown point. Therefore, to remedy the problem, we introduced two statistical
estimators, the Qn and Sn estimators, both of which have a 50% breakdown point, with
efficiencies of 58% and 82% for Sn and Qn, respectively. The empirical evidence shows
that the two suggested methods are more efficient than the existing methods.
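The two robust-scaling ideas above can be sketched as follows. The HEOM form used here is the commonly cited one (distance 1 for a missing value, overlap for nominal attributes, scaled absolute difference for numeric attributes), with the scale passed in so that the range or the interquartile range can be supplied. The Sn and Qn functions are naive O(n^2) versions of the usual Rousseeuw-Croux definitions without finite-sample correction factors. All function names, the use of `None` for missing values, the quartile method of `statistics.quantiles`, and the consistency constants 1.1926 and 2.2219 are assumptions of this example; the abstract does not specify the thesis's implementation.

```python
import math
from itertools import combinations
from statistics import quantiles


def iqr(values):
    """Interquartile range using the default quartile method of statistics.quantiles."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1


def heom(x, y, scales, nominal):
    """HEOM-style distance. `scales` maps numeric attribute index to its scale
    (range for HEOM, IQR for the IQR variant); `nominal` is a set of indices."""
    total = 0.0
    for a, (xa, ya) in enumerate(zip(x, y)):
        if xa is None or ya is None:
            d = 1.0                          # missing value: maximal distance
        elif a in nominal:
            d = 0.0 if xa == ya else 1.0     # overlap metric
        else:
            d = abs(xa - ya) / scales[a]     # scaled numeric difference
        total += d * d
    return math.sqrt(total)


def sn(values):
    """Naive Sn: 1.1926 * lomed_i himed_j |x_i - x_j|."""
    n = len(values)
    inner = []
    for xi in values:
        diffs = sorted(abs(xi - xj) for xj in values)
        inner.append(diffs[n // 2])                   # high median
    return 1.1926 * sorted(inner)[(n + 1) // 2 - 1]   # low median


def qn(values):
    """Naive Qn: 2.2219 * k-th smallest pairwise distance, k = C(h, 2), h = n//2 + 1."""
    n = len(values)
    h = n // 2 + 1
    k = h * (h - 1) // 2
    diffs = sorted(abs(a - b) for a, b in combinations(values, 2))
    return 2.2219 * diffs[k - 1]


if __name__ == "__main__":
    column = [1, 2, 3, 4, 5, 6, 100]      # one numeric attribute with an outlier
    x, y = (2, "a"), (6, "b")
    print(heom(x, y, {0: max(column) - min(column)}, {1}))  # range-scaled
    print(heom(x, y, {0: iqr(column)}, {1}))                # IQR-scaled
    print(sn(column), qn(column))
```

In this toy column the single outlier inflates the range, so the range-scaled numeric term nearly vanishes, while the IQR, Sn, and Qn scales stay close to the spread of the bulk of the data.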
Additional Metadata
Item Type: Thesis (Doctoral)
Subject: Cluster analysis - Mathematical models
Subject: Statistics
Subject: Algorithms
Call Number: FS 2018 26
Chairman Supervisor: Professor Habshah Midi, PhD
Divisions: Faculty of Science
Depositing User: Ms. Nur Faseha Mohd Kadim
Date Deposited: 28 May 2019 02:45
Last Modified: 28 May 2019 02:45
URI: http://psasir.upm.edu.my/id/eprint/68681