Citation
Bappah, Mohammed Mohammed
(2021)
Modified frequency tables for visualisation and analysis of univariate and bivariate data.
Doctoral thesis, Universiti Putra Malaysia.
Abstract
One way to make sense of data is to organize it into a more meaningful format called
a frequency table. The existing continuous univariate frequency table uses the midpoint
to represent the magnitude of observations in each class, which results in an
error called grouping error. The use of the midpoint is due to the assumption that
each class’s observations are uniformly distributed and concentrated around their
midpoint, which is not always valid. The most significant parameter used when
constructing the continuous frequency table is the number of classes or class width.
Several rules for choosing the number of classes or class width have been reported
in the literature; however, none has been proven to be better in all situations. The
existing discrete frequency tables are simple to construct, easy to understand and
interpret. However, when the number of elements in the data is substantial, the table
can become complicated. The existing non-parametric correlation measure, the Kendall
correlation method, becomes laborious when the number of paired continuous observations
is large. Generally, continuous data are measured values such as
amount of rainfall, length
In this research, to address the issue of grouping error, we propose three statistics
(median, midrange, and random selection) to be used as the magnitude of observations
in each class instead of the midpoint. For choosing the number of classes or
class width, a new class width rule is proposed. We also propose new discrete frequency
tables that can be constructed by grouping the elements in the data into classes.
Using the bivariate continuous frequency table, a new correlation measure that is
straightforward and free of the normality assumption is developed. To address the
issue of missing data in a univariate continuous frequency table, five different imputation
methods are compared.
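
The class representatives described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the function name grouped_mean, the equal-width binning, and the reading of "random selection" as one observation drawn at random from the class are all assumptions made for this example.

```python
import random
import statistics


def grouped_mean(data, num_classes, representative):
    """Estimate the mean from an equal-width frequency table.

    `representative` selects the value that stands in for every
    observation of a class: "midpoint" (the existing approach), or
    "median", "midrange", "random" (the alternatives named in the abstract).
    """
    lo, hi = min(data), max(data)
    width = (hi - lo) / num_classes
    if width == 0:
        return float(lo)  # all observations identical, no grouping error

    # Assign each observation to a class; the maximum falls in the last class.
    classes = [[] for _ in range(num_classes)]
    for x in data:
        idx = min(int((x - lo) / width), num_classes - 1)
        classes[idx].append(x)

    total = 0.0
    for i, members in enumerate(classes):
        if not members:
            continue
        if representative == "midpoint":
            rep = lo + (i + 0.5) * width
        elif representative == "median":
            rep = statistics.median(members)
        elif representative == "midrange":
            rep = (min(members) + max(members)) / 2
        elif representative == "random":
            # Assumption: "random selection" means one observation of the class.
            rep = random.choice(members)
        else:
            raise ValueError(f"unknown representative: {representative}")
        total += rep * len(members)
    return total / len(data)


# Small skewed example: the true mean is 2.6.
sample = [0, 1, 1, 1, 10]
estimates = {rep: grouped_mean(sample, 2, rep)
             for rep in ("midpoint", "median", "midrange")}
# midpoint -> 3.5, median -> 2.8, midrange -> 2.4
```

The grouping error of each choice is the gap between its estimate and the mean of the raw observations. A single toy sample like this one says nothing about which representative is better in general.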
The four methods and the binning rules are compared simultaneously using the root
mean-squared error (RMSE), whereas for the comparison using real data, the absolute
error is used. The proposed discrete frequency tables are described using simulated
and real data, while the new bivariate continuous table’s correlation measure is
illustrated using simulations and real data.
The comparison using the continuous frequency table’s measure of location, mean,
showed that the methods that used the median and midrange of observations in each
class performed better relative to other methods. In choosing the number of classes,
the proposed class width rule is the best for data simulated from the normal and exponential
distributions. Meanwhile, for data simulated from the uniform distribution,
the square root rule performed better than the other rules. The methods’ evaluation
using the frequency table’s measures of skewness and kurtosis indicated that
the methods that used the median and midrange to represent the magnitude of observations
in each class were still the best. The new discrete frequency tables can be a
better choice, since they can handle datasets with a substantial number of elements
and vividly reveal the significant features of datasets.
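
For readers unfamiliar with the square root rule mentioned above, the sketch below builds a frequency table with it. The rule sets the number of classes to roughly the square root of the sample size. Rounding up and using equal-width classes are assumptions of this example, as are the function names. The thesis's proposed class width rule is not reproduced here because the abstract does not state it.

```python
import math
from collections import Counter


def sqrt_rule_classes(n):
    """Number of classes under the square root rule (rounded up here)."""
    return math.ceil(math.sqrt(n))


def frequency_table(data):
    """Return (lower, upper, count) rows for equal-width classes."""
    k = sqrt_rule_classes(len(data))
    lo, hi = min(data), max(data)
    width = (hi - lo) / k
    if width == 0:
        return [(lo, hi, len(data))]  # degenerate case: a single class
    # Class index of each observation; the maximum goes in the last class.
    counts = Counter(min(int((x - lo) / width), k - 1) for x in data)
    return [(lo + i * width, lo + (i + 1) * width, counts.get(i, 0))
            for i in range(k)]


# 16 observations give 4 classes of width 3.75.
table = frequency_table(list(range(16)))
```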
The results also showed that the new measure of correlation is approximately equal
to the Kendall correlation. Indeed, it can be used when the data are discrete, and it is the
best alternative when the number of paired observations is large. In handling missing
data, the simulation results showed that the mean imputation method is the best,
while the findings using real data indicated that mean imputation, k-nearest-neighbor
imputation, and multiple imputation by chained equations were the best methods.
Also, the five imputation methods’ performance is independent of the dataset
and the percentage of missingness, and the error increases as the percentage of
missing observations increases.
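
To show why the Kendall correlation used as the benchmark above becomes laborious for large samples, the sketch below computes it by direct pair counting. It implements the standard tau-a variant, in which tied pairs count as neither concordant nor discordant. It is not the thesis's new measure, which the abstract does not specify. The function name is chosen for this example only.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a by direct comparison of all n(n-1)/2 pairs."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have equal length of at least 2")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # sign == 0 is a tie and contributes to neither count
    return (concordant - discordant) / (n * (n - 1) / 2)


tau = kendall_tau_a([1, 2, 3], [1, 3, 2])  # 2 concordant, 1 discordant pair
```

The loop visits every pair of observations, so the work grows quadratically with the number of paired observations.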
Additional Metadata
Item Type: Thesis (Doctoral)
Subject: Social sciences - Statistical methods
Subject: Statistics
Subject: Multivariate analysis
Call Number: FS 2021 41
Chairman Supervisor: Mohd Bakri Adam, PhD
Divisions: Faculty of Science
Depositing User: Ms. Nur Faseha Mohd Kadim
Date Deposited: 01 Jun 2022 07:57
Last Modified: 01 Jun 2022 07:57
URI: http://psasir.upm.edu.my/id/eprint/92818