Citation
Radhwane, Derraz
(2022)
Prediction of rice biomass using machine learning algorithms.
Doctoral thesis, Universiti Putra Malaysia.
Abstract
Conventional rice sampling methods are effective. However, they are
destructive, laborious, time-consuming, impractical for large fields, and subject
to human error. Unmanned aerial vehicles (UAVs) may address these issues.
Machine learning algorithms (MLs) can predict rice biomass from UAV-based
vegetation indices (VIs). Nevertheless, VIs are highly collinear and noisy, and
collecting large VI datasets is expensive. These issues affect the MLs' model
performance, stability (under/overfitting), variance, and confidence. This study
aims to: (i) compare the base and ensemble MLs’ model performance, variance,
stability, and confidence for predicting rice biomass using collinear
(multicollinearity context (MCC)) and non-collinear (non-multicollinearity context
(NMCC)) VIs; (ii) compare the predictability of rice total above-ground biomass (TAGB)
from noised and Kalman-filter-denoised VIs using the histogram gradient boosting
regressor (HGBR); (iii) develop a trigonometric-Euclidean-smoother interpolator
(TESI), including linear (LN-TESI), cubic (C-TESI), quadratic (Q-TESI), and
logarithmic (L-TESI) interpolators, for continuous time-series and non-time-series
VI data augmentation, and compare them to the tabular variational
autoencoder (TVAE) and the conditional tabular generative adversarial network
(CTGAN) for preventing under- and overfitting of a deep neural network (DNN). A split-plot randomised
complete block design (RCBD) experiment was conducted in a rice granary at
Terengganu, Malaysia, with 120 quadrants. Each quadrant provides five rice
biomass traits during the tillering, booting, and milking stages. A MicaSense
RedEdge multispectral camera mounted on a DJI quadcopter drone was used to
acquire the blue, green, red, red-edge, and NIR bands to extract the VI values
corresponding to each quadrant. Besides the biomass dataset, the non-time-series
fertiliser dataset and the time-series oil palm and rice datasets were also
collected to validate the TESI, TVAE, and CTGAN results. For the first objective,
the MLs' model performance and stability were better in MCC than in NMCC for
predicting all rice biomass traits. The ensemble MLs outperformed the base MLs
for predicting all rice biomass traits in MCC and NMCC. All base and ensemble
MLs achieved inconsistent patterns of coefficient of determination (R2) and root mean squared error (RMSE) variances in MCC and NMCC. Multicollinearity and
the base-ensemble MLs concept did not affect the model confidence; rather, the
latter was subject to the cross-effects of the ML and dataset characteristics. For
the second objective, the denoised VIs (R2 = 0.74–0.95, RMSE = 2.43–13.94 g
q-1) outperformed the noised VIs (R2 = 0.63–0.90, RMSE = 3.28–17.91 g q-1) for
the TAGB prediction. The denoised VIs achieved the highest R2 and lowest
RMSE values at the booting stage (R2 = 0.93–0.95, RMSE = 8.22–9.30 g q-1),
then tillering (R2 = 0.75–0.84, RMSE = 2.43–2.96 g q-1), and then milking stages
(R2 = 0.74–0.80, RMSE = 13.34–13.94 g q-1). The HGBR overfitted least
on the denoised VIs at the booting stage, with a training-testing change in R2
(ΔR2) of 0.02–0.09 and a training-testing change in RMSE (ΔRMSE) of
1.93–6.54 g q-1, followed by the tillering (ΔR2 = 0.08–0.21, ΔRMSE = 1.23–2.36 g q-1) and then
milking stages (ΔR2 = 0.14–0.25, ΔRMSE = 5.57–10.02 g q-1). For the third
objective, the TESI, TVAE, and CTGAN were applied to increase the four
datasets’ sizes. The TESI retained the features’ original probability distribution in
the four datasets. The C-TESI achieved the lowest mean absolute error
percentage (MAEP) on the oil palm (0.60–2.85%), rice (0.77–1.72%), and
fertiliser datasets (2.04–2.21%). The TESI kept the variance inflation factor
(VIF) ranges at about 10 or below on the four datasets, either retaining a VIF range of
1.99–10.06 or reducing it to 1.55–6.66. Furthermore, the TESI
retained the Spearman's r (rs) range of 0.79–0.97 or increased it to 0.81–0.99 on
the four datasets. The DNN achieved the highest R2 (0.77–0.99) and lowest
RMSE ranges (2.8E+01–8.1E+05) on the four datasets augmented with the
TESI. The Q-TESI, C-TESI, and L-TESI outperformed the LN-TESI in retaining the
features’ original probability distribution, minimising the augmentation loss,
reducing the VIF, increasing the rs, and decreasing the DNN under- and
overfitting. Overall, because most agronomic research relies on only a
few sensor bands, vegetation indices are highly collinear. Therefore, exploring
the multilevel sensitivity of different MLs to multicollinearity may inform the
methodological choices of several future agronomic studies. Additionally, stable
VI-biomass models accurately reflect rice yield potential, which may be
significantly improved by VIs' denoising. Further, the Q-TESI, C-TESI, and LTESI
minimise the proportionality of interpolation error to the square of the
distance between the data points compared to the LN-TESI. Consequently, the
Q-TESI, C-TESI, and L-TESI may approximate the nonlinear changes of crop
phenology in time-spaced sampling, thereby reducing the cost of sampling for
scientists. Furthermore, they intensify non-time-series zonal, synthetic sampling,
which reduces sampling labour.
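
The abstract uses the variance inflation factor (VIF), with 10 as the reference level, to describe how collinear the VIs are. The sketch below is not the thesis code; it only illustrates, on synthetic data, how a VIF can be computed from the inverse of the feature correlation matrix. The function name `variance_inflation_factors`, the simulated band values, and the three example features are all assumptions made for illustration.

```python
import numpy as np


def variance_inflation_factors(X):
    """Return one VIF per column of X (rows are samples, columns are features).

    Hypothetical helper: VIF_j = 1 / (1 - R_j^2), which equals the j-th
    diagonal entry of the inverse of the feature correlation matrix.
    """
    X = np.asarray(X, dtype=float)
    corr = np.corrcoef(X, rowvar=False)  # feature-by-feature correlations
    return np.diag(np.linalg.inv(corr))


# Synthetic stand-ins for reflectance bands (assumed values, not thesis data).
rng = np.random.default_rng(0)
n = 200
nir = rng.normal(0.5, 0.1, n)
red = rng.normal(0.1, 0.02, n)

vi_a = (nir - red) / (nir + red)        # NDVI-style index
vi_b = vi_a + rng.normal(0.0, 0.01, n)  # near-duplicate of vi_a (collinear)
vi_c = rng.normal(0.0, 1.0, n)          # unrelated feature

vifs = variance_inflation_factors(np.column_stack([vi_a, vi_b, vi_c]))
print(vifs)
```

In this toy setting the two nearly identical indices receive VIFs well above 10, while the unrelated feature stays close to 1, which is the kind of contrast the MCC and NMCC contexts refer to.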