DOI: https://doi.org/10.64539/sjer.v2i3.2026.478

Keywords: Bootstrapping, Data augmentation, Forecasting, Sliding window, Time-series

Abstract
Overfitting is the central challenge in nonlinear prediction and leads to inaccurate forecasts. The problem is especially evident when advanced deep learning models are applied to small agricultural datasets. Prior work following the knowledge discovery in databases (KDD) process has evaluated regularization methods alongside ensemble techniques, but has not sufficiently explored how structured data augmentation interacts with MaxNorm regularization. This study examines how a sliding-window transformation combined with bootstrap augmentation performs when ElasticNet, Bayesian, and MaxNorm regularization techniques are applied within an LSTM-XGBoost system for predicting Tikog grass demand. The results show that data augmentation reduced model overfitting and thereby improved predictive performance. Among the regularization strategies applied to the LSTM, MaxNorm achieved the largest error reduction, with testing MSE decreasing from 0.060472 to 0.002090 after augmentation. A comparative evaluation further shows that LSTM-XGBoost achieved the highest overall performance (R² = 0.997806), while the deep learning models were more sensitive to the choice of augmentation and regularization strategy. These findings indicate that structured time-series augmentation combined with norm-based regularization improves generalization, particularly for high-capacity sequence models trained on limited agricultural data.
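The paper's exact pipeline is not reproduced in this abstract, but the three preprocessing and regularization steps it names can be sketched as follows. This is a minimal, illustrative NumPy sketch, not the authors' implementation: the window length, augmentation factor, and the `maxnorm_constraint` helper are all assumptions made for the example.

```python
import numpy as np

def sliding_window(series, window):
    """Sliding-window transformation: each window of past values
    becomes one input row, and the value that follows it is the target."""
    n = len(series) - window
    X = np.array([series[i:i + window] for i in range(n)])
    y = np.array([series[i + window] for i in range(n)])
    return X, y

def bootstrap_augment(X, y, n_new, seed=0):
    """Bootstrap augmentation: resample (window, target) pairs with
    replacement to enlarge a small training set."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X), size=n_new)
    return np.vstack([X, X[idx]]), np.concatenate([y, y[idx]])

def maxnorm_constraint(W, c):
    """MaxNorm regularization: rescale any weight column whose L2 norm
    exceeds c back down to norm c (as applied after each update)."""
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    return W * np.minimum(1.0, c / np.maximum(norms, 1e-12))

# Toy usage: a 10-point series, window of 3, 5 bootstrap replicates.
series = np.arange(10.0)
X, y = sliding_window(series, 3)          # X: (7, 3), y: (7,)
X_aug, y_aug = bootstrap_augment(X, y, 5)  # X_aug: (12, 3)
```

In a full pipeline the augmented windows would feed an LSTM whose kernel weights are clipped with the MaxNorm constraint after each gradient step, and its outputs (or residuals) would be passed to XGBoost; frameworks such as Keras expose the same constraint directly as `keras.constraints.MaxNorm`.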
License
Copyright (c) 2026 Devine Grace D. Funcion, Marvee Cheska B. Natividad

This work is licensed under a Creative Commons Attribution 4.0 International License.

