Wastewater treatment plants (WWTPs) generate vast amounts of water quality, operational, and biological data. The potential of these big data, particularly when analyzed with machine learning (ML), to improve WWTP management is increasingly recognized. However, the costs of data collection and processing can rise sharply as datasets grow, and research on the optimal data volume for effective ML application remains limited. In this study, we comprehensively analyzed water quality, operational, and biological data collected from a full-scale WWTP over 970 days. Our results demonstrate that ML models can predict not only operational and water quality parameters (concentrations of dissolved oxygen and effluent chemical oxygen demand) but also the abundances of functional bacteria. Notably, we found that increasing data volume does not always improve model performance, and that data collection intervals need not be excessively short, as moderate intervals can still yield reliable predictions. These findings suggest that excessively large datasets may not be necessary for effective ML predictions in WWTPs. Overall, this study underscores the importance of optimizing dataset size to balance computational efficiency and prediction accuracy, providing insights into data management strategies that can enhance the operational efficiency and sustainability of WWTPs.
Keywords: Bacterial community; Big data; Long short-term memory; Machine learning; Wastewater treatment.
Copyright © 2024 Elsevier Ltd. All rights reserved.