In this work, a classless oversampling technique, Covert, was developed to improve historical datasets from industrial processing plants and so aid process modelling. Using kernel density estimation and nearest neighbour algorithms, sparse regions of the data are identified and resampled, producing a more balanced dataset. When applied to a real dataset from a geothermal power plant, Covert outperforms current best practice (SMOTE) in uniformly populating the input feature space and in generating credible data in the output variable. When used to develop a data-driven model, Covert improved model accuracy by 20% when predicting outside the original data's feature space. SMOTE, in contrast, reduced model accuracy by 6% in the same feature space. Building reliable models of industrial processes remains a significant hurdle in developing a digital twin. Using Covert, existing imbalanced historical data can be used to extend the range of applicability of any process model.
Keywords: Classless oversampling; Digital twin; Historical data; Imbalanced data; Imbalanced regression; Kernel density estimate; Process modelling.
Copyright © 2023 The Author(s). Published by Elsevier Ltd. All rights reserved.
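
The general idea summarised in the abstract, using a kernel density estimate to find sparse regions and nearest neighbours to generate new samples there, can be illustrated with a minimal sketch. This is not the authors' Covert implementation: the function names (`gaussian_kde`, `oversample_sparse`), the one-dimensional input, the inverse-density weighting, the fixed bandwidth, and the linear interpolation between neighbours are all assumptions chosen only to make the concept concrete.

```python
import math
import random


def gaussian_kde(points, bandwidth):
    """Return a function that estimates the density of 1-D `points` at x."""
    n = len(points)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points
        )

    return density


def oversample_sparse(xs, ys, n_new, bandwidth=0.5, k=2, seed=0):
    """Create n_new synthetic (x, y) pairs, favouring low-density regions of x.

    Assumed scheme (hypothetical, for illustration only):
    1. weight each original sample by the inverse of its estimated density;
    2. draw a seed sample according to those weights;
    3. interpolate linearly between the seed and one of its k nearest
       neighbours in x, applying the same interpolation to y.
    """
    rng = random.Random(seed)
    density = gaussian_kde(xs, bandwidth)
    # Density at an observed point is always > 0 (its own kernel contributes).
    weights = [1.0 / density(x) for x in xs]
    indices = range(len(xs))

    synthetic = []
    for _ in range(n_new):
        i = rng.choices(indices, weights=weights, k=1)[0]
        # k nearest neighbours of sample i in the input space, excluding itself.
        neighbours = sorted(
            (j for j in indices if j != i), key=lambda j: abs(xs[j] - xs[i])
        )[:k]
        j = rng.choice(neighbours)
        t = rng.random()  # interpolation fraction in [0, 1)
        synthetic.append(
            (xs[i] + t * (xs[j] - xs[i]), ys[i] + t * (ys[j] - ys[i]))
        )
    return synthetic


# Illustrative data: a dense cluster in [0, 1) and a few isolated points.
xs = [i / 40 for i in range(40)] + [3.0, 4.0, 5.0, 6.0, 7.0]
ys = [2.0 * x for x in xs]
new_points = oversample_sparse(xs, ys, n_new=200)
```

Because the seed samples are drawn with inverse-density weights, most synthetic points land in the sparsely populated part of the input range, and because no class labels are involved, the scheme applies to regression data. A practical version would need multivariate inputs, a data-driven bandwidth, and a check that the interpolated output values are physically credible.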