The complexity of the cancer problem domain presents challenges not only to the medical analysis systems tasked with analysing it, but also to the users of such systems. While it is desirable to assist users in operating these medical analysis systems, groundwork is required first, such as recognising patterns in the way users create certain analyses within these systems. In this paper, we use machine learning algorithms to analyse user behaviour patterns and attempt to predict the next user interaction within the CARESS medical analysis system. Since an appropriate pre-processing scheme is essential for the performance of these algorithms, we propose a Natural Language Processing (NLP)-inspired approach that preserves some semantic cohesion among the mostly categorical features of these user interactions. Furthermore, we propose combining a sliding window over the latest user interactions with Latent Dirichlet Allocation (LDA) to extract a latent topic from these recent interactions, which then serves as additional input to the machine learning models. We compare this pre-processing scheme with other approaches that utilise one-hot encoding and feature hashing. The results of our experiments show that the sliding window LDA scheme is a promising solution that performs better for our use case than the other evaluated pre-processing schemes. Overall, our results provide an important building block for further research and development in the area of assisting users in operating analysis systems in complex problem domains.
Keywords: LDA; Medical analysis system; NLP; categorical encoding; classification; ensemble learning; feature hashing; one-hot; online learning; pre-processing.
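To make the sliding window LDA idea more concrete, the following is a minimal sketch and not the implementation evaluated in the paper. It assumes that each user interaction is a small record of categorical fields, turns every field into a token, treats the tokens of the last few interactions as one "document", and estimates per-document topic proportions with a basic collapsed Gibbs sampler for LDA. The field names (`action`, `target`), the example interactions, the window size, the number of topics, and the helper names `window_documents` and `fit_lda` are all hypothetical and do not reflect the CARESS data model.

```python
import random
from collections import deque


def window_documents(interactions, window_size):
    """Build one token list per step from the last `window_size` interactions."""
    window = deque(maxlen=window_size)
    docs = []
    for interaction in interactions:
        # NLP-inspired step: each categorical field becomes a token, e.g. "action=run".
        window.append([f"{key}={value}" for key, value in sorted(interaction.items())])
        docs.append([token for tokens in window for token in tokens])
    return docs


def fit_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns topic proportions per document."""
    rng = random.Random(seed)
    vocab = {tok: i for i, tok in enumerate(sorted({t for d in docs for t in d}))}
    n_words = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assignments = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for tok in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][vocab[tok]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)
    # Resample each token's topic conditioned on all other assignments.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, tok in enumerate(doc):
                w, z = vocab[tok], assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + n_words * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    # Smoothed topic proportions; each row sums to 1.
    return [
        [(count + alpha) / (len(doc) + n_topics * alpha) for count in counts]
        for counts, doc in zip(doc_topic, docs)
    ]


# Hypothetical interaction log; the fields and values are illustrative only.
interactions = [
    {"action": "open", "target": "dataset"},
    {"action": "add_filter", "target": "age_group"},
    {"action": "add_filter", "target": "diagnosis"},
    {"action": "run", "target": "survival_analysis"},
    {"action": "open", "target": "dataset"},
    {"action": "add_filter", "target": "region"},
    {"action": "run", "target": "incidence_report"},
]
WINDOW_SIZE = 3  # assumed window length
N_TOPICS = 2     # assumed number of latent topics

docs = window_documents(interactions, WINDOW_SIZE)
theta = fit_lda(docs, N_TOPICS)
# Index of the most probable topic per window, usable as an extra model input.
dominant_topic = [max(range(N_TOPICS), key=row.__getitem__) for row in theta]
```

In a sketch like this, either the dominant topic index or the full row of `theta` could be appended to the encoded features of the current interaction before it is passed to a classifier. In practice, a library implementation of LDA would likely replace the hand-written sampler.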