Employing Sentence Space Embedding for Classification of Data Stream from Fake News Domain

Paweł Zyblewski (ORCID: 0000-0002-4224-6709; Corresponding Author. Email: [email protected]), Jakub Klikowski (ORCID: 0000-0002-3825-5514), Weronika Borek-Marciniec (ORCID: 0000-0003-2426-9541), Paweł Ksieniewicz (ORCID: 0000-0001-9578-8395)
Department of Systems and Computer Networks, Faculty of Information and Communication Technology, Wrocław University of Science and Technology, Wybrzeże Wyspiańskiego 27, 50-370 Wrocław, Poland
Abstract

Tabular data is considered the last unconquered castle of deep learning, yet the task of data stream classification is stated to be an equally important and demanding research area. Due to the temporal constraints, it is assumed that deep learning methods are not the optimal solution for application in this field. However, excluding the entire – and prevalent – group of methods seems rather rash given the progress that has been made in recent years in its development. For this reason, the following paper is the first to present an approach to natural language data stream classification using the sentence space method, which allows for encoding text into the form of a discrete digital signal. This allows the use of convolutional deep networks dedicated to image classification to solve the task of recognizing fake news based on text data. Based on the real-life Fakeddit dataset, the proposed approach was compared with state-of-the-art algorithms for data stream classification based on generalization ability and time complexity.

1 Introduction

There is a widespread opinion among researchers dealing with stream learning that using deep neural networks in this field is a suboptimal solution due to the data processing time [4]. For this reason – contrary to the opinion presented in the literature – tabular data seems not to be the only unconquered castle of deep learning [14]. The literature shows that neural networks are being gradually adapted to process data streams [10]. However, it remains one of the current research problems in the field of deep learning [4].

Excluding an entire pool of solutions that often offer better quality than classical methods, with excessive time complexity as the main argument, may be hasty. Deep networks are currently becoming increasingly popular, mainly thanks to tools such as ChatGPT, DALL-E, and Sora [32].

The following work closely examines the use of deep networks in data stream processing. The research was based on state-of-the-art methods dedicated to streaming data to ensure batch processing and to take into account common problems affecting streams, such as the need for active model training or the occurrence of concept drift [17].

This problem is juxtaposed with a second task that is particularly important in the era of generally available large language models: the recognition of fake news, understood here as content constructed to deliberately mislead the recipient for the benefit of the distributor. Online portals and social media platforms cause a flood of information that requires stream processing, so that models have a chance to adapt to new facts and language dynamics [19]. Machine learning models cannot process text data in its raw form, which poses an additional challenge to processing time in stream learning.

In conjunction with neural networks for Natural Language Processing (NLP) tasks, embeddings are almost always used [13]. Their advantage over canonical methods based on n-grams is that the latter maintain the semantic connection only with words in the local context. Nevertheless, there are methods based on non-neural network solutions in the literature that allow for achieving high quality in the classification of fake news [20].

Ultimately, however, text preprocessing methods transform the text into tabular data with a specific representation. There are models dedicated to text recognition problems that are adapted to such tasks and offer pre-trained weights [8]. However, the pool of such solutions is still smaller than the pool of methods available for image processing, for which convolutional networks are the primary classification tool [23]. The situation is similar for canonical tabular data, for which methods that transform tables into images have already been proposed and, used with deep networks, have given promising results. This became the motivation for applying this solution to text data in the following paper. For this purpose, the sentence space [15] representation was chosen, which allows the creation of an image even in the case of short texts, such as article titles.

The main contributions of this work are:

  • the proposal of the Streaming Sentence Space (SSS) approach, which uses sentence space encoding to classify data streams containing text for the first time in the literature,

  • the advancement of deep learning applications in the data stream classification task, which is now considered one of the main research directions related to deep neural networks,

  • a comparison of SSS with state-of-the-art data stream ensemble classification algorithms in terms of classification accuracy and computational complexity.

2 Related works

This section presents the foundations of the proposed solution, both from the point of view of NLP and data stream processing.

2.1 Text data extraction methods

At the core of Natural Language Processing lies the challenge of converting natural language content into a numerical feature space that preserves the document’s semantic information. The bag-of-words method, a fundamental text feature extraction technique, counts the occurrences of specific words within individual samples [21]. This approach forms the basis for more advanced methods such as bag-of-n-grams, which retain the contextual information of specific words and thereby a certain semantic association [11]. The TF-IDF method [33] further enhances this vector-based approach by incorporating two key statistics: Term Frequency (TF), which measures word frequency within a document, and Inverse Document Frequency (IDF), a logarithmic measure of word uniqueness within a corpus.
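To make the TF-IDF weighting concrete, the following minimal Python sketch computes it for a toy corpus. It assumes the plain, unsmoothed variant (relative term frequency multiplied by the logarithm of the inverse document frequency) and whitespace tokenization; library implementations typically add smoothing and normalization. The function name and the example titles are illustrative only.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {term: weight} dictionary per document (unsmoothed TF-IDF)."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    # Document frequency: the number of documents in which each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical titles, not taken from any dataset.
titles = ["fake news spreads fast", "real news spreads slowly", "cats sleep"]
vectors = tf_idf(titles)
```

In this toy corpus, a term that occurs in a single title ("fake") receives a larger weight than one shared by two titles ("news"), and a term present in every document would receive a weight of zero.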

In pursuit of dimensionality reduction and a feature vector representation in continuous space, the continuous bag-of-words approach was invented, which, along with the continuous skip-gram model, is more widely known as Word2Vec (W2V) [25]. This method allows the determination of a vector representation of a given length for each word in the corpus. Despite its innovation, this approach (i) fails to handle morphologically rich languages, and (ii) determines embeddings based only on local word relationships in the corpus. The answer to the first problem is the FastText approach [3], which, relying on the Word2Vec idea, performs additional word segmentation into character n-grams, significantly enhancing the context during processing. In response to the second drawback, the Global Vectors (GloVe) [31] model was developed, which is also based on Word2Vec but additionally extends the model to include general statistics from the processed corpus using a global word-word co-occurrence matrix.

Among vectorization methods, Large Language Models (LLMs) rank at the top of the most commonly used approaches [13]. They use an artificial neural network structure called a transformer as their basis [37]. Their significant advantage is the employment of self-attention heads that powerfully extend the context, and this, in combination with the processing of massive linguistic resources, generates promising text representations [26]. One of the most popular large language models is Bidirectional Encoder Representations from Transformers (BERT) [8], which is designed to pre-train deep bidirectional representations from unlabeled text. As a result, it can capture the dependencies present throughout the text. Additionally, the BERT model operates not on whole words but on subword units called WordPieces.

However, despite their many advantages, large language models are considerably complex, which affects the time required to determine vectors. The MiniLM model [40] is an interesting approach in which the authors train a reduced model using knowledge distillation from BERT, maintaining a quality similar to the original. The transformer idea draws inspiration from the sequence processing concept of the seq2seq model [36]. The expected input is a sequence of tokens forming a longer text. In contrast, for processing a single word into a vector space, a more intuitive solution would be to employ word embeddings.

2.2 Multi-Dimensional Encoding of text data

Access to massive volumes of data and increased processing power promotes the usage of deep learning techniques. They are the foundation of multimodal data processing, which is primarily reliant on computer vision tasks, where deep methods frequently outperform canonical approaches [29]. Numerous scientific articles confirm that convolutional networks are successfully utilized for image, video, natural language [12], and audio classification (e.g., in spectrogram form) [34] in both unimodal and multimodal settings. Deep networks facilitate transfer learning, enabling models to apply previously learned information to the task at hand [42].

Although the term multi-dimensional encoding is mainly used for tabular data, the Sentence Space proposed by Kim [15] can be considered its equivalent for text corpora. Using sentence space, text data is transformed into an image in which each row contains the embedding of an individual word of a given text sample. This approach clearly depends on the configuration of the convolutional neural network architecture used, as noted and studied by Zhang and Wallace [41], who analyzed possible configurations of one-layer CNNs. In turn, Le et al. [22] analyzed the effect of the depth of convolutional neural networks on sentence space. Lately, an extension of the original concept of text encoding to a two-dimensional discrete digital signal was proposed by Soni et al. [35] in the form of TextConvoNet. This approach extracts the n-gram features within a sentence and captures the n-gram features between sentences in the input text data, resulting in a three-dimensional representation.
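The row-per-word layout of sentence space can be illustrated with a short Python sketch. The embedding table below is hypothetical and four-dimensional, whereas real vectors would come from a pre-trained model; mapping out-of-vocabulary words to a zero row is likewise an assumption of this sketch rather than a detail taken from the cited works.

```python
def sentence_space(text, embeddings, dim):
    """Encode a text as a 2D array: one row per word, one column per embedding dimension."""
    zero = [0.0] * dim  # assumption: out-of-vocabulary words become a zero row
    return [list(embeddings.get(word, zero)) for word in text.lower().split()]


# Hypothetical 4-dimensional embeddings used only for illustration.
toy_embeddings = {
    "fake": [0.9, 0.1, 0.0, 0.3],
    "news": [0.2, 0.8, 0.5, 0.1],
    "spreads": [0.4, 0.4, 0.7, 0.6],
}
image = sentence_space("Fake news spreads online", toy_embeddings, dim=4)
```

The resulting array has as many rows as the text has words, so its height varies between samples while its width is fixed by the embedding length.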

2.3 Classifier ensemble for imbalanced data stream

Despite over three decades of research and numerous techniques now accessible, classifying drifting imbalanced data streams remains one of the important machine learning topics. Methods designed for this type of data can work in an online manner, where each instance is analyzed individually, or on batches of data, where the stream is processed in non-overlapping windows. This work focuses on batch processing, which, due to the larger training set, may provide improved classification quality in the current concept, but is accompanied by a delayed reaction caused by having to wait for the next data batch to become available [1].

Data imbalance is a prevalent problem in data streams, where the imbalance ratio can be static or dynamic [2]. Most real streams do not have a fixed imbalance ratio, and their properties might change over time [39]. As a result, data stream classification methods should achieve good classification quality regardless of class distribution. However, most techniques built for imbalanced data streams produce unsatisfactory results when class sizes are similar. At the same time, algorithms designed with balanced data in mind frequently have difficulty correctly classifying data streams with a skewed class distribution [6]. Methods designed for dealing with imbalanced data are separated into two main groups: (i) data-level approaches and (ii) algorithm-level techniques [1]. The first group focuses on data preprocessing, canonically by employing oversampling or undersampling, to change the data characteristics prior to classification in an attempt to alleviate the bias towards the majority class, whereas algorithm-level approaches focus on modifying the training phase of classification algorithms.

The most prevalent methods for imbalanced data stream classification employ classifier ensembles coupled with data preprocessing techniques. By assuring diversity, constantly updating the classifier pool, and combining the available models, it is feasible to increase the generalization ability and allow for dynamic adaptation to changes in the stream’s characteristics [5]. Among the established algorithms, we can distinguish Learn++.CDS (Concept Drift with SMOTE) and Learn++.NIE (Nonstationary and Imbalanced Environments) by Ditzler and Polikar [9]. Learn++.CDS extends the Learn++.NSE (Non-Stationary Environments) algorithm by employing SMOTE in an attempt to balance the number of samples in each class, while Learn++.NIE utilizes a penalty constraint to balance classification accuracy on all classes and also employs a bagging-based sub-ensemble. Wang et al. proposed the Oversampling Online Bagging (OOB) and Undersampling Online Bagging (UOB) algorithms dedicated to online data stream processing, extending Online Bagging by altering the Poisson distribution parameter $\lambda$ according to the current imbalance ratio [38]. Cano and Krawczyk developed the Kappa Updated Ensemble (KUE) [6], which combines batch-based and online processing on feature subspaces. KUE uses the Kappa statistic to dynamically weight and select base classifiers. The same authors also introduced the Robust Online Self-Adjusting Ensemble (ROSE) for non-stationary data stream classification [7]. This method trains online learners based on data views, ensuring a diverse classifier pool. It also employs drift detectors to respond quickly to changes in data distribution and proposes effective strategies for dealing with data imbalance. Woźniak et al. used built-in mechanisms (e.g., weighting and aging of classification models) to establish a self-updating classifier pool that can adjust its lineup in response to changes in imbalance ratio and concept drift. Klikowski and Woźniak trained one-class classifiers using clustered data [16], while Zyblewski et al. proposed to combine Dynamic Classifier Selection with data preprocessing techniques for imbalanced data stream classification [43].
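As an illustration of the Kappa statistic mentioned above, the sketch below computes Cohen's kappa for a set of predictions. It shows only the statistic itself; how KUE turns it into weights and selection decisions is not reproduced here, and the handling of the degenerate case is an assumption of this sketch.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Agreement between labels and predictions, corrected for chance agreement."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    # Chance agreement from the marginal label and prediction frequencies.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        return 0.0  # assumption: treat the degenerate single-class case as no skill
    return (observed - expected) / (1.0 - expected)


# A classifier that always predicts the majority class reaches 90% accuracy here,
# yet its kappa is zero because it is no better than chance agreement.
majority_only = cohen_kappa([0] * 9 + [1], [0] * 10)
perfect = cohen_kappa([0, 0, 1, 1], [0, 0, 1, 1])
```

This behavior is what makes the statistic attractive for imbalanced streams: high accuracy obtained by ignoring the minority class does not translate into a high score.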

3 Streaming Sentence Space

As observed in the introduction of this paper, solutions based on deep learning architectures are often overlooked in data stream classification tasks. This is due to concerns about increased computational and time complexity, in both the induction and inference process [24], despite the tremendous recent progress made in the area of deep learning. One promising solution is Sentence Space, a multi-dimensional encoding counterpart for text data. Despite numerous works confirming the performance of sentence space and its derivatives, no studies analyzing the use of this encoding in the task of classifying streams containing text data have been produced so far. In order to fill this niche, this paper proposes Streaming Sentence Space (SSS), thus taking the first step toward analyzing the application of sentence space in the classification of dynamically imbalanced data streams from the fake news domain.

The main assumption behind the presented research was to keep the time complexity as low as possible and to enable the use of SSS in real-life data stream classification tasks while maintaining the generalization potential inherent to convolutional neural networks. Accordingly, this work analyzes only a batch-based processing scenario, in which prediction and model training are performed on a window of predefined size. The need to wait for a single data chunk to fill up, depending on the dynamics of the data stream, can significantly reduce problems arising from possibly increased processing time.

The basis of SSS is the conventional sentence space encoding [15], in which individual words are transformed into embeddings that represent consecutive lines of an image. Of course, it is also possible to use approaches such as TextConvoNet, but this depends on the characteristics of the data being analyzed. The decision to use sentence space in this case was related to the relatively short length of the texts contained in the stream corpus (more in Section 4). As for the convolutional network, the decision was made to use the popular ResNet-18 architecture with the assumption of only one training epoch on each successive data chunk. The standard and commonly employed SGD optimizer was used with a learning rate of 0.001 and a momentum of 0.9, and the batch size was set to 8. Cross-entropy was used as the loss function.

Figure 1 illustrates SSS-based data stream processing. We regard the data stream as a sequence of text data chunks $DS^{T}_{k}$ with a fixed size of $N$, where $k$ is the batch index. SSS encodes each incoming data chunk into a series of two-dimensional discrete digital signals $DS^{I}_{k}$ that contains $N$ pictures with a predetermined side size. Each of the $N$ pictures from $DS^{I}_{k}$ is copied three times to provide an image representation with three color channels for the ResNet-18 architecture. ResNet-18 follows the Test-Then-Train protocol, performing inference and one training epoch for each data batch $DS^{I}_{k}$.
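The processing loop described above can be sketched in plain Python as follows. The model here is a hypothetical stand-in that predicts the majority label seen so far, used only so that the example runs without a deep learning library; it is not the ResNet-18 used in the paper. Scoring the very first chunk with an untrained model is also an assumption of this sketch.

```python
class MajorityClassModel:
    """Hypothetical stand-in for the CNN: predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = {}

    def predict(self, images):
        if not self.counts:
            return [0] * len(images)  # arbitrary default before any training
        majority = max(self.counts, key=self.counts.get)
        return [majority] * len(images)

    def train_one_epoch(self, images, labels):
        for label in labels:
            self.counts[label] = self.counts.get(label, 0) + 1


def to_three_channels(image):
    """Copy a single-channel image three times to mimic a three-channel input."""
    return [[row[:] for row in image] for _ in range(3)]


def test_then_train(chunks, model, encode):
    """Yield per-chunk accuracy; each chunk is used for testing first, then for training."""
    for texts, labels in chunks:
        images = [to_three_channels(encode(text)) for text in texts]
        predictions = model.predict(images)        # test on the new chunk ...
        correct = sum(p == y for p, y in zip(predictions, labels))
        model.train_one_epoch(images, labels)      # ... then train on it once
        yield correct / len(labels)


# Toy usage with a trivial encoder: one row per word, one column holding the word length.
encode = lambda text: [[float(len(word))] for word in text.split()]
stream = [(["a b", "c d"], [1, 1]), (["e f", "g h"], [1, 0])]
scores = list(test_then_train(stream, MajorityClassModel(), encode))
```

Because every chunk is evaluated before the model sees its labels, the reported quality always refers to data that was unseen at prediction time.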

Figure 1: The general scheme of the proposed SSS approach.
Figure 2: Results of preliminary experiments related to image size and transfer learning.

One of the fundamental problems arising from sentence space is how to determine the dimensions of the images resulting from encoding. The width is typically the length of the embedding vector, but selecting the height is not a trivial task, as it depends on the number of words contained in the text. Approaches based on padding or clipping texts to a specific length can be used here, but in the case of SSS, it was decided to resize the heights of the images by bilinear interpolation. The first two subplots of Fig. 2 show the experimental process of selecting image heights for the analyzed Fakeddit dataset. Text embeddings were obtained using the GloVe technique, a more recent alternative to the Word2Vec method usually used for this purpose, which offers better recognition quality in many problems. After analyzing the distribution of the number of words in the corpus texts, it was found that almost all of them were up to 50 words long. Due to this observation, 50×300 px was set as the initial dimensions of sentence space images after resizing. In addition, the experiment was repeated for dimensions of 100×300 px and 200×300 px. The obtained values of balanced accuracy score, although very close, indicate the advantage of images with a height of 200 px, and therefore this is the value used in the experiments presented next.
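Since only the height of a sentence space image changes while its width stays equal to the embedding length, the resizing step reduces to linear interpolation between neighboring rows. The sketch below illustrates this under the assumption that the first and last rows of the old and new grids coincide; library implementations of bilinear interpolation may map coordinates differently, so the exact values can differ from those produced in the paper's pipeline.

```python
def resize_height(image, new_height):
    """Linearly interpolate rows so the image has `new_height` rows; width is unchanged."""
    old_height = len(image)
    if old_height == 1 or new_height == 1:
        return [image[0][:] for _ in range(new_height)]
    resized = []
    for i in range(new_height):
        # Assumption: endpoints of the old and new row grids coincide.
        position = i * (old_height - 1) / (new_height - 1)
        lower = int(position)
        upper = min(lower + 1, old_height - 1)
        weight = position - lower
        resized.append([
            (1 - weight) * a + weight * b
            for a, b in zip(image[lower], image[upper])
        ])
    return resized


# A toy image with 3 words and 2 embedding dimensions, stretched to 5 rows.
image = [[0.0, 10.0], [2.0, 20.0], [4.0, 40.0]]
resized = resize_height(image, 5)
```

Compared with padding or clipping, this keeps every word's contribution in the image regardless of text length, at the cost of blending adjacent word embeddings.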

In addition, due to the relatively unusual characteristics of the images resulting from sentence space encoding and the possibility of negative knowledge transfer, a short experiment was conducted to determine the validity of using the ResNet-18 architecture pre-trained on the ImageNet dataset. In the image height experiment, transfer learning was applied by default, as is the case in many research articles, but here it was decided to repeat the study using the ResNet-18 architecture trained from scratch. The results obtained, presented in the last subplot of Fig. 2, indicate a minimal advantage of the pre-trained network when classifying images resulting from sentence space encoding.

4 Experimental Evaluation

The experimental study conducted to evaluate the performance of the SSS was designed to answer the following research questions:

  • RQ1 Which of the commonly used approaches for obtaining representations from text data for pattern recognition tasks should be used in conjunction with Sentence Space to obtain images that allow CNNs to achieve the highest generalization capability?

  • RQ2 Does the use of SSS make it possible to achieve classification quality superior to state-of-the-art ensemble data stream classification algorithms trained using representations obtained from commonly used extractors?

  • RQ3 How does the time complexity of SSS compare to state-of-the-art ensemble data stream classification algorithms, and does it enable its use in real-life data stream classification tasks?

4.1 Set-up

Data All of the research was conducted using Fakeddit’s multimodal dataset, which presents a real-life fake news classification task broken down into two, three or six classes [28]. This dataset consists of more than one million posts on 22 different subreddits of the social networking platform Reddit and includes text and image modalities, supplemented by metadata about the posts and their authors. For the purposes of this study, a single binary data stream was prepared, in which consecutive texts were sorted according to their creation timestamp. Of the entire dataset, 682,996 multimodal samples were used (the rest have only one modality). The decision to limit the stream to multimodal samples makes it easier to extend the presented research to include the image modality. The data stream was divided into 2731 data chunks, containing 250 samples each. Relevant to the course of the research is the fact that the resulting data stream is characterized by a dynamic imbalance ratio, which changes in successive batches to the point where, at about 3/4 of the length of the stream, the minority class becomes the majority class. The exact changes in the prior probability of class membership are shown in Fig. 3.
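The stream preparation described above (chronological ordering, division into fixed-size chunks, and tracking of class priors) can be sketched as follows. The posts are hypothetical (timestamp, label) pairs rather than Fakeddit records, and dropping the trailing partial chunk is an assumption, although it is consistent with the reported figures: 682,996 samples yield 2731 full chunks of 250.

```python
def iter_chunks(samples, chunk_size):
    """Yield consecutive, non-overlapping full chunks; a trailing partial chunk is dropped."""
    for start in range(0, len(samples) - chunk_size + 1, chunk_size):
        yield samples[start:start + chunk_size]


def positive_prior(chunk):
    """Fraction of samples in the chunk that carry label 1."""
    return sum(label for _, label in chunk) / len(chunk)


# Hypothetical (timestamp, label) pairs used only for illustration.
posts = [(5, 1), (1, 0), (3, 0), (2, 0), (4, 1), (6, 1), (7, 0)]
ordered = sorted(posts, key=lambda post: post[0])  # chronological order
priors = [positive_prior(chunk) for chunk in iter_chunks(ordered, 3)]

n_full_chunks = 682_996 // 250  # number of full 250-sample chunks in the stream
```

In the toy stream the prior of label 1 jumps from 0.0 in the first chunk to 1.0 in the second, a miniature version of the minority class becoming the majority class.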

Figure 3: Changes in the prior class probabilities over time.

Experimental protocol & reproducibility All experiments were carried out using the Test-Then-Train protocol to guarantee a robust experimental evaluation. They were implemented in Python and can be replicated using the publicly available GitHub repository (https://github.com/w4k2/sentence-space-stream). Implementations of state-of-the-art algorithms were based on the stream-learn [18], scikit-multiflow [27], and PyTorch [30] libraries. The classification quality evaluation of the algorithms was based on the standard metrics used in the task of imbalanced data classification, i.e. balanced accuracy score (BAC), recall, specificity, precision, $F_{1}$ score, $Gmean$, and $Gmean_{s}$.
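For reference, most of the listed metrics can be derived from the binary confusion matrix, as in the sketch below. It uses the common definitions (balanced accuracy as the mean of recall and specificity, and Gmean as the geometric mean of recall and specificity); $Gmean_{s}$ is omitted because its definition is not given in the text, and returning zero for undefined ratios is an assumption of this sketch.

```python
import math


def imbalance_metrics(y_true, y_pred, positive=1):
    """Compute confusion-matrix metrics for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0  # assumption: 0 if undefined

    recall = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    precision = ratio(tp, tp + fp)
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": ratio(2 * precision * recall, precision + recall),
        "bac": (recall + specificity) / 2,
        "gmean": math.sqrt(recall * specificity),
    }


# Toy labels: 4 positives (3 found) and 6 negatives (4 correctly rejected).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = imbalance_metrics(y_true, y_pred)
```

A classifier that always outputs a single class obtains a recall or specificity of 1.0 but a balanced accuracy of 0.5, which is why aggregate metrics are reported alongside the simple ones.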

4.2 Experiment scenarios

Experiment 1 – Extraction methods The goal of Experiment 1 was to compare the performance of SSS depending on the type of extractor used to obtain a representation for Sentence Space encoding. For this purpose, (i) GloVe, (ii) MiniLM, (iii) pre-trained Word2Vec, and (iv) Word2Vec updated after each data chunk were compared with each other. The representation width depending on the extractor was 380 for MiniLM and 300 for GloVe and W2V. Based on the results, the feature extraction method used in subsequent experiments was selected.

Experiment 2 – Comparison with data stream classification algorithms In Experiment 2, the SSS based on the extractor chosen in Experiment 1 was compared with state-of-the-art ensemble algorithms for imbalanced data stream classification. Among the methods mentioned in the literature review, (i) Hoeffding Tree with Hellinger split criterion (HF), (ii) Learn++.CDS (CDS), (iii) Learn++.NIE (NIE), (iv) Kappa Updated Ensemble (KUE), and (v) Robust Online Self-Adjusting Ensemble (ROSE) were selected as references. HF was used as the base classifier for all reference methods, and the maximum size of the classifier pool was set to 10. The selection of methods, base classifier, and pool size was based on the literature [1, 6, 7]. The entire set of reference algorithms was compared with SSS depending on the approach used to extract features from the text. In addition to the extractors used in Experiment 1, TF-IDF with unigrams and bigrams, limited to the 100 features with the highest term frequency, was employed here. A set of reference methods trained using the representation that provided the highest classification quality in terms of BAC was selected for the last experiment.

Experiment 3 – Time complexity The last experiment was designed to analyze the selected methods in terms of time complexity. For this purpose, the feature extraction, prediction, and training times were measured for both SSS and the reference methods on the first 110 data chunks of the Fakeddit stream. So that the processing time of the reference methods is reported only for a classifier pool with the maximum number of models, the first 10 data chunks were ignored. The experiment was repeated 10 times to stabilize the results obtained.
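A measurement of this kind can be sketched with the standard library timer, as shown below. The `extract`, `predict`, and `train` callables are hypothetical placeholders for the real pipeline stages, and accumulating totals over the timed chunks is an assumption about how the measurements are aggregated.

```python
import time


def time_stages(chunks, extract, predict, train, warmup=10):
    """Accumulate per-stage wall-clock time, ignoring the first `warmup` chunks."""
    totals = {"extraction": 0.0, "prediction": 0.0, "training": 0.0}
    timed_chunks = 0
    for index, (texts, labels) in enumerate(chunks):
        start = time.perf_counter()
        features = extract(texts)
        after_extract = time.perf_counter()
        predict(features)
        after_predict = time.perf_counter()
        train(features, labels)
        after_train = time.perf_counter()
        if index < warmup:
            continue  # warm-up chunks are processed but not included in the totals
        totals["extraction"] += after_extract - start
        totals["prediction"] += after_predict - after_extract
        totals["training"] += after_train - after_predict
        timed_chunks += 1
    return totals, timed_chunks


# Toy usage with placeholder stages on 12 identical chunks.
calls = {"train": 0}

def toy_train(features, labels):
    calls["train"] += 1

toy_chunks = [(["a", "bb"], [0, 1]) for _ in range(12)]
totals, timed_chunks = time_stages(
    toy_chunks,
    extract=lambda texts: [len(text) for text in texts],
    predict=lambda features: [0] * len(features),
    train=toy_train,
    warmup=10,
)
```

Note that the warm-up chunks still pass through training, so the model state at the first timed chunk reflects everything seen before it.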

4.3 Experiment 1 – Extraction methods

As can be seen in Figure 4, across all data chunks, the results of the first three methods (GloVe, MiniLM, and pre-trained Word2Vec) are similar, with the only deviating method being the partial-fit Word2Vec trained on each chunk. This discrepancy can be explained by the limitations of a dictionary built only on the training data, which consists primarily of short texts such as article titles. In contrast, MiniLM, the pre-trained Word2Vec (word2vec-google-news-300, based on the Google News dataset), and GloVe (glove-wiki-gigaword-300) were all built on larger volumes of data than the Word2Vec model trained on the stream.

Following this experiment, it was decided to use the GloVe vectors for further research. GloVe is a newer method than Word2Vec that processes additional global information and, unlike MiniLM, which is dedicated to sequential processing, was designed to process single words. In addition, the MiniLM model incurs a higher computational cost when determining the representation vector for a given word, which for the Word2Vec and GloVe methods is limited to reading values from the vector array. Given equal effectiveness, the exclusion of MiniLM is fully justified.

Figure 4: Results of an experiment to determine the best extraction method for SSS.

4.4 Experiment 2 – Comparison with data stream classification algorithms

The preliminary part of the main comparative experiment began with an analysis of the influence of feature space reduction. It was conducted using Principal Component Analysis (PCA), projecting the original embedding space down to 100 features. As can be observed in Figure 5, the influence of this simple reduction is mostly cosmetic, always staying within a one percent margin of difference, so it is justified to reduce the problem representation for canonical models, since the representation gets smaller while the change in quality is negligible.

Figure 5: Results of an experiment to decide whether to use PCA for dimensionality reduction of representations for reference methods.

The main comparison of recognition quality, taking into account the deep strategy proposed in this work (SSS), is presented in Figure 6, divided into four empirical analyses according to the extraction strategies used for canonical methods. For each approach, the clear laggard turns out to be the NIE method, which keeps its model at the level of a random classifier for a very long time, and in two cases (GloVe and MiniLM) raises it only slightly in the final phase of the stream, after the drift in prior class probabilities shown in Fig. 3.

Figure 6: Comparison of SSS with reference methods depending on the extraction method used.

The weakest method in the main ranking (excluding NIE) turns out to be Learn++.CDS, which for all extractors except MiniLM clearly leans towards randomness. All analyzed reference solutions appear to be very sensitive to the concept drift occurring around chunk 2,000. The ranking of the KUE, ROSE, and single Hoeffding Tree methods depends on the extraction strategy used, but the overall distribution of their effectiveness is rather similar, with a slight advantage for ROSE.

SSS as a deep learning strategy shows a noticeable advantage over all canonical solutions throughout the entire data stream, being the only method in the comparison to maintain an average balanced accuracy score of 80 percent. This observation is supported by the extended analysis of metrics in the form of a radar plot (Figure 7), where an advantage of SSS can be observed in each of the simple and aggregate metrics used. The outlying specificity result for the NIE method stems solely from its complete inability to learn in the analyzed problem environment, which, for the vast majority of the processing time, pushes its decisions towards one of the problem classes regardless of the actual class distribution.

Figure 7: Comparison of SSS with reference methods trained using MiniLM embeddings.

Among the analyzed extraction methods, MiniLM comes marginally to the fore, as it was the only one that allowed the NIE strategy to noticeably rise from the random classifier level in the final part of the stream, and CDS to compete with the remaining methods.

4.5 Experiment 3 – Time complexity

Figure 8: Comparison of SSS with reference methods in terms of time complexity.

The results of the third experiment, showing the time spent on extraction, training, and testing, are shown in Figure 8. As we can see, in the case of preprocessing, all reference methods show uniform computation time (the lines overlap), and only the proposed approach deviates from this tendency and performs extraction faster. Notably, this happens even though the time measurement includes the transition to the image representation. The method owes this to the use of the GloVe technique for SSS, which in Experiment 1 had the best results for sentence space encoding. The reference methods, however, use the MiniLM transformer in accordance with the outcome of Experiment 2.

More variability can be observed in the training and testing processes. In both, a single Hoeffding Tree is the fastest. In turn, training in a single epoch is the slowest for ROSE, and prediction is the slowest for NIE. This is also reflected in the accumulated time graph, which indicates that the only method ahead of SSS over the entire processing is the Hoeffding Tree. Still, it should be borne in mind that a single classifier is considered here, not an ensemble.

Therefore, the proposed method is ahead of the classifier ensembles used for data streams even though it is based on convolutional networks and requires a transformation from text into an image. At the same time, despite the lowest processing time compared to state-of-the-art ensemble algorithms, SSS offers the highest generalization ability. All this means that SSS can be successfully used in real-life batch-based data stream classification tasks.

5 Conclusion

The presented research work aimed to achieve two main goals. The first was to address the application of deep learning in the task of data stream classification, which is presented in the current literature as one of the main research areas in need of further investigation. The second goal was to propose, for the first time in the literature, the use of sentence space, which is the equivalent of multi-dimensional encoding for text data, in a data stream classification task.

To realize the above goals, Streaming Sentence Space (sss) was proposed, which encodes the text found in individual data batches into discrete digital signals based on embeddings obtained through the GloVe technique. The resulting images are then classified using the ResNet-18 architecture, which, in order to reduce computational complexity, performs only a single training epoch on each data chunk.
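
The encoding step can be illustrated with a minimal sketch. It assumes that a text is tokenized on whitespace, that each token is mapped to its embedding vector, and that the vectors are stacked row by row into a fixed-size matrix that is zero-padded or truncated. The toy `EMBEDDINGS` table, the `encode_text` helper, and the `max_words` parameter are hypothetical stand-ins rather than the authors' implementation; in practice the vectors would come from a pretrained GloVe model.

```python
from typing import Dict, List

# Toy 3-dimensional embeddings; a hypothetical stand-in for pretrained GloVe vectors.
EMBEDDINGS: Dict[str, List[float]] = {
    "fake": [0.9, 0.1, 0.0],
    "news": [0.2, 0.8, 0.1],
    "spreads": [0.1, 0.3, 0.7],
}
EMBEDDING_DIM = 3


def encode_text(text: str, max_words: int = 4) -> List[List[float]]:
    """Stack word vectors into a (max_words x EMBEDDING_DIM) matrix."""
    rows = []
    for token in text.lower().split()[:max_words]:
        # Assumption: out-of-vocabulary tokens are mapped to a zero vector.
        rows.append(list(EMBEDDINGS.get(token, [0.0] * EMBEDDING_DIM)))
    # Zero-pad short texts so that every "image" has the same shape.
    while len(rows) < max_words:
        rows.append([0.0] * EMBEDDING_DIM)
    return rows


image = encode_text("Fake news spreads fast")
```

Because every text yields a matrix of the same shape, the result can be passed to an image classifier as a single-channel input.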

The developed approach was evaluated in computer experiments conducted on a real-life dynamically imbalanced data stream formed by chronologically ordering the texts contained in the Fakeddit dataset. The results showed that sss, thanks to the inherent generalization ability of the convolutional neural network, is able to outperform state-of-the-art classifier ensemble methods designed for imbalanced data stream classification in terms of classification quality. In addition, sss exhibits lower time complexity than the ensemble reference methods, which further encourages its use and contradicts the popular opinion that the time and computational complexity of deep learning is too high for data stream analysis.
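
Forming a stream by chronological ordering can be sketched as follows. This is only an illustration under stated assumptions: the `(timestamp, text, label)` record layout, the `make_stream` and `positive_ratio` helpers, and the chunk size are hypothetical and do not reflect the actual Fakeddit preprocessing.

```python
from typing import List, Tuple

# Hypothetical record layout: (timestamp, text, label).
Record = Tuple[int, str, int]


def make_stream(records: List[Record], chunk_size: int) -> List[List[Record]]:
    """Sort records by timestamp and split them into consecutive chunks."""
    ordered = sorted(records, key=lambda record: record[0])
    return [ordered[i:i + chunk_size] for i in range(0, len(ordered), chunk_size)]


def positive_ratio(chunk: List[Record]) -> float:
    """Fraction of records in the chunk that carry the positive label (1)."""
    return sum(label for _, _, label in chunk) / len(chunk)


records = [
    (30, "c", 1), (10, "a", 0), (20, "b", 0),
    (40, "d", 1), (60, "f", 0), (50, "e", 1),
]
stream = make_stream(records, chunk_size=3)
ratios = [positive_ratio(chunk) for chunk in stream]
```

A class ratio that changes from chunk to chunk, as in `ratios` here, is what makes such a stream dynamically imbalanced.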

Future research may focus on examining the applicability of other sentence space-derived techniques for encoding text into image form in the task of data stream classification. Another potentially interesting direction is the application of sentence space-based approaches to the classification of multimodal data streams that contain a text modality.

References

  • Aguiar et al. [2023] G. Aguiar, B. Krawczyk, and A. Cano. A survey on learning from imbalanced data streams: taxonomy, challenges, empirical study, and reproducible experimental framework. Machine learning, pages 1–79, 2023.
  • Aminian et al. [2019] E. Aminian, R. P. Ribeiro, and J. Gama. A study on imbalanced data streams. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 380–389. Springer, 2019.
  • Bojanowski et al. [2017] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics, 5:135–146, 06 2017. ISSN 2307-387X. 10.1162/tacl_a_00051.
  • Borisov et al. [2022] V. Borisov, T. Leemann, K. Seßler, J. Haug, M. Pawelczyk, and G. Kasneci. Deep neural networks and tabular data: A survey. IEEE Transactions on Neural Networks and Learning Systems, 2022.
  • Brzezinski and Stefanowski [2018] D. Brzezinski and J. Stefanowski. Ensemble classifiers for imbalanced and evolving data streams. In Data mining in time series and streaming databases, pages 44–68. World Scientific, 2018.
  • Cano and Krawczyk [2020] A. Cano and B. Krawczyk. Kappa updated ensemble for drifting data stream mining. Machine Learning, 109(1):175–218, 2020.
  • Cano and Krawczyk [2022] A. Cano and B. Krawczyk. Rose: robust online self-adjusting ensemble for continual learning on imbalanced drifting data streams. Machine Learning, 111(7):2561–2599, 2022.
  • Devlin et al. [2018] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • Ditzler and Polikar [2013] G. Ditzler and R. Polikar. Incremental learning of concept drift from streaming imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 25(10):2283–2301, Oct 2013. ISSN 1041-4347.
  • Duda et al. [2020] P. Duda, M. Jaworski, A. Cader, and L. Wang. On training deep neural networks using a streaming approach. Journal of Artificial Intelligence and Soft Computing Research, 10(1):15–26, 2020.
  • Fürnkranz [1998] J. Fürnkranz. A study using n-gram features for text categorization. Austrian Research Institute for Artificial Intelligence, 3(1998):1–10, 1998.
  • Gimenez et al. [2020] M. Gimenez, J. Palanca, and V. Botti. Semantic-based padding in convolutional neural networks for improving the performance in natural language processing. a case of study in sentiment analysis. Neurocomputing, 378:315–323, 2020.
  • Incitti et al. [2023] F. Incitti, F. Urli, and L. Snidaro. Beyond word embeddings: A survey. Information Fusion, 89:418–436, 2023. ISSN 1566-2535. https://doi.org/10.1016/j.inffus.2022.08.024.
  • Kadra et al. [2021] A. Kadra, M. Lindauer, F. Hutter, and J. Grabocka. Well-tuned simple nets excel on tabular datasets. Advances in neural information processing systems, 34:23928–23941, 2021.
  • Kim [2014] Y. Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.
  • Klikowski and Woźniak [2020] J. Klikowski and M. Woźniak. Employing one-class svm classifier ensemble for imbalanced data stream classification. In Computational Science–ICCS 2020: 20th International Conference, Amsterdam, The Netherlands, June 3–5, 2020, Proceedings, Part IV 20, pages 117–127. Springer, 2020.
  • Komorniczak and Ksieniewicz [2023] J. Komorniczak and P. Ksieniewicz. Complexity-based drift detection for nonstationary data streams. Neurocomputing, 552:126554, 2023.
  • Ksieniewicz and Zyblewski [2022] P. Ksieniewicz and P. Zyblewski. Stream-learn—open-source python library for difficult data stream batch analysis. Neurocomputing, 478:11–21, 2022.
  • Ksieniewicz et al. [2020] P. Ksieniewicz, P. Zyblewski, M. Choraś, R. Kozik, A. Giełczyk, and M. Woźniak. Fake news detection from data streams. In 2020 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2020.
  • Ksieniewicz et al. [2023] P. Ksieniewicz, P. Zyblewski, W. Borek-Marciniec, R. Kozik, M. Choraś, and M. Woźniak. Alphabet flatting as a variant of n-gram feature extraction method in ensemble classification of fake news. Engineering Applications of Artificial Intelligence, 120:105882, 2023.
  • Lang [1995] K. Lang. Newsweeder: Learning to filter netnews. In A. Prieditis and S. Russell, editors, Machine Learning Proceedings 1995, pages 331–339. Morgan Kaufmann, San Francisco (CA), 1995. ISBN 978-1-55860-377-6. https://doi.org/10.1016/B978-1-55860-377-6.50048-7.
  • Le et al. [2018] H. T. Le, C. Cerisara, and A. Denis. Do convolutional networks need to be deep for text classification? In Workshops at the Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • Li et al. [2021] Z. Li, F. Liu, W. Yang, S. Peng, and J. Zhou. A survey of convolutional neural networks: analysis, applications, and prospects. IEEE transactions on neural networks and learning systems, 33(12):6999–7019, 2021.
  • Michalski [1993] R. S. Michalski. Inferential theory of learning as a conceptual basis for multistrategy learning. Machine learning, 11:111–151, 1993.
  • Mikolov et al. [2013] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
  • Min et al. [2023] B. Min, H. Ross, E. Sulem, A. P. B. Veyseh, T. H. Nguyen, O. Sainz, E. Agirre, I. Heintz, and D. Roth. Recent advances in natural language processing via large pre-trained language models: A survey. ACM Comput. Surv., 56(2), sep 2023. ISSN 0360-0300. 10.1145/3605943.
  • Montiel et al. [2018] J. Montiel, J. Read, A. Bifet, and T. Abdessalem. Scikit-multiflow: A multi-output streaming framework. Journal of Machine Learning Research, 19(72):1–5, 2018.
  • Nakamura et al. [2019] K. Nakamura, S. Levy, and W. Y. Wang. r/fakeddit: A new multimodal benchmark dataset for fine-grained fake news detection. arXiv preprint arXiv:1911.03854, 2019.
  • O’Mahony et al. [2020] N. O’Mahony, S. Campbell, A. Carvalho, S. Harapanahalli, G. V. Hernandez, L. Krpalkova, D. Riordan, and J. Walsh. Deep learning vs. traditional computer vision. In Advances in Computer Vision: Proceedings of the 2019 Computer Vision Conference (CVC), Volume 1, pages 128–144. Springer, 2020.
  • Paszke et al. [2017] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017.
  • Pennington et al. [2014] J. Pennington, R. Socher, and C. D. Manning. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
  • Qin et al. [2023] C. Qin, A. Zhang, Z. Zhang, J. Chen, M. Yasunaga, and D. Yang. Is ChatGPT a general-purpose natural language processing task solver? In H. Bouamor, J. Pino, and K. Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1339–1384, Singapore, Dec. 2023. Association for Computational Linguistics. 10.18653/v1/2023.emnlp-main.85. URL https://aclanthology.org/2023.emnlp-main.85.
  • Salton and Buckley [1987] G. Salton and C. Buckley. Term weighting approaches in automatic text retrieval. Technical report, Cornell University, 1987.
  • Satt et al. [2017] A. Satt, S. Rozenberg, R. Hoory, et al. Efficient emotion recognition from speech using deep learning on spectrograms. In Interspeech, pages 1089–1093, 2017.
  • Soni et al. [2023] S. Soni, S. S. Chouhan, and S. S. Rathore. Textconvonet: A convolutional neural network based architecture for text classification. Applied Intelligence, 53(11):14249–14268, 2023.
  • Sutskever et al. [2014] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014.
  • Vaswani et al. [2017] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
  • Wang et al. [2015] S. Wang, L. L. Minku, and X. Yao. Resampling-based ensemble methods for online class imbalance learning. IEEE Transactions on Knowledge and Data Engineering, 27(5):1356–1368, May 2015. ISSN 2326-3865.
  • Wang et al. [2018] S. Wang, L. L. Minku, and X. Yao. A systematic study of online class imbalance learning with concept drift. IEEE transactions on neural networks and learning systems, 29(10):4802–4821, 2018.
  • Wang et al. [2020] W. Wang, F. Wei, L. Dong, H. Bao, N. Yang, and M. Zhou. Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 5776–5788. Curran Associates, Inc., 2020.
  • Zhang and Wallace [2015] Y. Zhang and B. Wallace. A sensitivity analysis of (and practitioners’ guide to) convolutional neural networks for sentence classification. arXiv preprint arXiv:1510.03820, 2015.
  • Zhuang et al. [2020] F. Zhuang, Z. Qi, K. Duan, D. Xi, Y. Zhu, H. Zhu, H. Xiong, and Q. He. A comprehensive survey on transfer learning. Proceedings of the IEEE, 109(1):43–76, 2020.
  • Zyblewski et al. [2021] P. Zyblewski, R. Sabourin, and M. Woźniak. Preprocessed dynamic classifier ensemble selection for highly imbalanced drifted data streams. Information Fusion, 66:138–154, 2021.