Search | arXiv e-print repository

doi 10.1007/s11222-023-10309-0

Random Forest Kernel for High-Dimension Low Sample Size Classification

Authors: Lucca Portes Cavalheiro, Simon Bernard, Jean Paul Barddal, Laurent Heutte

Abstract: High dimension, low sample size (HDLSS) problems are numerous among real-world applications of machine learning. From medical images to text processing, traditional machine learning algorithms are usually unsuccessful in learning the best possible concept from such data. In a previous work, we proposed a dissimilarity-based approach for multi-view classification, the Random Forest Dissimilarity (R… ▽ More High dimension, low sample size (HDLSS) problems are numerous among real-world applications of machine learning. From medical images to text processing, traditional machine learning algorithms are usually unsuccessful in learning the best possible concept from such data. In a previous work, we proposed a dissimilarity-based approach for multi-view classification, the Random Forest Dissimilarity (RFD), that perfoms state-of-the-art results for such problems. In this work, we transpose the core principle of this approach to solving HDLSS classification problems, by using the RF similarity measure as a learned precomputed SVM kernel (RFSVM). We show that such a learned similarity measure is particularly suited and accurate for this classification context. Experiments conducted on 40 public HDLSS classification datasets, supported by rigorous statistical analyses, show that the RFSVM method outperforms existing methods for the majority of HDLSS problems and remains at the same time very competitive for low or non-HDLSS problems. △ Less

Submitted 17 November, 2023; v1 submitted 23 October, 2023; originally announced October 2023.

Comments: 23 pages. To be published in statistics and computing (accepted September 26, 2023)

Journal ref: Stat Comput 34, 9 (2024)

arXiv:2301.12873 [pdf, other]

Approximating DTW with a convolutional neural network on EEG data

Authors: Hugo Lerogeron, Romain Picot-Clemente, Alain Rakotomamonjy, Laurent Heutte

Abstract: Dynamic Time Wrapping (DTW) is a widely used algorithm for measuring similarities between two time series. It is especially valuable in a wide variety of applications, such as clustering, anomaly detection, classification, or video segmentation, where the time-series have different timescales, are irregularly sampled, or are shifted. However, it is not prone to be considered as a loss function in… ▽ More Dynamic Time Wrapping (DTW) is a widely used algorithm for measuring similarities between two time series. It is especially valuable in a wide variety of applications, such as clustering, anomaly detection, classification, or video segmentation, where the time-series have different timescales, are irregularly sampled, or are shifted. However, it is not prone to be considered as a loss function in an end-to-end learning framework because of its non-differentiability and its quadratic temporal complexity. While differentiable variants of DTW have been introduced by the community, they still present some drawbacks: computing the distance is still expensive and this similarity tends to blur some differences in the time-series. In this paper, we propose a fast and differentiable approximation of DTW by comparing two architectures: the first one for learning an embedding in which the Euclidean distance mimics the DTW, and the second one for directly predicting the DTW output using regression. We build the former by training a siamese neural network to regress the DTW value between two time-series. Depending on the nature of the activation function, this approximation naturally supports differentiation, and it is efficient to compute. We show, in a time-series retrieval context on EEG datasets, that our methods achieve at least the same level of accuracy as other DTW main approximations with higher computational efficiency. We also show that it can be used to learn in an end-to-end setting on long time series by proposing generative models of EEGs. △ Less

Submitted 30 January, 2023; originally announced January 2023.

ACM Class: I.5

arXiv:2208.02397 [pdf, other]

Pattern Spotting and Image Retrieval in Historical Documents using Deep Hashing

Authors: Caio da S. Dias, Alceu de S. Britto Jr., Jean P. Barddal, Laurent Heutte, Alessandro L. Koerich

Abstract: This paper presents a deep learning approach for image retrieval and pattern spotting in digital collections of historical documents. First, a region proposal algorithm detects object candidates in the document page images. Next, deep learning models are used for feature extraction, considering two distinct variants, which provide either real-valued or binary code representations. Finally, candida… ▽ More This paper presents a deep learning approach for image retrieval and pattern spotting in digital collections of historical documents. First, a region proposal algorithm detects object candidates in the document page images. Next, deep learning models are used for feature extraction, considering two distinct variants, which provide either real-valued or binary code representations. Finally, candidate images are ranked by computing the feature similarity with a given input query. A robust experimental protocol evaluates the proposed approach considering each representation scheme (real-valued and binary code) on the DocExplore image database. The experimental results show that the proposed deep models compare favorably to the state-of-the-art image retrieval approaches for images of historical documents, outperforming other deep models by 2.56 percentage points using the same techniques for pattern spotting. Besides, the proposed approach also reduces the search time by up to 200x and the storage cost up to 6,000x when compared to related works based on real-valued representations. △ Less

Submitted 3 August, 2022; originally announced August 2022.

Comments: 7 pages

arXiv:2008.08920 [pdf, ps, other]

scikit-dyn2sel -- A Dynamic Selection Framework for Data Streams

Authors: Lucca Portes Cavalheiro, Jean Paul Barddal, Alceu de Souza Britto Jr, Laurent Heutte

Abstract: Mining data streams is a challenge per se. It must be ready to deal with an enormous amount of data and with problems not present in batch machine learning, such as concept drift. Therefore, applying a batch-designed technique, such as dynamic selection of classifiers (DCS) also presents a challenge. The dynamic characteristic of ensembles that deal with streams presents barriers to the applicatio… ▽ More Mining data streams is a challenge per se. It must be ready to deal with an enormous amount of data and with problems not present in batch machine learning, such as concept drift. Therefore, applying a batch-designed technique, such as dynamic selection of classifiers (DCS) also presents a challenge. The dynamic characteristic of ensembles that deal with streams presents barriers to the application of traditional DCS techniques in such classifiers. scikit-dyn2sel is an open-source python library tailored for dynamic selection techniques in streaming data. scikit-dyn2sel's development follows code quality and testing standards, including PEP8 compliance and automated high test coverage using codecov.io and circleci.com. Source code, documentation, and examples are made available on GitHub at https://github.com/luccaportes/Scikit-DYN2SEL. △ Less

Submitted 17 August, 2020; originally announced August 2020.

Comments: Paper introducing scikit-dyn2sel, a dynamic selection framework for data streams

arXiv:2007.08377 [pdf, other]

doi 10.1142/9789811211072_0007

Random Forest for Dissimilarity-based Multi-view Learning

Authors: Simon Bernard, Hongliu Cao, Robert Sabourin, Laurent Heutte

Abstract: Many classification problems are naturally multi-view in the sense their data are described through multiple heterogeneous descriptions. For such tasks, dissimilarity strategies are effective ways to make the different descriptions comparable and to easily merge them, by (i) building intermediate dissimilarity representations for each view and (ii) fusing these representations by averaging the dis… ▽ More Many classification problems are naturally multi-view in the sense their data are described through multiple heterogeneous descriptions. For such tasks, dissimilarity strategies are effective ways to make the different descriptions comparable and to easily merge them, by (i) building intermediate dissimilarity representations for each view and (ii) fusing these representations by averaging the dissimilarities over the views. In this work, we show that the Random Forest proximity measure can be used to build the dissimilarity representations, since this measure reflects similarities between features but also class membership. We then propose a Dynamic View Selection method to better combine the view-specific dissimilarity representations. This allows to take a decision, on each instance to predict, with only the most relevant views for that instance. Experiments are conducted on several real-world multi-view datasets, and show that the Dynamic View Selection offers a significant improvement in performance compared to the simple average combination and two state-of-the-art static view combinations. △ Less

Submitted 16 July, 2020; originally announced July 2020.

Comments: Published in Handbook of Pattern Recognition and Computer Vision, 2020 (preprint)

arXiv:2007.02572 [pdf, other]

A Novel Random Forest Dissimilarity Measure for Multi-View Learning

Authors: Hongliu Cao, Simon Bernard, Robert Sabourin, Laurent Heutte

Abstract: Multi-view learning is a learning task in which data is described by several concurrent representations. Its main challenge is most often to exploit the complementarities between these representations to help solve a classification/regression task. This is a challenge that can be met nowadays if there is a large amount of data available for learning. However, this is not necessarily true for all r… ▽ More Multi-view learning is a learning task in which data is described by several concurrent representations. Its main challenge is most often to exploit the complementarities between these representations to help solve a classification/regression task. This is a challenge that can be met nowadays if there is a large amount of data available for learning. However, this is not necessarily true for all real-world problems, where data are sometimes scarce (e.g. problems related to the medical environment). In these situations, an effective strategy is to use intermediate representations based on the dissimilarities between instances. This work presents new ways of constructing these dissimilarity representations, learning them from data with Random Forest classifiers. More precisely, two methods are proposed, which modify the Random Forest proximity measure, to adapt it to the context of High Dimension Low Sample Size (HDLSS) multi-view classification problems. The second method, based on an Instance Hardness measurement, is significantly more accurate than other state-of-the-art measurements including the original RF Proximity measurement and the Large Margin Nearest Neighbor (LMNN) metric learning measurement. △ Less

Submitted 6 July, 2020; originally announced July 2020.

Comments: accepted to ICPR 2020 (22/06/2020)

arXiv:1907.09404 [pdf, other]

Deep Learning Approaches for Image Retrieval and Pattern Spotting in Ancient Documents

Authors: Kelly Lais Wiggers, Alceu de Souza Britto Junior, Alessandro Lameiras Koerich, Laurent Heutte, Luiz Eduardo Soares de Oliveira

Abstract: This paper describes two approaches for content-based image retrieval and pattern spotting in document images using deep learning. The first approach uses a pre-trained CNN model to cope with the lack of training data, which is fine-tuned to achieve a compact yet discriminant representation of queries and image candidates. The second approach uses a Siamese Convolution Neural Network trained on a… ▽ More This paper describes two approaches for content-based image retrieval and pattern spotting in document images using deep learning. The first approach uses a pre-trained CNN model to cope with the lack of training data, which is fine-tuned to achieve a compact yet discriminant representation of queries and image candidates. The second approach uses a Siamese Convolution Neural Network trained on a previously prepared subset of image pairs from the ImageNet dataset to provide the similarity-based feature maps. In both methods, the learned representation scheme considers feature maps of different sizes which are evaluated in terms of retrieval performance. A robust experimental protocol using two public datasets (Tobacoo-800 and DocExplore) has shown that the proposed methods compare favorably against state-of-the-art document image retrieval and pattern spotting methods. △ Less

Submitted 22 July, 2019; originally announced July 2019.

Comments: The paper is under consideration at Pattern Recognition Letters

arXiv:1906.09513 [pdf]

Image Retrieval and Pattern Spotting using Siamese Neural Network

Authors: Kelly L. Wiggers, Alceu S. Britto Jr., Laurent Heutte, Alessandro L. Koerich, Luiz S. Oliveira

Abstract: This paper presents a novel approach for image retrieval and pattern spotting in document image collections. The manual feature engineering is avoided by learning a similarity-based representation using a Siamese Neural Network trained on a previously prepared subset of image pairs from the ImageNet dataset. The learned representation is used to provide the similarity-based feature maps used to fi… ▽ More This paper presents a novel approach for image retrieval and pattern spotting in document image collections. The manual feature engineering is avoided by learning a similarity-based representation using a Siamese Neural Network trained on a previously prepared subset of image pairs from the ImageNet dataset. The learned representation is used to provide the similarity-based feature maps used to find relevant image candidates in the data collection given an image query. A robust experimental protocol based on the public Tobacco800 document image collection shows that the proposed method compares favorably against state-of-the-art document image retrieval methods, reaching 0.94 and 0.83 of mean average precision (mAP) for retrieval and pattern spotting (IoU=0.7), respectively. Besides, we have evaluated the proposed method considering feature maps of different sizes, showing the impact of reducing the number of features in the retrieval performance and time-consuming. △ Less

Submitted 22 June, 2019; originally announced June 2019.

Comments: Accepted for IJCNN 2019

arXiv:1906.08580 [pdf, other]

Pattern Spotting in Historical Documents Using Convolutional Models

Authors: Ignacio Úbeda, Jose M. Saavedra, Stéphane Nicolas, Caroline Petitjean, Laurent Heutte

Abstract: Pattern spotting consists of searching in a collection of historical document images for occurrences of a graphical object using an image query. Contrary to object detection, no prior information nor predefined class is given about the query so training a model of the object is not feasible. In this paper, a convolutional neural network approach is proposed to tackle this problem. We use RetinaNet… ▽ More Pattern spotting consists of searching in a collection of historical document images for occurrences of a graphical object using an image query. Contrary to object detection, no prior information nor predefined class is given about the query so training a model of the object is not feasible. In this paper, a convolutional neural network approach is proposed to tackle this problem. We use RetinaNet as a feature extractor to obtain multiscale embeddings of the regions of the documents and also for the queries. Experiments conducted on the DocExplore dataset show that our proposal is better at locating patterns and requires less storage for indexing images than the state-of-the-art system, but fails at retrieving some pages containing multiple instances of the query. △ Less

Submitted 20 June, 2019; originally announced June 2019.

Comments: 6 pages, 9 figures

arXiv:1806.07686 [pdf, other]

Dynamic voting in multi-view learning for radiomics applications

Authors: Hongliu Cao, Simon Bernard, Laurent Heutte, Robert Sabourin

Abstract: Cancer diagnosis and treatment often require a personalized analysis for each patient nowadays, due to the heterogeneity among the different types of tumor and among patients. Radiomics is a recent medical imaging field that has shown during the past few years to be promising for achieving this personalization. However, a recent study shows that most of the state-of-the-art works in Radiomics fail… ▽ More Cancer diagnosis and treatment often require a personalized analysis for each patient nowadays, due to the heterogeneity among the different types of tumor and among patients. Radiomics is a recent medical imaging field that has shown during the past few years to be promising for achieving this personalization. However, a recent study shows that most of the state-of-the-art works in Radiomics fail to identify this problem as a multi-view learning task and that multi-view learning techniques are generally more efficient. In this work, we propose to further investigate the potential of one family of multi-view learning methods based on Multiple Classifiers Systems where one classifier is learnt on each view and all classifiers are combined afterwards. In particular, we propose a random forest based dynamic weighted voting scheme, which personalizes the combination of views for each new patient for classification tasks. The proposed method is validated on several real-world Radiomics problems. △ Less

Submitted 26 June, 2018; v1 submitted 20 June, 2018; originally announced June 2018.

Comments: 10 pages

arXiv:1803.11241 [pdf, ps, other]

Improve the performance of transfer learning without fine-tuning using dissimilarity-based multi-view learning for breast cancer histology images

Authors: Hongliu Cao, Simon Bernard, Laurent Heutte, Robert Sabourin

Abstract: Breast cancer is one of the most common types of cancer and leading cancer-related death causes for women. In the context of ICIAR 2018 Grand Challenge on Breast Cancer Histology Images, we compare one handcrafted feature extractor and five transfer learning feature extractors based on deep learning. We find out that the deep learning networks pretrained on ImageNet have better performance than th… ▽ More Breast cancer is one of the most common types of cancer and leading cancer-related death causes for women. In the context of ICIAR 2018 Grand Challenge on Breast Cancer Histology Images, we compare one handcrafted feature extractor and five transfer learning feature extractors based on deep learning. We find out that the deep learning networks pretrained on ImageNet have better performance than the popular handcrafted features used for breast cancer histology images. The best feature extractor achieves an average accuracy of 79.30%. To improve the classification performance, a random forest dissimilarity based integration method is used to combine different feature groups together. When the five deep learning feature groups are combined, the average accuracy is improved to 82.90% (best accuracy 85.00%). When handcrafted features are combined with the five deep learning feature groups, the average accuracy is improved to 87.10% (best accuracy 93.00%). △ Less

Submitted 29 March, 2018; originally announced March 2018.

arXiv:1803.04460 [pdf, other]

Dissimilarity-based representation for radiomics applications

Authors: Hongliu Cao, Simon Bernard, Laurent Heutte, Robert Sabourin

Abstract: Radiomics is a term which refers to the analysis of the large amount of quantitative tumor features extracted from medical images to find useful predictive, diagnostic or prognostic information. Many recent studies have proved that radiomics can offer a lot of useful information that physicians cannot extract from the medical images and can be associated with other information like gene or protein… ▽ More Radiomics is a term which refers to the analysis of the large amount of quantitative tumor features extracted from medical images to find useful predictive, diagnostic or prognostic information. Many recent studies have proved that radiomics can offer a lot of useful information that physicians cannot extract from the medical images and can be associated with other information like gene or protein data. However, most of the classification studies in radiomics report the use of feature selection methods without identifying the machine learning challenges behind radiomics. In this paper, we first show that the radiomics problem should be viewed as an high dimensional, low sample size, multi view learning problem, then we compare different solutions proposed in multi view learning for classifying radiomics data. Our experiments, conducted on several real world multi view datasets, show that the intermediate integration methods work significantly better than filter and embedded feature selection methods commonly used in radiomics. △ Less

Submitted 12 March, 2018; originally announced March 2018.

Comments: conference, 6 pages, 2 figures

Showing 1–12 of 12 results for author: Heutte, L