Showing 1–36 of 36 results for author: Velcin, J

  1. arXiv:2407.13358  [pdf, other]

    cs.CL cs.LG

    Capturing Style in Author and Document Representation

    Authors: Enzo Terreau, Antoine Gourru, Julien Velcin

    Abstract: A wide range of Deep Natural Language Processing (NLP) models integrates continuous and low dimensional representations of words and documents. Surprisingly, very few models study representation learning for authors. These representations can be used for many NLP tasks, such as author identification and classification, or in recommendation systems. A strong limitation of existing works is that the…

    Submitted 18 July, 2024; originally announced July 2024.

  2. arXiv:2405.00632  [pdf, other]

    cs.CL cs.AI

    When Quantization Affects Confidence of Large Language Models?

    Authors: Irina Proskurina, Luc Brun, Guillaume Metzler, Julien Velcin

    Abstract: Recent studies introduced effective compression techniques for Large Language Models (LLMs) via post-training quantization or low-bit weight representation. Although quantized weights offer storage efficiency and allow for faster inference, existing works have indicated that quantization might compromise performance and exacerbate biases in LLMs. This study investigates the confidence and calibrat…

    Submitted 1 May, 2024; originally announced May 2024.

    Comments: Accepted to NAACL 2024 Findings

  3. Mini Minds: Exploring Bebeshka and Zlata Baby Models

    Authors: Irina Proskurina, Guillaume Metzler, Julien Velcin

    Abstract: In this paper, we describe the University of Lyon 2 submission to the Strict-Small track of the BabyLM competition. The shared task is created with an emphasis on small-scale language modelling from scratch on limited-size data and human language acquisition. The dataset released for the Strict-Small track has 10M words, which is comparable to children's vocabulary size. We approach the task with an a…

    Submitted 6 November, 2023; originally announced November 2023.

    Comments: CoNLL 2023 BabyLM Challenge

  4. arXiv:2304.05894  [pdf, other]

    cs.LG cs.IR cs.SI

    Dynamic Mixed Membership Stochastic Block Model for Weighted Labeled Networks

    Authors: Gaël Poux-Médard, Julien Velcin, Sabine Loudcher

    Abstract: Most real-world networks evolve over time. Existing literature proposes models for dynamic networks that are either unlabeled or assumed to have a single membership structure. On the other hand, a new family of Mixed Membership Stochastic Block Models (MMSBM) allows modeling static labeled networks under the assumption of mixed-membership clustering. In this work, we propose to extend this latter c…

    Submitted 12 April, 2023; originally announced April 2023.

  5. arXiv:2212.05996  [pdf, other]

    cs.LG cs.IR cs.SI

    Dirichlet-Survival Process: Scalable Inference of Topic-Dependent Diffusion Networks

    Authors: Gaël Poux-Médard, Julien Velcin, Sabine Loudcher

    Abstract: Information spread on networks can be efficiently modeled by considering three features: documents' content, time of publication relative to other publications, and position of the spreader in the network. Most previous works model up to two of those jointly, or rely on heavily parametric approaches. Building on recent Dirichlet-Point processes literature, we introduce the Houston (Hidden Online U…

    Submitted 12 December, 2022; originally announced December 2022.

  6. arXiv:2212.05995  [pdf, other]

    cs.LG cs.IR cs.SI

    Multivariate Powered Dirichlet Hawkes Process

    Authors: Gaël Poux-Médard, Julien Velcin, Sabine Loudcher

    Abstract: The publication time of a document carries relevant information about its semantic content. The Dirichlet-Hawkes process has been proposed to jointly model textual information and publication dynamics. This approach has been used with success in several recent works, and extended to tackle specific challenging problems, typically for short texts or entangled publication dynamics. However, the p…

    Submitted 13 December, 2022; v1 submitted 12 December, 2022; originally announced December 2022.

  7. arXiv:2209.09670  [pdf, other]

    cs.AI cs.LG

    Explainable Clustering via Exemplars: Complexity and Efficient Approximation Algorithms

    Authors: Ian Davidson, Michael Livanos, Antoine Gourru, Peter Walker, Julien Velcin, S. S. Ravi

    Abstract: Explainable AI (XAI) is an important developing area but remains relatively understudied for clustering. We propose an explainable-by-design clustering approach that not only finds clusters but also exemplars to explain each cluster. The use of exemplars for understanding is supported by the exemplar-based school of concept definition in psychology. We show that finding a small set of exemplars to…

    Submitted 20 September, 2022; originally announced September 2022.

    Comments: 22 pages; 4 figures

  8. arXiv:2209.07816  [pdf, ps, other]

    cs.SI cs.LG

    Properties of Reddit News Topical Interactions

    Authors: Gaël Poux-Médard, Julien Velcin, Sabine Loudcher

    Abstract: Most models of information diffusion online rely on the assumption that pieces of information spread independently from each other. However, several works pointed out the necessity of investigating the role of interactions in real-world processes, and highlighted possible difficulties in doing so: interactions are sparse and brief. As an answer, recent advances developed models to account for inte…

    Submitted 16 September, 2022; originally announced September 2022.

    Comments: Published at the conference Complex Networks and their Applications

    Journal ref: 2022 Complex Networks and their Applications XI

  9. arXiv:2209.07813  [pdf, other]

    cs.LG cs.IR cs.SI

    Serialized Interacting Mixed Membership Stochastic Block Model

    Authors: Gaël Poux-Médard, Julien Velcin, Sabine Loudcher

    Abstract: Recent years have seen renewed interest in the use of stochastic block modeling (SBM) in recommender systems. These models are seen as a flexible alternative to tensor decomposition techniques that are able to handle labeled data. Recent works proposed to tackle discrete recommendation problems via SBMs by considering larger contexts as input data and by adding second order interactions between…

    Submitted 16 September, 2022; originally announced September 2022.

    Comments: Published at ICDM 2022

    Journal ref: ICDM 2022 - IEEE International Conference on Data Mining 2022

  10. Interactions in Information Spread

    Authors: Gaël Poux-Médard, Julien Velcin, Sabine Loudcher

    Abstract: Large quantities of data flow on the internet. When a user decides to help the spread of a piece of information (by retweeting, liking, posting content), most research assumes she does so according to the information's content, publication date, the user's position in the network, the platform used, etc. However, there is another aspect that has received little attention in the literature: the i…

    Submitted 21 April, 2022; originally announced April 2022.

  11. arXiv:2201.12568  [pdf, other]

    cs.CL

    Le Processus Powered Dirichlet-Hawkes comme A Priori Flexible pour Clustering Temporel de Textes

    Authors: Gaël Poux-Médard, Julien Velcin, Sabine Loudcher

    Abstract: The textual content of a document and its publication date are intertwined. For example, the publication of a news article on a topic is influenced by previous publications on similar issues, according to underlying temporal dynamics. However, it can be challenging to retrieve meaningful information when textual information conveys little. Furthermore, the textual content of a document is not alwa…

    Submitted 29 January, 2022; originally announced January 2022.

    Comments: in French

  12. arXiv:2111.03496  [pdf, other]

    cs.CL cs.LG

    Monitoring geometrical properties of word embeddings for detecting the emergence of new topics

    Authors: Clément Christophe, Julien Velcin, Jairo Cugliari, Manel Boumghar, Philippe Suignard

    Abstract: Slow emerging topic detection is a task between event detection, where we aggregate behaviors of different words over a short period of time, and language evolution, where we monitor their long term evolution. In this work, we tackle the problem of early detection of slowly emerging new topics. To this end, we gather evidence of weak signals at the word level. We propose to monitor the behavior of wor…

    Submitted 5 November, 2021; originally announced November 2021.

  13. arXiv:2109.07170  [pdf, other]

    cs.LG cs.DM cs.IR

    Powered Hawkes-Dirichlet Process: Challenging Textual Clustering using a Flexible Temporal Prior

    Authors: Gaël Poux-Médard, Julien Velcin, Sabine Loudcher

    Abstract: The textual content of a document and its publication date are intertwined. For example, the publication of a news article on a topic is influenced by previous publications on similar issues, according to underlying temporal dynamics. However, it can be challenging to retrieve meaningful information when textual information conveys little information or when temporal dynamics are hard to unveil. F…

    Submitted 15 September, 2021; originally announced September 2021.

  14. Information Interaction Profile of Choice Adoption

    Authors: Gaël Poux-Médard, Julien Velcin, Sabine Loudcher

    Abstract: Interactions between pieces of information (entities) play a substantial role in the way an individual acts on them: adoption of a product, the spread of news, strategy choice, etc. However, the underlying interaction mechanisms are often unknown and have been little explored in the literature. We introduce an efficient method to infer both the entities interaction network and its evolution accord…

    Submitted 1 February, 2022; v1 submitted 28 April, 2021; originally announced April 2021.

    Comments: 18 pages, 4 figures

  15. arXiv:2104.12485  [pdf, other]

    cs.LG cs.DM

    Powered Dirichlet Process for Controlling the Importance of "Rich-Get-Richer" Prior Assumptions in Bayesian Clustering

    Authors: Gaël Poux-Médard, Julien Velcin, Sabine Loudcher

    Abstract: One of the most used priors in Bayesian clustering is the Dirichlet prior. It can be expressed as a Chinese Restaurant Process. This process allows nonparametric estimation of the number of clusters when partitioning datasets. Its key feature is the "rich-get-richer" property, which assumes a cluster has an a priori probability to get chosen linearly dependent on population. In this paper, we show…

    Submitted 26 April, 2021; originally announced April 2021.

    Comments: 17 pages, 4 figures

  16. arXiv:2004.04552  [pdf, other]

    cs.LG cond-mat.stat-mech physics.data-an stat.ML

    Interactions in information spread: quantification and interpretation using stochastic block models

    Authors: Gaël Poux-Médard, Julien Velcin, Sabine Loudcher

    Abstract: In most real-world applications, it is seldom the case that a given observable evolves independently of its environment. In social networks, users' behavior results from the people they interact with, news in their feed, or trending topics. In natural language, the meaning of phrases emerges from the combination of words. In general medicine, a diagnosis is established on the basis of the interact…

    Submitted 1 February, 2022; v1 submitted 9 April, 2020; originally announced April 2020.

    Comments: 17 pages, 3 figures, RecSys'21

  17. arXiv:2004.03621  [pdf, other]

    cs.IR cs.SI

    New Datasets and a Benchmark of Document Network Embedding Methods for Scientific Expert Finding

    Authors: Robin Brochier, Antoine Gourru, Adrien Guille, Julien Velcin

    Abstract: The scientific literature is growing faster than ever. Finding an expert in a particular scientific domain has never been as hard as today because of the increasing amount of publications and because of the ever growing diversity of expertise fields. To tackle this challenge, automatic expert finding algorithms rely on the vast scientific heterogeneous network to match textual queries with potenti…

    Submitted 7 April, 2020; originally announced April 2020.

  18. arXiv:2001.05727  [pdf, other]

    cs.IR cs.CL

    Document Network Projection in Pretrained Word Embedding Space

    Authors: Antoine Gourru, Adrien Guille, Julien Velcin, Julien Jacques

    Abstract: We present Regularized Linear Embedding (RLE), a novel method that projects a collection of linked documents (e.g. citation network) into a pretrained word embedding space. In addition to the textual content, we leverage a matrix of pairwise similarities providing complementary information (e.g., the network proximity of two documents in a citation graph). We first build a simple word vector avera…

    Submitted 16 January, 2020; originally announced January 2020.

  19. arXiv:2001.03369  [pdf, other]

    cs.LG cs.CL cs.IR stat.ML

    Inductive Document Network Embedding with Topic-Word Attention

    Authors: Robin Brochier, Adrien Guille, Julien Velcin

    Abstract: Document network embedding aims at learning representations for a structured text corpus, i.e., when documents are linked to each other. Recent algorithms extend network embedding approaches by incorporating the text content associated with the nodes in their formulations. In most cases, it is hard to interpret the learned representations. Moreover, little importance is given to the generalization t…

    Submitted 10 January, 2020; originally announced January 2020.

  20. arXiv:1909.05099  [pdf, other]

    cs.LG cs.IR stat.ML

    How to detect novelty in textual data streams? A comparative study of existing methods

    Authors: Clément Christophe, Julien Velcin, Jairo Cugliari, Philippe Suignard, Manel Boumghar

    Abstract: Since datasets with annotation for novelty at the document and/or word level are not easily available, we present a simulation framework that allows us to create different textual datasets in which we control the way novelty occurs. We also present a benchmark of existing methods for novelty detection in textual data streams. We define a few tasks to solve and compare several state-of-the-art meth…

    Submitted 11 September, 2019; originally announced September 2019.

    Comments: 16 pages

  21. arXiv:1902.11054  [pdf, other]

    cs.CL cs.LG cs.SI

    Link Prediction with Mutual Attention for Text-Attributed Networks

    Authors: Robin Brochier, Adrien Guille, Julien Velcin

    Abstract: In this extended abstract, we present an algorithm that learns a similarity measure between documents from the network topology of a structured corpus. We leverage the Scaled Dot-Product Attention, a recently proposed attention mechanism, to design a mutual attention mechanism between pairs of documents. To train its parameters, we use the network links as supervision. We provide preliminary exper…

    Submitted 20 March, 2019; v1 submitted 28 February, 2019; originally announced February 2019.

    Comments: Added missing reference

  22. arXiv:1902.11004  [pdf, other]

    cs.CL cs.LG cs.SI

    Global Vectors for Node Representations

    Authors: Robin Brochier, Adrien Guille, Julien Velcin

    Abstract: Most network embedding algorithms consist in measuring co-occurrences of nodes via random walks then learning the embeddings using Skip-Gram with Negative Sampling. While it has proven to be a relevant choice, there are alternatives, such as GloVe, which has not been investigated yet for network embedding. Even though SGNS better handles non co-occurrence than GloVe, it has a worse time-complexity…

    Submitted 28 February, 2019; originally announced February 2019.

    Comments: 2019 ACM World Wide Web Conference (WWW 19)

  23. Non-parametric clustering over user features and latent behavioral functions with dual-view mixture models

    Authors: Alberto Lumbreras, Julien Velcin, Marie Guégan, Bertrand Jouve

    Abstract: We present a dual-view mixture model to cluster users based on their features and latent behavioral functions. Every component of the mixture model represents a probability density over a feature view for observed user attributes and a behavior view for latent behavioral functions that are indirectly observed through user actions or behaviors. Our task is to infer the groups of users as well as th…

    Submitted 18 December, 2018; originally announced December 2018.

    Journal ref: Lumbreras, A., Velcin, J., Guégan, M. et al. Comput Stat (2017) 32:145

  24. arXiv:1807.03719  [pdf, other]

    cs.IR

    Peerus Review: a tool for scientific experts finding

    Authors: Robin Brochier, Adrien Guille, Julien Velcin, Benjamin Rothan, Di Cioccio

    Abstract: We propose a tool for expert finding applied to academic data generated by the start-up DSRT in the context of its application Peerus. A user may submit the title, the abstract and optionally the authors and the journal of publication of a scientific article and the application then returns a list of experts, potential reviewers of the submitted article. The retrieval algorithm is a voting syste…

    Submitted 28 June, 2018; originally announced July 2018.

    Comments: in French

    Journal ref: EGC 2018, Jan 2018, Paris, France

  25. arXiv:1806.10813  [pdf, ps, other]

    cs.IR

    Impact of the Query Set on the Evaluation of Expert Finding Systems

    Authors: Robin Brochier, Adrien Guille, Benjamin Rothan, Julien Velcin

    Abstract: Expertise is a loosely defined concept that is hard to formalize. Much research has focused on designing efficient algorithms for expert finding in large databases in various application domains. The evaluation of such recommender systems relies most of the time on human-annotated sets of experts associated with topics. The protocol of evaluation consists in using the namings or short descriptions o…

    Submitted 28 June, 2018; originally announced June 2018.

    Journal ref: BIRNDL 2018 (SIGIR 2018), Jul 2018, Ann Arbor, Michigan, USA

  26. Automatic Language Identification for Romance Languages using Stop Words and Diacritics

    Authors: Ciprian-Octavian Truică, Julien Velcin, Alexandru Boicea

    Abstract: Automatic language identification is a natural language processing problem that tries to determine the natural language of a given content. In this paper we present a statistical method for automatic language identification of written text using dictionaries containing stop words and diacritics. We propose different approaches that combine the two dictionaries to accurately determine the language…

    Submitted 14 June, 2018; originally announced June 2018.

  27. arXiv:1612.06195  [pdf, other]

    cs.DB cs.IR

    A Scalable Document-based Architecture for Text Analysis

    Authors: Ciprian-Octavian Truică, Jérôme Darmont, Julien Velcin

    Abstract: Analyzing textual data is a very challenging task because of the huge volume of data generated daily. Fundamental issues in text analysis include the lack of structure in document datasets, the need for various preprocessing steps (e.g., stem or lemma extraction, part-of-speech tagging, named entity recognition...), and performance and scaling issues. Existing text analysis architectures partly…

    Submitted 19 December, 2016; originally announced December 2016.

    Journal ref: 12th International Conference on Advanced Data Mining and Applications (ADMA 2016), Dec 2016, Gold Coast, Australia. Springer, 10086, pp.481-494, 2016, Lecture Notes in Artificial Intelligence

  28. How to Use Temporal-Driven Constrained Clustering to Detect Typical Evolutions

    Authors: Marian-Andrei Rizoiu, Julien Velcin, Stéphane Lallich

    Abstract: In this paper, we propose a new time-aware dissimilarity measure that takes into account the temporal dimension. Observations that are close in the description space, but distant in time are considered as dissimilar. We also propose a method to enforce the segmentation contiguity, by introducing, in the objective function, a penalty term inspired from the Normal Distribution Function. We combine t…

    Submitted 10 January, 2016; originally announced January 2016.

    Journal ref: Int. J. Artif. Intell. Tools 23, 1460013 (2014) [26 pages]

  29. Temporal Multinomial Mixture for Instance-Oriented Evolutionary Clustering

    Authors: Young-Min Kim, Julien Velcin, Stéphane Bonnevay, Marian-Andrei Rizoiu

    Abstract: Evolutionary clustering aims at capturing the temporal evolution of clusters. This issue is particularly important in the context of social media data that are naturally temporally driven. In this paper, we propose a new probabilistic model-based evolutionary clustering technique. The Temporal Multinomial Mixture (TMM) is an extension of the classical mixture model that optimizes feature co-occurrence…

    Submitted 10 January, 2016; originally announced January 2016.

  30. Unsupervised Feature Construction for Improving Data Representation and Semantics

    Authors: Marian-Andrei Rizoiu, Julien Velcin, Stéphane Lallich

    Abstract: Feature-based format is the main data representation format used by machine learning algorithms. When the features do not properly describe the initial data, performance starts to degrade. Some algorithms address this problem by internally changing the representation space, but the newly-constructed features are rarely comprehensible. We seek to construct, in an unsupervised way, new features that…

    Submitted 17 December, 2015; originally announced December 2015.

    Journal ref: Journal of Intelligent Information Systems, vol. 40, iss. 3, pp. 501-527, 2013

  31. Semantic-enriched Visual Vocabulary Construction in a Weakly Supervised Context

    Authors: Marian-Andrei Rizoiu, Julien Velcin, Stéphane Lallich

    Abstract: One of the prevalent learning tasks involving images is content-based image classification. This is a difficult task especially because the low-level features used to digitally describe images usually capture little information about the semantics of the images. In this paper, we tackle this difficulty by enriching the semantic content of the image representation by using external knowledge. The u…

    Submitted 14 December, 2015; originally announced December 2015.

    Journal ref: M.-A. Rizoiu, J. Velcin, and S. Lallich, "Semantic-enriched Visual Vocabulary Construction in a Weakly Supervised Context," Intelligent Data Analysis, vol. 19, iss. 1, pp. 161-185, 2015

  32. ClusPath: A Temporal-driven Clustering to Infer Typical Evolution Paths

    Authors: Marian-Andrei Rizoiu, Julien Velcin, Stéphane Bonnevay, Stéphane Lallich

    Abstract: We propose ClusPath, a novel algorithm for detecting general evolution tendencies in a population of entities. We show how abstract notions, such as the Swedish socio-economical model (in a political dataset) or companies' fiscal optimization (in an economical dataset) can be inferred from low-level descriptive features. Such high-level regularities in the evolution of entities are detected by…

    Submitted 10 December, 2015; originally announced December 2015.

  33. arXiv:1509.07344  [pdf, other]

    cs.IR stat.ML

    Opinion mining from twitter data using evolutionary multinomial mixture models

    Authors: Md. Abul Hasnat, Julien Velcin, Stéphane Bonnevay, Julien Jacques

    Abstract: The image of an entity can be defined as a structured and dynamic representation which can be extracted from the opinions of a group of users or population. Automatic extraction of such an image has certain importance in political science and sociology related studies, e.g., when an extended inquiry from large-scale data is required. We study the images of two politically significant entities of Franc…

    Submitted 24 September, 2015; originally announced September 2015.

    Comments: Submitted to the Annals of Applied Statistics

  34. arXiv:1505.02324  [pdf, other]

    cs.LG stat.ME stat.ML

    Simultaneous Clustering and Model Selection for Multinomial Distribution: A Comparative Study

    Authors: Md. Abul Hasnat, Julien Velcin, Stéphane Bonnevay, Julien Jacques

    Abstract: In this paper, we study different discrete data clustering methods, which use the Model-Based Clustering (MBC) framework with the Multinomial distribution. Our study comprises several relevant issues, such as initialization, model estimation and model selection. Additionally, we propose a novel MBC method by efficiently combining the partitional and hierarchical clustering techniques. We conduct e…

    Submitted 6 September, 2015; v1 submitted 9 May, 2015; originally announced May 2015.

    Comments: Accepted in the International Symposium on Intelligent Data Analysis (IDA 2015)

  35. arXiv:1504.07459  [pdf, other]

    cs.CL cs.SI

    CommentWatcher: An Open Source Web-based platform for analyzing discussions on web forums

    Authors: Marian-Andrei Rizoiu, Adrien Guille, Julien Velcin

    Abstract: We present CommentWatcher, an open source tool aimed at analyzing discussions on web forums. Constructed as a web platform, CommentWatcher features automatic mass fetching of user posts from forums on multiple sites, extracting topics, visualizing the topics as an expression cloud and exploring their temporal evolution. The underlying social network of users is simultaneously constructed using the…

    Submitted 28 April, 2015; originally announced April 2015.

    ACM Class: H.3.5; I.2.7

  36. arXiv:1309.7187  [pdf]

    cs.SI

    Analyse des rôles dans les communautés virtuelles : définitions et premières expérimentations sur IMDb

    Authors: Alberto Lumbreras, James Lanagan, Julien Velcin, Bertrand Jouve

    Abstract: Role analysis in online communities allows us to understand and predict users' behavior. Though several approaches have been followed, there is still a lack of generalization of their methods and their results. In this paper, we discuss the ground theory of roles and search for a consistent and computable definition that allows the automatic detection of roles played by users in forum threads o…

    Submitted 11 March, 2016; v1 submitted 27 September, 2013; originally announced September 2013.

    Comments: 4e Conférence sur les modèles et l'analyse des réseaux : Approches mathématiques et informatiques, MARAMI 2013, in French