Capturing Style in Author and Document Representation

Enzo Terreau    Antoine Gourru Corresponding Author. Email: [email protected].    Julien Velcin Université de Lyon, Lyon 2, ERIC UR3083 Laboratoire Hubert Curien, UMR CNRS 5516, Saint-Etienne, France
Abstract

A wide range of deep Natural Language Processing (NLP) models integrate continuous and low-dimensional representations of words and documents. Surprisingly, very few models study representation learning for authors. These representations can be used for many NLP tasks, such as author identification and classification, or in recommendation systems. A strong limitation of existing works is that they do not explicitly capture writing style, making them hardly applicable to literary data. We therefore propose a new architecture based on the Variational Information Bottleneck (VIB) that learns embeddings for both authors and documents with a stylistic constraint. Our model fine-tunes a pre-trained document encoder. We stimulate the detection of writing style by adding predefined stylistic features, making the representation axes interpretable with respect to writing style indicators. We evaluate our method on three datasets: a literary corpus extracted from Project Gutenberg, the Blog Authorship Corpus and IMDb62, for which we show that it matches or outperforms strong and recent baselines in authorship attribution while capturing the authors’ stylistic aspects much more accurately.

1 Introduction

Deep models for Natural Language Processing are usually based on Transformers, and they rely on latent intermediate representations. These representations are usually built in a self-supervised manner on a language modeling task, such as Masked Language Modeling (MLM) [6] or auto-regressive training [3]. They constitute a good feature space to solve downstream tasks, for example classification or generation, even though some of those tasks are still difficult to handle with prompt-based generative models like ChatGPT [24]. Additionally, some efforts have been made to benefit from large pretrained models to represent documents [4, 25] and even authors, with contributions like Usr2Vec [2], Aut2Vec [10], and DGEA [12]. The main drawback of these models is that they were shown by [35] to mainly focus on topics rather than on stylistic features of the text. Yet capturing writing style can be of great interest for some applications.

When working with literary data or for forensic investigation [38], practitioners are generally interested in detecting similarities in writing style regardless of the topics covered by the authors. The author’s style can be defined as every writing choice made without semantic information, and it is often studied through various linguistic and syntactic features. As demonstrated by [35], most author embedding techniques rely on the semantic content of documents: a poem and a piece of fiction about flowers will be placed close together in the latent space, regardless of their strong differences in sentence construction, structure, etc.

Figure 1: Author and book representations from R-PGD. We present a 2D t-SNE projection of VADES document and author embeddings on R-PGD. Books are represented with diamonds, authors with dots. The bigger the dot, the larger the learnt author variance.

As an answer to these limitations, we propose a new model that builds a representation space which captures writing style by using stylistic metrics as additional input features. We follow [19] and leverage the Variational Information Bottleneck (VIB) framework [1], which was shown to outperform the classical pointwise contrastive training. More precisely, we propose to use it to fine-tune a pretrained document encoder (such as [4]) and author representations on an authorship attribution task. This is, to our knowledge, the first time that this framework is applied to author representation learning. Then, we add a term to the objective function to force the representations to capture stylistic features. We name this new model VADES. Using pretrained models allows us to benefit from accurate intermediate text representations, built on ready-to-use language resources. In Figure 1, we present a subset of authors from Project Gutenberg and the representation of the documents they wrote. The size of an author’s vector is proportional to its variance, learnt by using the VIB framework. As expected, some works that are stylistic outliers for their author (e.g., Thus Spake Zarathustra from Nietzsche) lie closer in the representation space to books of the same genre. More precisely, our model allows 1) to capture author and document style, 2) to build an interpretable representation space to be used by researchers in linguistics and literature and by the public at large, 3) to predict stylistic features such as readability indices and NER frequencies more accurately than all existing neural-based methods, and 4) to accurately identify a document’s author, even when they are unknown.

After a presentation of related works, we introduce the theoretical foundations of the VIB framework, then describe our model and how it is optimized. In the last section, we present experimental results on two tasks: author identification and stylistic features prediction. Our experiments demonstrate that our model outperforms or matches existing author embedding methods, in addition to being able to infer representations for unseen documents, measure semantic uncertainty of authors and documents, and capture author stylistic information.

2 Related Works

2.1 Author Embedding Models

Word embedding, popularized by [22], was then extended to document embedding by the same authors. More recent works [4] propose different aggregation functions of word embeddings, based on LSTM, Transformers, and Deep Averaging Networks, to build (short) document-level representations. The aggregation is learnt through classification or document pairing. More recently, [25] proceed in a similar way by fine-tuning a BERT model [6].

There are also specific works focusing on author embeddings. The Author Topic Model (ATM) [26] is a hierarchical graphical model, optimized through Gibbs sampling. It produces a distribution over jointly learnt topic factors that can be used as author features. Aut2vec [10] learns representations of authors and documents that can separate true observed pairs from negatively sampled (document, author) pairs. The distance between two representations modifies an activation function producing a probability that the pair is observed in the corpus. This approach concatenates two sub-models: the Link Info model, which takes pairs of collaborating authors, and the Content Info model, which uses pairs of authors and documents. It cannot infer representations of unseen documents and authors: the embeddings are parameters of an embedding layer. The Usr2vec model [2] learns author representations from pretrained word vectors. The authors use the same objective as [22], and add an author id to learn the representations.

2.2 Writing Style-oriented Embedding Models

While there is no consensual definition of writing style, it has always been a widely addressed research topic. In computational linguistics, the approach of [16] is often cited as a reference and gives the following definition: “Style is, on a surface level, very obviously detectable as the choice between items in a vocabulary, between types of syntactical constructions, between the various ways a text can be woven from the material it is made of.” The author further concludes on the “impossibility of drawing a clean line between meaning and style”. That is why style is commonly defined as every writing choice without semantic information.

Based on this definition, it is hard, if not impossible, to produce a cleanly annotated dataset classifying different writing styles. The workaround in most studies is to identify the most useful stylistic features to associate an author with their production. This line of work started in the 19th century with [21] and the most basic features (e.g., word and punctuation frequencies, hapax legomena, average sentence length). More recent works focus on function word frequencies [41], hybrid variables such as character n-grams [34, 31] or even Part-Of-Speech (POS) and Named Entity Recognition (NER) tag frequencies, using authorship prediction as evaluation.

Several methods try to use these stylistic features to learn document representations. For example, [20] use Doc2Vec on documents of character trigrams annotated with their position in the word or with whether they contain punctuation (NGRAM Doc2Vec). According to the authors, this captures both content and writing style. In another work, word and POS tag embeddings are learnt together before passing them through a CNN to get a sentence representation [14]. These sentences are then fed into an LSTM with a final attention layer to compute the document representation. This model is trained on the authorship attribution task.

Some works claim to capture this information in an unsupervised manner. DBert-ft [13] fine-tunes DistilBERT on the authorship attribution task, assuming that an author’s writing style must be consistent across their documents, and thus that this task builds a “stylometric latent space” when the model is trained on a reference set. Yet, for all the above models, no author representation is explicitly learnt.

3 Our model: VADES

Figure 2: VADES in one picture. We draw a single representation $z_d$ using the reparametrization trick. Author means and variances are trainable parameters (embedding layers). $L_{VADES}$ computes the probability of the author/document pair to be observed, plus a regularization term and a stylometric features-based loss, see Eq. 5.

3.1 Goal and VIB Framework

We deal with a set of documents, such as literature or blog posts. We assume each document is written by one author. Each document with index $d$ is preprocessed to extract a vector $z_d^f$ of $r=300$ stylistic features following [35].

Our goal is threefold: i) we want to build author and document representations in the same space $\mathbb{R}^{r}$ such that their proximity captures their stylistic similarity (Figure 1); ii) we want to learn a measure of variability in style for each document and author; and iii) we want our model to incorporate an off-the-shelf pre-trained text encoder such as Sentence-BERT or USE to benefit from its complex language understanding, fine-tuned on the dataset at hand using the objective we have just defined. To do that, we build an architecture based on the Variational Information Bottleneck (VIB) framework.

The VIB framework, proposed by [1], is a variational extension of the Information Bottleneck principle [36]. For a set of observations $x$ with associated labels $y$ and latent representations $z$ of these observations, the general objective function is:

$$\arg\max_{z} \; I(z,y) - \beta I(z,x), \tag{1}$$

where $I$ is the well-known Mutual Information measure, defined as:

$$I(x,y) = \iint p(x,y) \log\frac{p(x,y)}{p(x)\,p(y)} \, dx \, dy. \tag{2}$$
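To make the definition concrete, here is a minimal sketch of the discrete analogue of Eq. 2, where the double integral becomes a double sum over a joint probability table. The function name `mutual_information` and the two toy tables are illustrative assumptions, not part of the original work.

```python
import math


def mutual_information(joint):
    """Mutual information (in nats) of a discrete joint distribution.

    joint[i][j] is p(x=i, y=j); entries are non-negative and sum to 1.
    """
    px = [sum(row) for row in joint]        # marginal p(x)
    py = [sum(col) for col in zip(*joint)]  # marginal p(y)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0.0:  # the term 0 * log(0) is taken as 0
                mi += pxy * math.log(pxy / (px[i] * py[j]))
    return mi


# Toy examples (hypothetical data): independent vs. fully dependent variables.
independent = [[0.25, 0.25], [0.25, 0.25]]
dependent = [[0.5, 0.0], [0.0, 0.5]]
mi_independent = mutual_information(independent)  # no shared information
mi_dependent = mutual_information(dependent)      # equals log(2) nats
```

Independent variables share no information, while two perfectly coupled binary variables share one bit, i.e. $\log 2$ nats.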

Information Bottleneck aims at maximally compressing the information in $z$, such that $z$ is highly informative regarding the labels, i.e. $z$ can be used to predict the labels $y$. With $y$ being a set of relevant stylistic features, we would like to maximize the stylistic information captured by the representation, while minimizing the semantic one. $\beta \geq 0$ is a hyper-parameter that controls the balance between the two sub-objectives.

In this approach, $p(z|x)$ (the “encoding law”) is defined by modeling choices. Most of the time, the mutual information is intractable. We then obtain a lower bound of Eq. 1 by using variational approximations thanks to [1]:

$$-L_{vib} = \mathbb{E}[\log q(y|z)] - \beta \, KL(p(z|x) \,||\, q(z)) \tag{3}$$

where $q(y|z)$ is a variational approximation of $p(y|z)$ and $q(z)$ approximates $p(z)$. Maximizing the right-hand side of Eq. 3 (i.e., minimizing $L_{vib}$) leads to increasing Eq. 1.
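The KL term of Eq. 3 has a closed form when the encoding law is a diagonal Gaussian and the approximate marginal $q(z)$ is a standard normal. The text does not state which $q(z)$ is used, so the standard normal prior below is an assumption (a common choice in the VIB literature), and the function name is hypothetical.

```python
import math


def kl_diag_gaussian_to_standard(mu, var):
    """KL( N(mu, diag(var)) || N(0, I) ) for a diagonal Gaussian.

    Closed form: 0.5 * sum(var + mu^2 - 1 - log(var)).
    Assumes q(z) is a standard normal, which the source does not specify.
    """
    return 0.5 * sum(v + m * m - 1.0 - math.log(v) for m, v in zip(mu, var))


# The divergence vanishes when the encoding law already equals the prior
# and grows as the mean moves away from zero.
kl_at_prior = kl_diag_gaussian_to_standard([0.0, 0.0], [1.0, 1.0])
kl_shifted = kl_diag_gaussian_to_standard([1.0], [1.0])
```

With this choice, the regularization term needs no sampling: only the expected log-likelihood has to be estimated.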

3.2 VIB for Embedding with Stylistic Constraints

[23] propose to use this framework to learn probabilistic representations of images. They leverage an instance of this framework based on siamese networks with a (soft) contrastive loss objective function, to separate positive observed pairs of images ($y=1$) and negative examples ($y=0$). We extend this model to document and author embedding with a stylistic constraint. Each author $a$ (resp. document $d$) is associated with a stochastic representation $z_a$ (resp. $z_d$) that is unobserved (i.e., latent). Additionally, each document is associated with a stylistic feature vector $z_d^f$ that is extracted beforehand from the corpus with usual NLP toolkits. We assume that the dimensions of $z_a$, $z_d$ and $z_d^f$ are the same ($r$).

We build a set of pairs $(a,d)$ with label $y_a=1$ if $a$ wrote $d$. We additionally draw $k$ negative pairs $(a',d)$ for each observed pair, associated with label $y_a=0$, where $a'$ is not an author of $d$. The encoding laws ($p(z|x)$) for authors and documents are normal laws. To capture stylistic information, we also build a set of pairs $(d,d)$ with label $y_f=1$ and we draw $k$ negative pairs $(d,d')$ for each observed pair, associated with label $y_f=0$. These pairs are used to train the stylistic objective: the representation $z_d$ of a document should be close to its feature vector $z_d^f$.
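The construction of author/document pairs described above can be sketched as follows. The uniform sampling of negative authors with replacement, the function name `build_author_pairs`, and the toy corpus are assumptions made for illustration; the text only states that $k$ negatives are drawn per observed pair.

```python
import random


def build_author_pairs(doc_author, authors, k, rng):
    """Return (author, document, label) triples.

    One positive pair (label 1) per document, plus k negative pairs (label 0)
    whose author is drawn uniformly among the other authors (an assumption).
    """
    pairs = []
    for doc, true_author in doc_author.items():
        pairs.append((true_author, doc, 1))
        candidates = [a for a in authors if a != true_author]
        for _ in range(k):
            pairs.append((rng.choice(candidates), doc, 0))
    return pairs


# Hypothetical toy corpus: two documents, three known authors.
doc_author = {"d1": "austen", "d2": "dickens"}
authors = ["austen", "dickens", "woolf"]
pairs = build_author_pairs(doc_author, authors, k=2, rng=random.Random(0))
```

The feature pairs $(d,d)$ and $(d,d')$ can be built the same way by replacing the author with the feature vector of the same or of another document.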

We learn the following parameters for each author $a$: a mean $\mu_a$ and a diagonal variance matrix with diagonal $\sigma_a^2$ (these are embedding layers). For a document $d$, we use a trainable text encoder to map the document’s content to a vector $d_0 \in \mathbb{R}^{r_0}$. We then build the document mean $\mu_d = f(d_0) \in \mathbb{R}^{r}$ and a diagonal variance matrix with diagonal $\sigma_d^2 = g(d_0) \in \mathbb{R}^{r}$. As we will show later, the dimension $r$ should match the number of stylistic features to make the learnt space easier to interpret, but the text encoder can output vectors of any dimension (here, $r_0$). Following [1, 23], $f$ and $g$ are neural networks. We give more details on $f$, $g$ (the “encoding functions”), and the text encoder later.

Following [23], the probability of a label is the soft contrastive loss:

$$\begin{split} q(y_a=1|z_a,z_d) &= \sigma(-c_a ||z_a - z_d||_2 + e_a) \\ q(y_f=1|z_d,z_d^f) &= \sigma(-c_f ||z_d - z_d^f||_2 + e_f), \end{split} \tag{4}$$

where $\sigma$ is the sigmoid function, $c_a, c_f > 0$ and $e_a, e_f \in \mathbb{R}$. We introduce an additional parameter $\alpha \in [0,1]$ to control the importance given to the features and to the authorship prediction objective. We can define the loss function (to minimize) based on the VIB framework as follows:

$$\begin{split} \mathcal{L} = &-(1-\alpha)\,\mathbb{E}_{p(z_a|x_a),p(z_d|x_d)}[\log q(y_a|z_a,z_d)] \\ &-\alpha\,\mathbb{E}_{p(z_d|x_d)}[\log q(y_f|z_d,z_d^f)] \\ &+\beta\left(KL(p(z_a|x_a)||q(z_a)) + KL(p(z_d|x_d)||q(z_d))\right) \end{split} \tag{5}$$

Here, $\alpha=0$ will produce representations that predict the author-document relation well but will not capture the stylistic features of the documents, as shown by [35]. With $\alpha=1$, on the contrary, the model will simply bring document embeddings closer to their feature vectors. Hence, the value of $\alpha$ needs to be carefully tuned on the dataset, depending on whether the corpus is writing-style specific, which can be judged using domain knowledge.
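As a rough illustration of how Eq. 4 and Eq. 5 combine, the sketch below evaluates the loss for a single sampled triplet (author, document, feature vector). The function names, the default values of $c_a, e_a, c_f, e_f$, the probability clamping, and the fact that the two KL terms are passed in as precomputed numbers are all assumptions made for readability.

```python
import math


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    z = math.exp(t)
    return z / (1.0 + z)


def match_probability(u, v, c, e):
    """q(y = 1 | u, v) = sigmoid(-c * ||u - v||_2 + e), as in Eq. 4."""
    dist = math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))
    return sigmoid(-c * dist + e)


def log_likelihood(y, p):
    """log q(y | ...) for a binary label y, given p = q(y = 1 | ...)."""
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # clamp to avoid log(0)
    return math.log(p) if y == 1 else math.log(1.0 - p)


def vades_loss(z_a, z_d, z_f, y_a, y_f, kl_a, kl_d, alpha, beta,
               c_a=1.0, e_a=0.0, c_f=1.0, e_f=0.0):
    """Single-sample version of Eq. 5 (the expectations are not averaged here)."""
    p_a = match_probability(z_a, z_d, c_a, e_a)  # authorship term
    p_f = match_probability(z_d, z_f, c_f, e_f)  # stylistic feature term
    return (-(1.0 - alpha) * log_likelihood(y_a, p_a)
            - alpha * log_likelihood(y_f, p_f)
            + beta * (kl_a + kl_d))


# With alpha = 0 only the authorship term contributes; with alpha = 1 only
# the feature term does.
loss_author_only = vades_loss([0.0], [0.0], [5.0], 1, 1, 0.0, 0.0, alpha=0.0, beta=0.0)
loss_feature_only = vades_loss([0.0], [0.0], [5.0], 1, 1, 0.0, 0.0, alpha=1.0, beta=0.0)
```

In this toy call the document coincides with the author but lies far from its feature vector, so the feature-only loss is the larger of the two.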

Finally, computing the expected values in Eq. 5 is intractable for a wide range of encoders. We therefore approximate them by sampling $L$ examples per observation (here, a triplet of document, author, and feature vector) following $p(z|x)$, as done in [23]. We get (the same goes for feature vector/document pairs):

$$\mathbb{E}[\log q(y_a|z_a,z_d)] \approx \frac{1}{L}\sum_{l=1}^{L} \log q(y_a|z^{(l)}_a,z^{(l)}_d) \tag{6}$$

We then use the reparametrization trick, following what is done in VAE [17]:

$$z^{(l)}_a = \mu_a + \sigma_a \odot \epsilon, \quad z^{(l)}_d = \mu_d + \sigma_d \odot \epsilon \quad \text{with} \quad \epsilon \sim \mathcal{N}(0,1)$$

This loss can now be minimized using backpropagation. In Figure 2, we show a schematic representation of our model, called VADES for Variational Author and Document Representations with Style.
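The Monte Carlo estimate of Eq. 6 combined with the reparametrization trick can be sketched in a few lines. This is an illustration under assumptions: noise is drawn independently for the author and the document (the text does not say whether $\epsilon$ is shared), $c$ and $e$ are fixed constants, and all names are hypothetical. In a real implementation an automatic differentiation library would carry gradients through $\mu$ and $\sigma$.

```python
import math
import random


def log_sigmoid(s):
    """Numerically stable log(sigmoid(s))."""
    if s >= 0:
        return -math.log1p(math.exp(-s))
    return s - math.log1p(math.exp(s))


def sample_gaussian(mu, sigma, rng):
    """Reparametrization trick: z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def mc_expected_log_prob(mu_a, sigma_a, mu_d, sigma_d, y, L, rng, c=1.0, e=0.0):
    """Estimate E[log q(y | z_a, z_d)] with L samples, as in Eq. 6."""
    total = 0.0
    for _ in range(L):
        z_a = sample_gaussian(mu_a, sigma_a, rng)
        z_d = sample_gaussian(mu_d, sigma_d, rng)
        dist = math.sqrt(sum((a - d) ** 2 for a, d in zip(z_a, z_d)))
        t = -c * dist + e
        # log q(y=1) = log sigmoid(t); log q(y=0) = log sigmoid(-t)
        total += log_sigmoid(t if y == 1 else -t)
    return total / L


rng = random.Random(0)
# With zero variance the estimate is deterministic: identical means give log(1/2).
deterministic = mc_expected_log_prob([0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0],
                                     y=1, L=10, rng=rng)
# With non-zero variance, sampled points drift apart and the estimate decreases.
noisy = mc_expected_log_prob([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0],
                             y=1, L=10, rng=rng)
```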

3.3 Encoding Functions and Choice of the Encoder

The input block of our model for documents is a text encoder, mapping a document in natural language to a vector in $\mathbb{R}^{r_0}$. Many deep architectures could be used here and trained from scratch. Nevertheless, we propose to use a pretrained text encoder.

Models that are pretrained on large datasets are now easily available online (e.g., https://huggingface.co/models). They have proven successful on many NLP tasks with a simple fine-tuning phase (the only constraint being to avoid catastrophic forgetting). Additionally, the VIB framework naturally accommodates a pretrained text encoder, as shown by [19]. The encoder’s output should then be mapped to the document mean and variance. Both [19, 12] map the text encoder output to the document’s mean (the $f$ function) and variance (the $g$ function) using a Multi Layer Perceptron (MLP). This approach is simple and fast. In our experiments, we build $f$ and $g$ as two-layer MLPs with tanh and linear activations, with the same input and intermediate dimensions ($r_0$). Note that the output dimension of $f$ and $g$ should be the same as the number of stylistic features ($r$).
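A minimal sketch of such an encoding function is given below: a hidden layer of width $r_0$ with tanh, followed by a linear output layer of width $r$. The weight initialisation, the helper names, and the toy dimensions are assumptions. The text does not say how positivity of the variance is enforced, so reading the linear output of $g$ as a log-variance is also an assumption.

```python
import math
import random


def make_mlp(r0, r, rng):
    """Random parameters for a two-layer MLP: r0 -> r0 (tanh) -> r (linear)."""
    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {"W1": matrix(r0, r0), "b1": [0.0] * r0,
            "W2": matrix(r, r0), "b2": [0.0] * r}


def affine(W, b, x):
    """Compute W x + b for a list-of-rows matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def mlp_forward(params, x):
    hidden = [math.tanh(h) for h in affine(params["W1"], params["b1"], x)]
    return affine(params["W2"], params["b2"], hidden)  # linear output layer


rng = random.Random(0)
r0, r = 4, 3                      # toy sizes; the paper uses r = 300
f = make_mlp(r0, r, rng)          # maps encoder output to the document mean
g = make_mlp(r0, r, rng)          # maps encoder output to the document variance
d0 = [0.1, -0.2, 0.3, 0.4]        # stand-in for the text encoder output
mu_d = mlp_forward(f, d0)
var_d = [math.exp(v) for v in mlp_forward(g, d0)]  # assumed log-variance output
```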

Several constraints arise regarding the pretrained encoder itself. We would like our model to be able to capture stylistic information from a given document. As shown in [5, 35], state-of-the-art models trained on large datasets already capture complex grammatical and syntactic notions in their representations, and therefore have the explanatory power required for our objective. Moreover, our model must be able to deal with long texts, as it will be used in a literary context to process novels, dramas, and essays, where writing style comes into play the most. This is a serious problem: for example, the widely used BERT model is limited to 512 tokens. Alternative models such as [39] make it possible to apply transformers to long documents. To circumvent this issue, we use the Deep Averaging Network implementation of the Universal Sentence Encoder (USE) from [4]. It has several advantages over the latter works: it imposes no length constraint, it is faster than transformer-based methods, and it outperforms Sentence-BERT on stylistic features prediction [35]. Testing other encoder models is left to future work. Finally, note that our model is language agnostic (as it depends on an out-of-the-box text encoder) and can infer representations for unseen documents.

4 Authorship Attribution Datasets

4.1 IMDb Corpus

The IMDb (Internet Movie Database) corpus is one of the most used ones regarding the authorship attribution task. It was introduced by [32] and is composed of 271,000 movie reviews from 22,116 online users. However, most works are evaluated on the reduction of this dataset to only 62 authors with 1,000 texts each (IMDb62). Thus, we benchmark our model on IMDb62. As shown later, the task of authorship attribution on this corpus is more or less solved, due to the low number of authors.

4.2 Project Gutenberg Dataset

Project Gutenberg is a multilingual library of more than 60,000 e-books for which U.S. copyright has expired. It started in 1971 and is freely available. We gathered the corpus using [11]. Most of the books are classical novels, dramas, essays, etc. from different eras, which is relevant when studying writing style and represents our context of application quite well. To keep as many authors as possible, we randomly sample 10 texts for each author with such a production, leaving 664 authors in our Reduced Project Gutenberg Dataset (R-PGD) (10 times more than IMDb62). To be able to deal with such works, we only keep the first 200 sentences of each book.
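One possible reading of this reduction step is sketched below: authors with fewer texts than required are dropped, a fixed number of texts is sampled for the others, and each text is truncated to its first sentences. The function name, the interpretation of "such a production" as "at least 10 texts", and the assumption that books are already split into sentences are all illustrative choices.

```python
import random


def build_reduced_corpus(books_by_author, texts_per_author, max_sentences, rng):
    """Sample a fixed number of texts per author and truncate each one.

    books_by_author maps an author to a list of books, each book being a list
    of sentences (sentence splitting is assumed to be done beforehand).
    """
    reduced = {}
    for author, books in books_by_author.items():
        if len(books) < texts_per_author:
            continue  # not enough texts for this author: skip (assumed reading)
        chosen = rng.sample(books, texts_per_author)
        reduced[author] = [book[:max_sentences] for book in chosen]
    return reduced


# Hypothetical toy corpus; the paper's setting would be
# texts_per_author=10 and max_sentences=200.
books_by_author = {
    "prolific": [[f"book {b}, sentence {s}" for s in range(5)] for b in range(4)],
    "occasional": [["only sentence"]],
}
reduced = build_reduced_corpus(books_by_author, texts_per_author=3,
                               max_sentences=2, rng=random.Random(0))
```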

4.3 Blog Authorship Corpus

This dataset is composed of 681,288 posts from 19,320 authors gathered in the early 2000s by [30]. There are approximately 35 posts and 7,250 words per user. We only take 500 bloggers with at least 50 blog posts to build our reduced dataset of the Blog Authorship Corpus (R-BAC). This dataset is also used in several authorship attribution benchmarks, which only keep the top 10 or 50 authors with the most productions. We will also test our model on these extractions of the corpus.

Figure 3: Correlation score between the $i^{th}$ embedding coordinate and the $i^{th}$ stylistic feature for VADES representations on R-PGD. A few values in the Punctuation category are null as they were not found anywhere in the corpus.

These last two datasets (PGD and BAC) represent two common uses of author embedding (classic literature and web analysis) with a large number of authors. Usual datasets for authorship attribution (CCAT50, NYT, IMDb62) contain far fewer classes and are further from our context of a web-extracted corpus (from Blogger or Wordpress for example). The two datasets are also stylistically and structurally different, allowing us to evaluate our approach on various textual formats. For each dataset, we perform an 80/20 train-test stratified split.
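A stratified split keeps the same proportion of each author in the train and test sets. The sketch below shows one way to do this with the standard library only; the function name, the rounding rule, and the toy labels are assumptions, and the original experiments may rely on an existing library routine instead.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction, rng):
    """Split document indices so each label keeps the same test proportion."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    train, test = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Toy example: 3 authors with 10 documents each, split 80/20.
labels = ["a"] * 10 + ["b"] * 10 + ["c"] * 10
train_idx, test_idx = stratified_split(labels, 0.2, random.Random(0))
```

With 10 texts per author, as in R-PGD, each author contributes 8 training and 2 test documents.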

Figure 4: $i^{th}$ embedding axis against $i^{th}$ stylistic feature for each author representation, for a selection of 4 given features. We can see a correlation between each feature and its respective embedding axis.

Dataset statistics
Dataset | Authors | Avg. Tokens | Avg. Texts
IMDb62 | 62 | 341 (±223) | 1000 (±0)
BAC10 | 10 | 91 (±184) | 2350 (±639)
BAC50 | 50 | 98 (±167) | 1466 (±562)
R-BAC | 500 | 243 (±342) | 50 (±0)
R-PGD | 664 | 2315 (±961) | 10 (±0)
Table 1: Descriptive statistics for the 3 datasets and their decomposition. BAC: Blog Authorship Corpus, PGD: Project Gutenberg Dataset.
Hyperparameter grid search
Hyperparameter | Grid
# negative pairs | {1, 5, 10, 20}
Monte Carlo sampling | {1, 5, 10, 20}
Learning rate | {1e-2, 1e-3, 1e-4, 1e-5}
$\beta$ | {1e-1, 1e-2, …, 1e-12}
Feature loss | {L2, Cross-Entropy}
Table 2: Grid search used for hyperparameter selection. Selected value in bold.
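To give an idea of the size of this search space, the grid can be enumerated as below. This is only a sketch: the dictionary keys are hypothetical names, the $\beta$ grid is read as every power of ten from 1e-1 to 1e-12 (an assumption about the elided values), and the paper does not state whether the full Cartesian product was explored.

```python
from itertools import product

# Hypothetical key names; values follow Table 2.
grid = {
    "n_negative_pairs": [1, 5, 10, 20],
    "mc_samples": [1, 5, 10, 20],
    "learning_rate": [1e-2, 1e-3, 1e-4, 1e-5],
    # Assumption: the elided beta values are all powers of ten in between.
    "beta": [10.0 ** -k for k in range(1, 13)],
    "feature_loss": ["L2", "Cross-Entropy"],
}

def iter_configs(grid):
    """Yield one dict per point of the Cartesian product of the grid."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(iter_configs(grid))
print(len(configs))  # 4 * 4 * 4 * 12 * 2 = 1536
```

Under these assumptions an exhaustive search would cover 1536 configurations, which suggests why a coarse grid is used given the training time per run.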

5 Experiments

5.1 Parameter Setting and Competitors

In this section, we present implementation details for our method and its competitors. For the encoder functions $f$ and $g$, we use the architectures presented in the previous section with batch normalization, a dropout rate of 0.2 and L2 regularization (1e-5). Grid-search parameters are detailed in Table 2. For $L$, we obtain a good trade-off between accuracy and speed with $L=10$, as performance quickly reaches a plateau when increasing its value. We can summarize the tuning of $\alpha$ as follows:

  • $\alpha=0$ implies no feature loss and thus no stylistic information,

  • $\alpha=0.5$ gives the same importance to feature loss and author loss,

  • $\alpha=0.9$ emphasizes the feature loss to boost style detection.
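The three settings above are consistent with a convex combination of the two objectives. The sketch below illustrates that reading only: the exact objective is defined in the previous section, and both the function name `combined_loss` and the form $(1-\alpha)\,\mathcal{L}_{author} + \alpha\,\mathcal{L}_{feature}$ are assumptions here. The $\beta$-weighted VIB regularization term is deliberately left out.

```python
def combined_loss(author_loss, feature_loss, alpha):
    """Assumed convex combination of author loss and feature loss."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # alpha = 0 keeps only the author loss, alpha = 1 only the feature loss.
    return (1.0 - alpha) * author_loss + alpha * feature_loss
```

For instance, with an author loss of 2.0 and a feature loss of 4.0, `alpha=0.5` weights both equally and returns 3.0.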

We train the model for 15 epochs on R-PGD and R-BAC, and for 5 epochs on IMDb62, BAC10 and BAC50, as the number of authors is around ten times smaller. We use a partition of two V100 GPUs. On a single GPU, training the model on the R-PGD dataset takes around 10 hours. In the following section, we report the results for the best version of VADES only. As an ablation study, to justify the use of both the VIB framework and stylistic features, we compare our model with variants lacking each of these components (called VADES no-VIB and VADES ($\alpha=0$), respectively). The code is available on GitHub and will be shared if the paper is accepted. All the datasets are available online.

We compare our model with several baselines. We use NGRAM Doc2Vec [20] and a simple average-based version of USE [4] (a document representation is built from the average of its sentence encodings, and an author representation is the average of their documents). We also compare our approach to DBert-ft [13], a document embedding method where DistilBERT is fine-tuned on the authorship attribution task. The author embeddings are built by averaging the representations of the documents they wrote. We use the parameters detailed in the authors' implementation (https://github.com/hayj/DBert-ft).
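The averaging used to derive author representations for these baselines can be sketched as follows. The helper names `mean_vector` and `author_embeddings` are illustrative, and document embeddings are assumed to be plain lists of floats of equal length.

```python
def mean_vector(vectors):
    """Coordinate-wise average of equally sized vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def author_embeddings(doc_embeddings, doc_authors):
    """Average the embeddings of the documents written by each author."""
    grouped = {}
    for emb, author in zip(doc_embeddings, doc_authors):
        grouped.setdefault(author, []).append(emb)
    return {author: mean_vector(embs) for author, embs in grouped.items()}
```

The same `mean_vector` helper applies one level down for the USE baseline, where a document vector is the average of its sentence encodings.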

5.2 Evaluation Tasks

We first evaluate how well the baselines and VADES capture writing style. As writing style is a complex and still debated notion, there is no supervised dataset to evaluate how well a model grasps it. We therefore use a proxy task that consists in predicting stylistic features from the latent representations. We follow the experimental protocol of [35]. The stylistic features are extracted using the spaCy word and sentence tokenizers, POS tagger and Named Entity Recognition, the spaCy English stopwords and the NLTK CMU Dictionary. For each author, we aim to predict the value of all stylistic features from their embedding. Each feature is standardized before regression. We use an SVR with a Radial Basis Function (RBF) kernel, as it offered both quick training and the best results among the kernels we tried. We evaluate models using Mean Squared Error (MSE) following a 10-fold cross-validation scheme.
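The evaluation loop for a single stylistic feature can be sketched as below. To stay dependency-free, the regressor is a placeholder that always predicts the training mean; in the protocol described above it would be replaced by an RBF-kernel SVR (for example scikit-learn's `SVR(kernel="rbf")`) wrapped in the same `fit_predict` interface. The function names, the fold assignment and the choice to standardize over the whole dataset are assumptions.

```python
import random
import statistics

def standardize(values):
    """Rescale values to zero mean and unit (population) variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumes the feature is not constant
    return [(v - mean) / std for v in values]

def mean_predictor(train_x, train_y, test_x):
    """Placeholder regressor: always predicts the training mean."""
    mu = statistics.fmean(train_y)
    return [mu for _ in test_x]

def cross_validated_mse(x, y, fit_predict=mean_predictor, k=10, seed=0):
    """Average test MSE over k folds for one standardized feature."""
    y = standardize(y)
    order = list(range(len(y)))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    fold_mse = []
    for held_out in folds:
        test = set(held_out)
        train = [i for i in order if i not in test]
        preds = fit_predict([x[i] for i in train], [y[i] for i in train],
                            [x[i] for i in held_out])
        errors = [(p - y[i]) ** 2 for p, i in zip(preds, held_out)]
        fold_mse.append(statistics.fmean(errors))
    return statistics.fmean(fold_mse)
```

Because each feature is standardized, the mean predictor scores an MSE of about 1. This gives a reference point for reading the results: an MSE close to 1 suggests that the embedding carries little usable information about the feature, while lower values indicate that the feature can be recovered from the embedding.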

Secondly, we perform authorship attribution, the task of predicting the author of a given document. We compare VADES with several other authorship attribution methods even though they do not necessarily perform representation learning. Each dataset is split into train and test sets with an 80/20 ratio. For our model, we repeat the evaluation scheme 5 times. For embedding methods without a classification head, we associate each document with its most plausible author using cosine similarity. We use accuracy to evaluate these results (the percentage of correctly predicted authors out of all data points).
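For embedding methods without a classification head, the attribution rule and the accuracy metric can be sketched as follows. The function names and the representation of embeddings as plain lists are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def attribute(doc_vec, author_vecs):
    """Return the author whose embedding is most similar to the document."""
    return max(author_vecs, key=lambda name: cosine(doc_vec, author_vecs[name]))

def accuracy(doc_vecs, true_authors, author_vecs):
    """Share of documents attributed to their true author."""
    hits = sum(attribute(d, author_vecs) == a
               for d, a in zip(doc_vecs, true_authors))
    return hits / len(true_authors)
```

Here `author_vecs` maps each candidate author to an embedding learnt on the training split, and `doc_vecs` holds the embeddings of the held-out documents.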

Average MSE Regression Score along with standard deviation (SVR Model) on R-PGD dataset
Embedding | Letters | Numbers | Structural | Punctuation | Func. words | TAG | NER | Indexes
Content-Info | 0.67 (0.17) | 0.88 (0.12) | 0.55 (0.19) | 0.68 (0.16) | 0.72 (0.19) | 0.65 (0.17) | 0.74 (0.14) | 0.50 (0.16)
Ngram Doc2Vec | 0.63 (0.20) | 0.88 (0.12) | 0.51 (0.20) | 0.58 (0.21) | 0.68 (0.19) | 0.59 (0.19) | 0.71 (0.14) | 0.45 (0.15)
USE | 0.61 (0.27) | 0.86 (0.09) | 0.34 (0.18) | 0.59 (0.26) | 0.65 (0.24) | 0.45 (0.29) | 0.65 (0.17) | 0.27 (0.15)
DBert-ft | 0.79 (0.16) | 0.92 (0.09) | 0.65 (0.15) | 0.82 (0.17) | 0.84 (0.13) | 0.74 (0.14) | 0.84 (0.08) | 0.60 (0.14)
VADES no-VIB (0.5) | 0.55 (0.23) | 0.67 (0.11) | 0.32 (0.14) | 0.66 (0.27) | 0.58 (0.21) | 0.44 (0.27) | 0.62 (0.16) | 0.24 (0.14)
VADES (0.0) | 0.84 (0.24) | 0.91 (0.12) | 0.66 (0.13) | 0.85 (0.18) | 0.91 (0.15) | 0.71 (0.23) | 0.88 (0.09) | 0.61 (0.16)
VADES (0.5) | 0.50 (0.22) | 0.60 (0.11) | 0.28 (0.14) | 0.62 (0.27) | 0.53 (0.21) | 0.40 (0.27) | 0.58 (0.15) | 0.20 (0.11)
VADES (0.9) | 0.47 (0.22) | 0.53 (0.10) | 0.26 (0.13) | 0.59 (0.28) | 0.50 (0.21) | 0.39 (0.26) | 0.56 (0.15) | 0.19 (0.10)
Average MSE Regression Score along with standard deviation (SVR Model) on R-BAC dataset
Embedding | Letters | Numbers | Structural | Punctuation | Func. words | TAG | NER | Indexes
Content-Info | 0.80 (0.15) | 0.85 (0.07) | 0.62 (0.23) | 0.92 (0.09) | 0.87 (0.12) | 0.90 (0.05) | 0.93 (0.07) | 0.70 (0.29)
Ngram Doc2Vec | 0.77 (0.16) | 0.88 (0.05) | 0.67 (0.16) | 0.78 (0.13) | 0.84 (0.12) | 0.82 (0.09) | 0.86 (0.11) | 0.67 (0.13)
USE | 0.67 (0.25) | 0.83 (0.05) | 0.45 (0.20) | 0.78 (0.17) | 0.81 (0.17) | 0.63 (0.21) | 0.80 (0.17) | 0.38 (0.18)
DBert-ft | 1.05 (0.09) | 1.05 (0.07) | 1.01 (0.05) | 0.98 (0.22) | 1.05 (0.09) | 0.95 (0.19) | 0.91 (0.20) | 1.03 (0.07)
VADES (0.9) | 0.52 (0.23) | 0.55 (0.09) | 0.31 (0.17) | 0.76 (0.22) | 0.67 (0.20) | 0.57 (0.20) | 0.73 (0.18) | 0.32 (0.20)
Table 3: Feature prediction on R-PGD and R-BAC. MSE score (standard deviation in parentheses) on the prediction of stylistic features from author embeddings on the R-PGD and R-BAC datasets using SVR. The 300 stylistic features are grouped by families. The best scores for each axis are in bold. Our model ($\alpha$ value in parentheses) performs best with $\alpha=0.9$.
Approach | IMDb62 (62 authors) | Blog Authorship Corpus (10 authors) | Blog Authorship Corpus (50 authors)
Stylistic features + LR | 88.2 (0.1) | 40.9 (0.2) | 28.4 (0.2)
LDA+Hellinger* [7] | 82 | 52.5 | 18.3
Impostors* [18] | x | 35.4 | 22.6
Word Level TF-IDF* | 91.4 | x | x
CNN-Char* [27] | 91.7 | 61.2 | 49.4
C.Att + Sep.Rec.* [33] | 91.8 | x | x
Token-SVM* [32] | 92.5 | x | x
SCAP* [9] | 94.8 | 48.6 | 41.6
Cont. N-gram* [29] | 94.8 | 61.3 | 52.8
(C+W+POS)/LM* [15] | 95.9 | x | x
N-gram + Style* [28] | 95.9 | x | x
N-gram CNN* [40] | x | 63.7 | 53.1
Syntax CNN* [40] | 96.2 | 64.1 | 56.7
DBert-ft [13] | 96.7 (0.2) | 64.3 (0.2) | 58.5 (0.2)
BertAA* [8] | 93.0 | 65.4 | 59.7
VADES no-VIB (0.5) | 91.3 (0.1) | 60.9 (0.2) | 50.2 (0.2)
VADES (0.0) | 94.9 (0.2) | 62.6 (0.2) | 52.4 (0.2)
VADES (0.1) | 95.6 (0.2) | 63.8 (0.2) | 53.8 (0.2)
Table 4: Authorship attribution accuracy on IMDb62 and the Blog Authorship Corpus. Results marked with * are taken from other papers; x denotes a missing result on a given dataset. Best model in bold and second best underlined. We compare our model ($\alpha$ value in parentheses) with several authorship attribution models. Our model competes with SOTA models while learning meaningful representations of writing style for documents and authors.

5.3 Results on capturing writing style

As explained earlier, we use the author embeddings to perform regression and predict each stylistic feature. As shown in Table 4, a simple logistic regression on these stylistic features alone reaches decent scores in authorship attribution, close to those of Universal Sentence Encoder, which is a state-of-the-art method in sentence embedding. As these features contain strictly no topic information, this demonstrates how good they are as a proxy of writing style. Thus, a model able to capture them is able to capture writing style.

Results on the style MSE metric are shown in Table 3. As expected, our model outperforms every baseline on almost all axes; the only exception is Punctuation on R-PGD, where Ngram Doc2Vec (0.58) is marginally ahead of VADES with $\alpha=0.9$ (0.59). DBert-ft, trained only on the authorship attribution objective, performs the worst among the baselines on nearly every axis. Even though this approach is based on fine-tuned language models, which already capture syntactic and grammatical notions [5], this is not the information that seems to be retained by the network when trained on the authorship attribution task. This is consistent with what was shown in [35]. The models may mainly focus on semantic information to predict the author-document relation. Interestingly, we observe that a simple average of USE representations performs quite well, which confirms that it can successfully capture complex linguistic concepts. VADES is guided by the feature loss to do so.

On a qualitative note, we present two additional visualisations to underline the strong advantage of VADES for linguistic and stylistic applications. In Figure 5, we present a t-SNE 2D projection of the books of the R-PGD dataset colored by their publication year. A clear color gradient appears, demonstrating that our model can grasp the evolution of writing style over the last centuries. Figure 1 shows a toy example of a t-SNE 2D projection of well-known authors from the R-PGD dataset and their books (we use $\alpha=0.5$). The objects are distributed in the space across clear author-specific clusters. The most interesting observation concerns documents that lie outside of their author's cluster: Thus Spake Zarathustra: A Book for All and None by Nietzsche is a philosophical poem, closer to Hugo, while the rest of his production is mostly essays. The same holds for The Power of Darkness by Tolstoï, a five-act drama, whose embedding is closer to Shakespeare than to Tolstoï's novels. The version of Hamlet presented here is fully commented, and is thus closer to the analytical and philosophical works of Nietzsche and Plutarch, as shown in the figure. We also represent the variance learnt by the model through the size of the author dot. Hugo, who wrote famous novels as well as poetry and dramas, has a greater variance than other authors.

Figure 5: Book representations from Project Gutenberg and their writing period. We sampled 10 R-PGD books per decade starting in 1650 and present here a 2D t-SNE projection of their VADES embeddings.

5.4 Interpretability of the Representation Space

As we use the $L_2$ distance between document representations and stylistic feature vectors, each of the 300 embedding axes corresponds to one given stylistic feature. The soft contrastive loss enforces the $L_2$ constraint (bringing document embeddings and stylistic feature vectors closer) while being more flexible than a simple regression loss. When experimenting with the latter, the task proved too hard and was detrimental to both authorship attribution scores and the writing style loss.

In Figure 3, we show the Pearson correlation score between the $i^{th}$ stylistic feature and the corresponding embedding axis. For each feature, the correlation with its matching axis is always higher than with any other embedding coordinate. To further illustrate the interpretability of the embedding space, Figure 4 plots, for a selection of 4 stylistic features, the value of the matching embedding coordinate for each author. The representation space learnt by VADES is interpretable in terms of writing style. In the context of a multidisciplinary project involving several researchers in literature and linguistics, this is a significant added value.
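The per-axis correlation described here can be computed as in the sketch below. The function names are illustrative, and the inputs are assumed to be two lists of equal-length vectors (one embedding and one stylistic feature vector per author). A feature that is constant over the corpus has no defined correlation, which the sketch returns as NaN.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; NaN if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else float("nan")

def axis_feature_correlations(embeddings, features):
    """Pearson r between the i-th embedding axis and the i-th feature."""
    dim = len(embeddings[0])
    return [pearson([e[i] for e in embeddings], [f[i] for f in features])
            for i in range(dim)]
```

Comparing each value with the correlations obtained against the other axes is then a direct way to check that the matching axis is the most correlated one.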

5.5 Results on the Authorship Attribution Task

Results on the authorship attribution task for IMDb62 and the Blog Authorship Corpus are presented in Table 4 against state-of-the-art solutions (not necessarily embedding models). Our model ranks among the top five approaches on IMDb62 and fourth on both Blog Authorship Corpus subsets, outperforming several recent competitors even though authorship attribution is not its main task. It is mainly outpaced by Syntax CNN [40] and by DBert-ft [13] and BertAA [8], two variants of BERT fine-tuned on the authorship attribution task. As shown by [8], BERT and DistilBERT are particularly well suited to balanced datasets with short texts such as IMDb62 and the Blog Authorship Corpus. The DBert-ft model splits every document into 512 chunks during training, building an even bigger corpus with substantial improvement, but this is hardly reproducible with our feature loss. BertAA feeds documents encoded by a fine-tuned BERT, together with a set of stylistic features and the most frequent bi-grams and tri-grams, to a Logistic Regression. This clearly improves performance on the Blog Authorship Corpus, as this dataset is a mix of several genres and styles, whereas IMDb62 contains only movie reviews. This supports our use of stylistic features. Syntax CNN encodes each sentence of a document separately with its syntax. Unfortunately, this model was hardly reproducible and could not be tested in feature regression using intermediate representations. For VADES, lower values of $\alpha$ yield the best accuracy in authorship attribution on these datasets. The additional information brought by stylistic features benefits authorship attribution when texts are longer.

5.6 Ablation Study and Effect of $\alpha$

Here we compare our model to its variants without VIB and without feature loss. Both variants underperform on both tasks. First, the VIB paradigm offers more versatility than fixed document and author representations, which is key to grasping a complex notion such as writing style. Second, the feature loss brings additional information for authorship prediction, as shown by BertAA, which uses stylistic features to improve BERT classification results. Here, our framework makes it possible to use this information directly for document and author embeddings. In Figure 6, we evaluate the influence of $\alpha$, which balances the importance given to author loss and feature loss, on both feature regression and authorship attribution. Adding just a little stylistic feature information ($\alpha=0.1$) improves the accuracy of our model in authorship attribution. It forces the model to extract discriminant stylistic information from the input. Surprisingly, a similar phenomenon appears when shutting down the author loss ($\alpha=1$): the style score deteriorates, as authors tend to use a consistent writing style across their documents. Thus, bringing a writer closer to the representations of their documents also helps to capture their writing habits ([13] call this “intra-author consistency”).

Figure 6: Effect of $\alpha$. We plot the evolution of the style evaluation metric (average MSE score) and of the accuracy with the $\alpha$ parameter for R-PGD.

6 Conclusion

In this article, we presented VADES, a new author and document embedding method which leverages stylistic features. It has several advantages compared to existing works: it easily integrates any pretrained text encoder, it makes it possible to compare authors and documents of any length (e.g., for authorship attribution), and it builds an interpretable representation space by incorporating stylistic features widely used in computational linguistics. It is also able to infer representations for unseen documents, unlike most prior approaches. We demonstrated that VADES outperforms existing embedding baselines in stylistic feature prediction, often by a large margin, while staying competitive in authorship attribution.

In further experiments, we will incorporate modern text encoders, such as LLaMA [37]. They are much more difficult to adapt to this task, but as most recent Large Language Models are trained in an autoregressive way, they might have the expressive power needed to grasp stylistic aspects of authors' productions.

References

  • [1] Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy, ‘Deep variational information bottleneck’, Proceedings of the International Conference on Learning Representations (ICLR), (2017).
  • [2] Silvio Amir, Glen Coppersmith, Paula Carvalho, Mario J Silva, and Byron C Wallace, ‘Quantifying mental health from social media with neural user embeddings’, in Proceedings of the Machine Learning for Healthcare Conference, pp. 306–321, (2017).
  • [3] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, et al., ‘Language models are few-shot learners’, CoRR, abs/2005.14165, (2020).
  • [4] Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, et al., ‘Universal sentence encoder for english’, in Proceedings of the 2018 Conference on EMNLP: System Demonstrations, pp. 169–174, (2018).
  • [5] Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning, ‘What does BERT look at? an analysis of bert’s attention’, in Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, volume abs/1906.04341, (2019).
  • [6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, ‘Bert: Pre-training of deep bidirectional transformers for language understanding’, in Proceedings of the 2019 Conference of the North American Chapter of the ACL: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, (2019).
  • [7] Sara El Manar El Bouanani and Ismail Kassou, ‘Authorship analysis studies: A survey’, International Journal of Computer Applications, 86, (12 2013).
  • [8] Maël Fabien, Esau Villatoro-Tello, Petr Motlicek, and Shantipriya Parida, ‘BertAA : BERT fine-tuning for authorship attribution’, in Proceedings of the 17th International Conference on Natural Language Processing (ICON), pp. 127–137, Indian Institute of Technology Patna, Patna, India, (December 2020). NLP Association of India (NLPAI).
  • [9] Georgia Frantzeskou, Efstathios Stamatatos, Stefanos Gritzalis, and Sokratis Katsikas, ‘Source code author identification based on n-gram author profiles’, in Artificial Intelligence Applications and Innovations, pp. 508–515, Boston, MA, (2006). Springer US.
  • [10] Soumyajit Ganguly, Manish Gupta, Vasudeva Varma, Vikram Pudi, et al., ‘Author2vec: Learning author representations by combining content and link information’, in Proceedings of the 25th International Conference Companion on World Wide Web, pp. 49–50. International World Wide Web Conferences Steering Committee, (2016).
  • [11] Martin Gerlach and Francesc Font-Clos, ‘A standardized project gutenberg corpus for statistical analysis of natural language and quantitative linguistics’, Entropy, 22(1), 126, (2020).
  • [12] Antoine Gourru, Julien Velcin, Christophe Gravier, and Julien Jacques, ‘Dynamic gaussian embedding of authors’, in Proceedings of the 2022 The Web Conference on World Wide Web. International World Wide Web Conferences Steering Committee, (2022).
  • [13] Julien Hay, Bich-Lien Doan, Fabrice Popineau, and Ouassim Ait Elhara, ‘Representation learning of writing style’, in Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020), pp. 232–243, Online, (November 2020). ACL.
  • [14] Fereshteh Jafariakinabad and Kien A Hua, ‘Style-aware neural model with application in authorship attribution’, in 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA), pp. 325–328. IEEE, (2019).
  • [15] Jaap Kamps, Giannis Tsakonas, Yannis Manolopoulos, Lazaros Iliadis, and Ioannis Karydis, ‘Research and advanced technology for digital libraries 21st’, in Proceedings: 21st International Conference on Theory and Practice of Digital Libraries, 2017, Thessaloniki, Greece, (2017).
  • [16] Jussi Karlgren, ‘The wheres and whyfores for studying textual genre computationally’, AAAI Technical Report (7), 68–70, (2004).
  • [17] Diederik P Kingma and Max Welling, ‘Auto-encoding variational bayes’, Proceedings of the International Conference on Learning Representations (ICLR), (2014).
  • [18] Moshe Koppel and Yaron Winter, ‘Determining if two documents are written by the same author’, Journal of the Association for Information Science and Technology, 65(1), 178–187, (2014).
  • [19] Rabeeh Karimi Mahabadi, Yonatan Belinkov, and James Henderson, ‘Variational information bottleneck for effective low-resource fine-tuning’, in International Conference on Learning Representations, (2021).
  • [20] Suraj Maharjan, Deepthi Mave, Prasha Shrestha, et al., ‘Jointly learning author and annotated character N-gram embeddings: A case study in literary text’, International Conference RANLP, (2019).
  • [21] T. C. Mendenhall, ‘The characteristic curves of composition’, Science, ns-9, (1887).
  • [22] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean, ‘Distributed representations of words and phrases and their compositionality’, in Advances in neural information processing systems, pp. 3111–3119, (2013).
  • [23] Seong Joon Oh, Kevin Murphy, Jiyan Pan, Joseph Roth, Florian Schroff, and Andrew Gallagher, ‘Modeling uncertainty with hedged instance embedding’, in Proceedings of the International Conference on Learning Representations, (2019).
  • [24] Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro Yasunaga, and Diyi Yang, ‘Is chatgpt a general-purpose natural language processing task solver?’, arXiv:2302.06476, (2023).
  • [25] Nils Reimers and Iryna Gurevych, ‘Sentence-bert: Sentence embeddings using siamese bert-networks’, Proceedings of the International Conference on EMNLP, (2019).
  • [26] Michal Rosen-Zvi, Thomas Griffiths, Mark Steyvers, and Padhraic Smyth, ‘The author-topic model for authors and documents’, in Proceedings of the 20th conference on Uncertainty in artificial intelligence, pp. 487–494, (2004).
  • [27] Sebastian Ruder, Parsa Ghaffari, and John G. Breslin, ‘Character-level and multi-channel convolutional neural networks for large-scale authorship attribution’, CoRR, abs/1609.06686, (2016).
  • [28] Yunita Sari, Mark Stevenson, and Andreas Vlachos, ‘Topic or Style? Exploring the Most Useful Features for Authorship Attribution’, 27th International Conference on Computational Linguistics, 343–353, (2018).
  • [29] Yunita Sari, Andreas Vlachos, and Mark Stevenson, ‘Continuous n-gram representations for authorship attribution’, in Proceedings of the 15th Conference of the European Chapter of the ACL: Volume 2, Short Papers, pp. 267–273, Valencia, Spain, (April 2017). ACL.
  • [30] Jonathan Schler, Moshe Koppel, Shlomo Argamon, and James W. Pennebaker, ‘Effects of age and gender on blogging.’, in AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs, pp. 199–205. AAAI, (2006).
  • [31] Roy Schwartz, Oren Tsur, Ari Rappoport, and Moshe Koppel, ‘Authorship attribution of micro-messages’, in Proceedings of the 2013 Conference on EMNLP, Seattle, Washington, USA, (October 2013). ACL.
  • [32] Yanir Seroussi, Ingrid Zukerman, and Fabian Bohnert, ‘Authorship Attribution with Topic Models’, Computational Linguistics, 40(2), 269–310, (06 2014).
  • [33] Wei Song, Chen Zhao, and Lizhen Liu, ‘Multi-task learning for authorship attribution via topic approximation and competitive attention’, IEEE Access, 7, 177114–177121, (2019).
  • [34] Efstathios Stamatatos, ‘On the robustness of authorship attribution based on character n-gram features’, Journal of Law and Policy, 21, 421–439, (01 2013).
  • [35] Enzo Terreau, Antoine Gourru, and Julien Velcin, ‘Writing style author embedding evaluation’, in Proceedings of the 58th Annual Meeting of the ACL, 2nd Workshop on Evaluation and Comparison of NLP Systems, pp. 84–93, (2021).
  • [36] Naftali Tishby, Fernando C Pereira, and William Bialek, ‘The information bottleneck method’, The 37th annual Allerton Conference on Communication, Control, and Computing, 368–377, (1999).
  • [37] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al., ‘Llama: Open and efficient foundation language models’, arXiv preprint arXiv:2302.13971, (2023).
  • [38] Min Yang and Kam-Pui Chow, ‘Authorship attribution for forensic investigation with thousands of authors’, in ICT Systems Security and Privacy Protection, pp. 339–350, Berlin, Heidelberg, (2014). Springer Berlin Heidelberg.
  • [39] Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, et al., ‘Big bird: Transformers for longer sequences’, Advances in Neural Information Processing Systems, 33, 17283–17297, (2020).
  • [40] Richong Zhang, Zhiyuan Hu, Hongyu Guo, and Yongyi Mao, ‘Syntax encoding with application in authorship attribution’, in Proceedings of the 2018 Conference on EMNLP, pp. 2742–2753, Brussels, Belgium, (October-November 2018). ACL.
  • [41] Ying Zhao and Justin Zobel, ‘Effective and scalable authorship attribution using function words’, in Information Retrieval Technology, pp. 174–189, Berlin, Heidelberg, (2005). Springer Berlin Heidelberg.