Testing learning hypotheses using neural networks by manipulating learning data

Cara Su-Yi Leong [email protected] Tal Linzen

Abstract

Although passivization is productive in English, it is not completely general — some exceptions exist (e.g. *One hour was lasted by the meeting). How do English speakers learn these exceptions to an otherwise general pattern? Using neural network language models as theories of acquisition, we explore the sources of indirect evidence that a learner can leverage to learn whether a verb can passivize. We first characterise English speakers’ judgments of exceptions to the passive, confirming that speakers find some verbs more passivizable than others. We then show that a neural network language model can learn restrictions to the passive that are similar to those displayed by humans, suggesting that evidence for these exceptions is available in the linguistic input. We test the causal role of two hypotheses for how the language model learns these restrictions by training models on modified training corpora, which we create by altering the existing training corpora to remove features of the input implicated by each hypothesis. We find that while the frequency with which a verb appears in the passive significantly affects its passivizability, the semantics of the verb does not. This study highlights the utility of altering a language model’s training data for answering questions where complete control over a learner’s input is vital.

keywords:

learnability, language models, passivization

\affiliation

[1]organization=Department of Linguistics, New York University, addressline=10 Washington Place, city=New York, postcode=10003, state=NY, country=USA \affiliation[2]organization=Department of Linguistics and Center for Data Science, New York University, addressline=60 5th Avenue, city=New York, postcode=10012, state=NY, country=USA

1 Introduction

Speakers of a language have intuitions not only about the language’s broad generalizations, but also about exceptions to those generalizations. For example, the passive is a productive construction in English; speakers freely use transitive verbs in both active and passive voice, and English-speaking children who learn novel transitive verbs in the active voice will use those verbs in the passive voice (Brooks and Tomasello, 1999; Pinker et al., 1987). However, this generalization does not hold for a small set of verbs, such as last, which is acceptable in the active but not in the passive: \pex \a The meeting lasted one hour. \a\ljudge* One hour was lasted by the meeting. \xe

One possible explanation for why speakers judge (\lastxb) as unacceptable is that people may never have heard a sentence like (\lastxb). But an English learner is also unlikely to have encountered a passive sentence like (\nextxb), which uses a rare verb:

\pex\a

The writer defenestrated the editor. \aThe editor was defenestrated by the writer. \xeThe relative scarcity of passives in everyday speech — on average, one out of ten utterances occurs in the passive voice (Roland et al., 2007) — combined with the rarity of the lemma itself make the odds of hearing a sentence like (\lastxb) vanishingly low. Yet, English speakers are likely to judge (\lastxb) as acceptable. Under these conditions, how do learners of English consistently arrive at a grammar under which last cannot be passivized but defenestrate can? This challenge of separating forms that do not occur because they are unacceptable from forms that are not observed due to chance is a learnability problem sometimes referred to as Baker’s Paradox (Baker, 1979).

Positing an innate constraint on the passivizability of certain verbs might seem like a possible solution to this instance of Baker’s Paradox. Such a solution is unlikely to work, however, since exceptions to passive constructions are not universal. For example, while stative verbs like cost and have cannot be passivized in English, they can in Kinyarwanda (Keenan and Dryer, 2007): \ex\ljudge* A new car is had by John. \xe\ex\begingl\glaIbifuungo bibiri bi-fit-w-e n-îshaâti// \glbbuttons two they-have-pass-asp by-shirt// \glft‘Two buttons are had by the shirt.’// \trailingcitation(Keenan and Dryer, 2007, 332) \endgl\xeSince restrictions on passivization are language-specific, and are likely not explicitly taught to children by caregivers, they must be acquired by learners of the language through exposure to indirect evidence.

In this paper, we explore two hypotheses concerning the kinds of indirect evidence present in the linguistic input. The entrenchement hypothesis (Braine and Brooks, 1995; Goldberg, 2006; Theakston, 2004) argues that learners use statistics over their input to both learn where a verb can appear and infer where it cannot occur. Under this hypothesis, learners who never encounter a verb in the passive but see it consistently in many other contexts will use this information to conclude that the verb is unable to appear in the passive.

A second potential source of indirect evidence for passive exceptions is the lexical semantic content of the verb (Ambridge et al., 2016; Pinker, 1989). Under this hypothesis, a verb’s passivizability is directly linked to the extent to which it denotes an action under which the theme participant is affected (Pinker, 1989), that is, it undergoes a change in state, location, or existence caused by the action’s agent participant (Beavers, 2011). Verbs inconsistent with these semantics are unacceptable in the passive.

Both a verb’s affectedness and its frequency of occurrence in the passive relative to the active, as measured through corpus studies, correlate with English speakers’ judgments of its passivizability (Ambridge et al., 2016). Yet it is difficult to study whether these factors are causally implicated in the learning of restrictions on passivization: while a verb with highly unaffected semantics is likely to be used infrequently in the passive, it is unclear whether speakers attend to semantics, frequency of use, or an entirely different signal to learn to make judgments about the verb’s passivizability. Ideally, we would disentangle these correlations by manipulating the language that a human language learner is exposed to (for example, by artificially increasing the frequency of passive forms of unaffected verbs). Since it is not possible to systematically manipulate the input of a child, however, we use neural networks, models of language acquisition as proxies (Warstadt and Bowman, 2022; Baroni, 2022), and manipulate the input to those models.

1.1 The role of neural networks

Neural network language models are systems that learn probability distributions over sequences of words given a text corpus. While they are not tuned or designed to predict linguistic acceptability, they can be used to model acceptability judgments through targeted syntactic evaluations (Lau et al., 2017; Linzen et al., 2016; Marvin and Linzen, 2018; Warstadt et al., 2020). Given a minimal pair such as (1), a neural network language model can be used to assign each sequence of words a probability score. A model that assigns a higher probability to the grammatical sentence (1) than than the ungrammatical (1) shows sensitivity to the underlying syntactic differences between the two sentences. Targeted syntactic evaluations have shown that neural network language models are sensitive to a variety of syntactic and semantic constraints (Linzen et al., 2016; Warstadt et al., 2020).

By using neural network language models as theories of acquisition, we can address an existing methodological limitation in acquisition studies with humans, as these models allowing for complete control of the input provided to the learner: specifically, We can train multiple models on corpora which differ in controlled and targeted ways, and compare how learners with the same initial state and learning goals, but different input, diverge in their behavior at the end of learning.

We present a case study using this method. We train neural network language models on approximately the same amount of linguistic data that humans are exposed to, and use these models to answer two questions: firstly, is a human-scale amount of linguistic input sufficient for a model to learn to make judgments that are similar to human judgments on passive exceptions? Secondly, what kinds of information might a learner be using from the linguistic input to learn to make those judgments?

In Experiment 1, we answer the first question in the affirmative. We show that neural network language models can learn to make acceptability judgments on exceptions to passivization that are similar (though not identical) to those of English speakers. Such behavior suggests that language models can make use of indirect evidence in the linguistic input to learn exceptions.

To answer the second question, we take inspiration from work that uses systematic interventions on a model’s training corpus to test causal links between a model’s input and its eventual inferences (e.g. Misra and Mahowald 2024; Wei et al. 2021). We generate counterfactual training corpora which withhold evidence required by the lexical semantics and entrenchment hypotheses and train language models on both types of corpora. We then compare the acceptability judgments of models trained on these counterfactual datasets against models trained on the unaltered dataset. We find that altering the relative frequency with which a verb appears in the active and passive voice significantly changes models’ judgments of how passivizable a verb is, but altering the lexical semantics associated with a verb does not. We also find that neither of these hypotheses can account fully for a verb’s passivizability, which suggests that additional signals of a verb’s passivizability are present in the linguistic input.

These findings map out a feasible path by which a learner might acquire human-like restrictions on a generalization through statistical indirect evidence in its input. More broadly, they illustrate a method by which researchers interested in understanding the role of linguistic input on learners can examine the effects of large-scale but controlled changes to the learner’s input on the outcome of learning.

1.2 Overview of experiments

In Experiment 1, we compare human judgments of the passivizability of a verb with those of a model which we train. This experiment has two parts. In Experiment 1A, we collect acceptability judgments from English speakers on a set of active and passive sentences containing verbs that are reported in the literature as unacceptable in the passive (Bach, 1980; Levin, 1993; Postal, 2004; Zwicky, 1987). In Experiment 1B, we compare our previously-collected human judgments against the acceptability judgments of a neural network language model which we train on 100M words of English text.

In Experiment 2, we test the causal factors that allowed our model to learn to approximate human judgments. In Experiment 2A, we test whether models’ acceptability judgments about passive sentences containing a highly passivizable verb change if they are trained on a dataset in which that verb occurs much less frequently in the passive than in the original corpus. In Experiment 2B, we train models on a dataset where an unpassivizable verb co-occurs with arguments associated with a canonically passivizable verb. Doing so changes the verb’s expected distribution, which allows us to test whether the lexical semantics of a verb affects its passivizability.

2 Materials

While English passives are subject to some clear restrictions — intransitive verbs cannot occur in the passive (Comrie et al., 1977) — other restrictions are less well-defined. We chose to test restrictions on the use of transitive and transitive-like verbs in the passive. We created a dataset of 140 pairs of active and passive sentences — 280 sentences total — using 28 different verbs, some of which are reported as unpassivizable in the literature. Each sentence pair consisted of an active transitive sentence (e.g. A boy dropped the cup) and a corresponding passive sentence using the be-passive (e.g. The cup was dropped by a boy). Passive sentences always contained an explicit by-phrase that matched the subject of the active sentence (i.e. the sentences had a form like The cup was dropped by a boy, and not The cup was dropped).

As test verbs, we identified five verb classes containing verbs that have been reported as unpassivizable (Bach, 1980; Levin, 1993; Postal, 2004; Zwicky, 1987). Each verb class contained verbs with similar semantics that can be substituted into the same position in a sentence in the active voice. The verbs in each verb class are given below:

•

Advantage verbs: benefit, help, profit, strengthen
•

Price verbs: cost, earn, fetch
•

Ooze verbs: discharge, emanate, emit, radiate
•

Duration verbs: last, require, take
•

Estimation verbs: approximate, match, mirror, resemble

Some of the test verbs have multiple senses, of which only one is exceptional. We created sentences using the sense of the verb reported as unacceptable in the literature. For instance, we did not use sentences that use the sense of take in (\nextxa), but included sentences with the sense illustrated in (\nextxb):

\pex\a

The photo was taken by the boy. \a\ljudge* Two days was taken by the meeting. \xe

Verb class	Active sentence frame	Passive sentence frame
Advantage	The gift my organization.	My organization was by the gift.
Price	Your book thirty dollars.	Thirty dollars was by your book.
Ooze	My machine a sound.	A sound was by my machine.
Duration	Her speech seventeen minutes	Seventeen minutes was by her speech.
Estimation	Your friend my brother.	My sketch was by your friend.

Table 1: Example sentence frames — Each verb in the verb class was substituted into frames specific to the class.

For each verb class, we first formed five sentence frames that were compatible with all the verbs in the class. Table 1 gives examples of active and passive sentence frames for each class (see A for the full list of stimuli). Verbs in the verb class were then substituted into each sentence frame, resulting in 90 total test sentence pairs: 20 pairs each from the advantage, ooze and estimation classes, which have four test verbs, and 15 pairs from the price and duration classes, which have three test verbs. For example, (\nextx) demonstrates a sentence pair generated from the sentence frame in Table 1 using the verb matched:

\pex\a

Your friend matched my brother. \aMy brother was matched by your friend. \xe

In addition to the five test verb classes, which contain verbs expected to be unacceptable in the passive, we created stimuli for two verb classes expected to be acceptable in both the active and the passive voice:

•

Agent-patient: hit, push, wash, drop, carry
•

Experiencer-theme: see, hear, know, like, remember

Given the varied semantics of the verbs in these groups, we used unique sentence pairs for each verb, yielding 50 control test sentence pairs. A total of 140 sentence pairs were created: 50 control sentence pairs and 90 test sentence pairs.

3 Experiment 1A: English speakers find some verbs more passivizable than others

We conducted a human acceptability judgment study to verify judgments from the syntax literature and measure any gradient differences in the degree to which different verbs can be passivized¹¹1Work originally reported in Leong and Linzen 2023..

3.1 Procedure

We collected acceptability judgments from English speakers on the active and passive sentences described in the Materials section. Each participant only saw either the active or the passive of any given sentence pair. Specifically, the 140 sentence pairs were divided into two groups of 70 sentence pairs (i.e. 140 sentences) such that each group contained between two to three sentence frames per verb. Each group was then split into two sets of 70 sentences such that the active and passive versions of each item were in different sets. Each set of sentences contained 70 sentences – one quarter of the test and control stimuli.

The presentation order was further counterbalanced by making four ordered lists for each group as follows. Each group was organized into two lists such that an item that appeared in the first half of of one list appeared in the second half of the other list. The order of items was pseudorandomized within those lists to ensure that not more than two active or passive sentences and no two sentences within the same verb class were seen in succession. These lists were then reversed, so that a total of four ordered sentence lists were made per sentence group.

Refer to caption — Figure 1: Example survey question

Additionally, every critical sentence was followed by at least one filler sentence. Filler sentences (26 grammatical and 52 ungrammatical) were also used as attention checks. Since the passives of control sentences were expected to be acceptable, we included a larger number of ungrammatical than grammatical fillers to balance the experimental stimuli. The full set of stimuli is available in A.

Participants were instructed to rate each sentence’s acceptability based on their gut reaction, and were told that there were no right or wrong answers. Participants rated sentences by moving a slider from “Completely unacceptable” to “Completely acceptable”, which corresponded to an integer score (invisible to them) between 0 and 100. They were not able to rate a sentence with a score of 50. Two practice sentences (one ungrammatical, one grammatical) were used to familiarize participants with the experimental setup. Figure 1 shows an example of the interface that participants used to rate sentences. The full text of the experiment instructions is given in B.

We estimated the amount of explainable variance across all acceptability judgments, as well as within each verb class, using ten split-half reliability analyses as follows. In each of the ten instances of this analysis, participants were randomly split into two groups. We obtained two point estimates for each item by calculating the mean acceptability judgment score of each item within each half. We then calculated the Pearson correlation coefficient between the item estimates for each half of the results. The mean of these ten correlation coefficients was then entered into the Spearman-Brown prophecy formula (Spearman, 1910) to calculate the corrected reliability coefficient. We use this value to estimate how aligned participants are in their acceptability judgments — participants may differ in the consistency of their judgments about different verb classes, reflecting differences in their grammars. In addition, since any predictor using the same method cannot be expected to show greater correlation with the empirical data than the two halves of the data show with each other, this predicted reliability estimate is an approximation of the highest possible correlation obtainable from the data.

3.2 Participants

We recruited 84 participants who had IP addresses located in the US and self-reported as native English speakers via the crowdsourcing platform Prolific. Each participant rated 140 sentences (70 test + 70 filler) and was paid US$3.50. The experiment took an average of 12 minutes to complete.

3.3 Results

Participants were excluded from analysis if they rated more than 15 filler sentences unexpectedly, either by giving ungrammatical sentences scores above 50 or giving grammatical sentences scores below 50. This resulted in the exclusion of 24 participants. In the analysis that follows, each sentence was rated by at least 13 participants.

We calculated the passive drop of a sentence pair as the difference in mean acceptability ratings between its active and passive version. The results are reported in Figure 2; a steeper downward gradient corresponds with a larger passive drop, or greater degree of unacceptability in the passive than in the active. Since the active and passive sentences in a sentence pair contained the same lexical items except for the auxiliary was/were and by, which are common across all sentences, directly comparing active and passive sentences isolates the effect of passivization from lexical effects that might increase the acceptability of sentences with more common verbs like helped compared to low-frequency verbs like profited.

Across all verb classes, participants gave higher scores on average to active sentences (mean: 88.5 points) than passive sentences (mean: 66.4 points). Although the passive drop was positive for all verbs, its magnitude differed across verb classes. The duration class showed the largest mean passive drop (61.9 points), and the ooze class showed the lowest mean passive drop (8.44 points) among the test verb classes. Sentences in the agent-patient class had an average passive drop of 8.86 points.

To determine whether the difference in passive drop between verb classes was significant, we fit a linear mixed-effects model to predict sentence score from sentence type and verb class as well as their interaction as fixed effects; frame and verb as random intercepts; and by-participant random slopes and intercepts for sentence type. We used the agent-patient verb class as the reference level, since passive sentences with these verbs are canonically passivizable. We found significant interactions between sentence type and verb class in four cases: estimation verbs ( $p<0.001$ ), price verbs ( $p<0.001$ ), duration verbs ( $p<0.001$ ) and experiencer-theme verbs ( $p<0.001$ ). This result indicates that passive sentences containing verbs in these three verb classes were significantly different from the canonically acceptable passive sentences containing agent-patient verbs. On the other hand, there was no significant difference in the sentence scores obtained from agent-patient verbs and ooze verbs ( $p=0.754$ ) or advantage verbs ( $p=0.106$ ).

Within the verb classes that were less passivizable than agent-patient verbs, some verbs were also more passivizable than others. We fit linear mixed-effects models to the data within each verb class to predict sentence score using verb and sentence type as fixed effects with random intercepts for frame and by-participant random slopes for sentence type. We used the verb with the smallest passive drop in the verb class as the reference level. We found that there was a significant interaction between sentence type and verb in some but not all cases. For example, last was significantly less passivizable than required ( $p<0.001$ ), but took was not ( $p=0.365$ ), and cost was less passivizable than earned ( $p<0.001$ ) but fetched was not significantly different from earned ( $p=0.257$ ). These results point to the fact that, even among verbs that can occur in the same sentences, some verbs may nonetheless be more passivizable than others (Zwicky, 1987).

Table 2 reports the results of our split-half reliability analysis, which was conducted by repeatedly splitting the dataset into two and comparing the mean item scores of each item between the two halves. The reliability of the collected acceptability judgment scores was high across all test items as well as within verb classes, suggesting that participants had similar judgments about most items.

Verb class	Split-half reliability
All items except fillers	0.92
Advantage	0.88
Estimation	0.90
Price	0.95
Duration	0.97
Ooze	0.80
Agent-Patient	0.77
Experiencer-Theme	0.86
Fillers	0.99

Table 2: Spearman-Brown-corrected split-half reliability for each verb class — Acceptability judgments showed high reliability on all test items as well as within verb classes.

3.4 Discussion

Across all verbs except approximated, and even in sentences containing canonically passivizable agent-patient verbs, participants rated active sentences more highly than passive sentences. This difference may be accounted for by pragmatic factors: each sentence in the acceptability judgment task was presented to participants without any surrounding context, although the passive construction is more pragmatically marked than the active (Comrie, 1988). This setting might have caused participants to rate passive sentences as worse than their active counterparts even in the control verb classes.

Notably, although experiencer-theme verbs are often thought of as passivizable, they showed a significant difference in passive drop from agent-patient verbs. Conversely, despite being reported as unpassivizable in the literature, the advantage and ooze verb classes did not have significantly different passive drops than agent-patient verbs. Such nuances in judgments may be difficult to capture in binary judgments that rely on a single linguist’s introspection, and point to the value of crowdsourcing acceptability judgments for complex phenomena (Sprouse and Almeida, 2017), particularly if the judgments are robust across multiple participants, as we showed.

In summary, the human acceptability judgment experiment demonstrated that some verbs in the verb classes being tested are degraded in the passive voice, and that unacceptability was gradient between verbs. For a model to adequately approximate such behaviour, then, it must exhibit the following characteristics:

•

Verb class-level exceptionality: some verbs classes (e.g. duration verbs) exhibit passive drops that are significantly higher than the baseline passive drop expected of the canonically passivizable agent-patient verbs.
•

Verb-level exceptionality: Some verbs within a verb class (e.g. last and cost) also have passive drops that are significantly different from the passive drops of other verbs in the same class.
•

Gradience: (un)acceptability is gradient, with some verbs on average exhibiting higher passive drop than others.

4 Experiment 1B: Comparing language model and human judgments

In the previous section, we established that English speakers judge unpassivizability on a cline, rating some verbs such as cost as highly unpassivizable, and other verbs such as pushed as highly passivizable. In this section, we compare these judgments to those derived from a neural network language model trained to perform next word prediction on a corpus of 100M words — the order of magnitude of linguistic input available to English speakers by adolescence (Linzen, 2020; Warstadt et al., 2023).

What can we expect the results of this experiment to be? Some positive indications that exceptions can be learned from indirect evidence comes from Bayesian modeling. Exceptions to the dative alternation in English follow a similar pattern to passive exceptions: while many ditransitive verbs can occur in both the double object construction (e.g. Lucy gave Divya a bag) and with a prepositional dative (e.g. Lucy gave a bag to Divya), not all verbs can. Some ditransitive verbs can only occur in one construction (e.g. *Lucy donated Divya the car). Perfors et al. (2010) find that a hierarchical Bayesian model can learn to identify verbs that participate in this alternation after exposure to a subset of the CHILDES child-directed speech corpus (Brown, 1973; MacWhinney, 2000). This result suggests that learning exceptions from indirect evidence in the linguistic data is possible.

These findings might not, however, extend to the neural network language models used in contemporary language technologies. These networks may not have strong enough inductive biases to implement the inference procedure conducted by a Bayesian model, and as such might not be sufficiently sensitive to indirect evidence. Indeed, there is evidence that neural network models sometimes over-generalize, for instance by translating English idioms like “kick the bucket” compositionally instead of treating multi-word expressions as a unit (Dankers et al., 2022). It is thus possible that a language model may not be sensitive to which verbs humans think are exceptional in the passive. Even if neural network language models are sensitive to indirect evidence, their weaker biases may mean that they require super-human amounts of data to learn effectively — neural network models are less data-efficient learners than humans (Warstadt and Bowman, 2022) and are often trained on large datasets of text that are not plausible comparisons to human linguistic input. It is thus unclear how similar to humans a neural network will be if it is trained on a dataset that approximates the amount of access to the passive that a human might.

We obtained acceptability judgments from language models by querying the likelihood that the model assigns to the sequence of tokens that make up a sentence. If the model, which learned a distribution over its input text, was exposed to enough indirect evidence for passive exceptions in the linguistic input, it might be able to match human judgments on both passivizable verbs and unpassivizable verbs as well as the relative gradience of passive drop that humans display.

4.1 Model architecture

The models we trained are based on the transformer architecture (Vaswani et al., 2017) as implemented in GPT-2 (Radford et al., 2019), and in particular GPT-2 small, which has 117M parameters. We used the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 6e-4, and set a maximum input length of 512 tokens and a batch size of 16. The neural network training process is subject to stochasticity under which models might learn probability distributions that make different judgments on our test sentences. We thus trained five different models to verify the reliability of each model’s predictions. Each model was trained for 50 epochs with early stopping if validation loss did not decrease for three consecutive evaluation steps.

We adopted this architecture since there is substantial evidence that GPT-2 models can learn exceptions to passivization: OpenAI’s GPT-2 models produced judgments that correlated well with human acceptability judgments of passive exceptions (Leong and Linzen, 2023). GPT-2 is also sensitive to more general restrictions on English passives; Warstadt et al. (2020) show that GPT-2 gives lower scores to passive sentences containing intransitive verbs (e.g. Jeffrey’s sons are smiled by Tina’s supervisor) than sentences containing transitive verbs (e.g. Jeffrey’s sons are insulted by Tina’s supervisor, showing sensitivity to the fact that intransitive verbs cannot be passivized in English. Finally, GPT-2 also demonstrates sensitivity to other exceptions to general rules, performing well on targeted syntactic evaluations requiring sensitivity to argument structure, such as differentiating between verbs which do and do not participate in dative alternation (Hawkins et al., 2020). However, since we train our models on substantially fewer words than OpenAI’s GPT-2, as we discuss in the next section, it is an empirical question whether models trained on smaller datasets will behave similarly.

4.2 Training corpus

GPT-2 was trained on OpenAI’s proprietary WebText corpus, which contains 40GB of data from web content — approximately 8B words, assuming each word contains an average of 5 bytes (characters). By contrast, English-speaking children are exposed to 2–7M words per year (Gilkerson et al., 2017), or 26M–91M words by the age of 13. As our goal is to determine what can be learned from the data available to humans, we trained our models using a significantly smaller training corpus than Radford et al. (2019). Rounding to the nearest order of magnitude, we trained our models on a 100M word subset of the OpenWebText corpus (Gokaslan and Cohen, 2019) to simulate a more plausible model of the linguistic input a human may receive. The OpenWebText corpus is an open-source reproduction of the Web Text corpus and contains web text linked from Reddit with at least three upvotes. This selection method aims to choose a wide range of web text curated by humans.

4.3 Evaluation

We used the targeted syntactic evaluation paradigm (Linzen et al., 2016; Lau et al., 2017; Warstadt et al., 2019) to compare between our models and human acceptability judgments. Tasks in this paradigm involve obtaining model judgments on minimal pairs in an analogous fashion to our human subjects study. For each sentence in the test set, we obtained a sentence score by summing the log-probabilities assigned to each token in the sentence, which gives a measure of the likelihood the model assigns to that sentence occurring. The sentence with the higher probability of the two sentences in the pair is deemed to be more acceptable.

Before reporting on the model’s acceptability judgments on passivization, we looked at the model’s judgments on a wide range of basic phenomena as a benchmark of the overall reliability of our models sentence scores. We tested our models’ on BLiMP (Warstadt et al., 2020), a broad-coverage acceptability judgment dataset. BLiMP uses the minimal pair paradigm and evaluates a model’s judgments on syntactic and semantic phenomena ranging from subject-verb agreement to restrictions on the distribution of quantifiers like at least, and including, most pertinently, argument structure restrictions such as the unpassivizability of intransitive verbs.

We adapted the targeted syntactic evaluation paradigm in order to collect numeric scores from the model rather than binary acceptability judgments. After obtaining the scores for each sentence, we calculated the passive drop of each sentence pair by subtracting the active score from the passive score, which normalizes for the effects of lexical items in each sentence.

4.4 Results

General syntactic competence

Figure 3 shows our models’ performance on BLiMP compared with the accuracies of OpenAI’s GPT-2 model (trained on the 40GB WebText corpus) as well as an LSTM model trained on 83 million words from the English Wikipedia (Gulordava et al., 2018). Across a wide variety of syntactic and semantic phenomena, our models gave a higher sentence score to grammatical sentences than ungrammatical sentences an average of 76.71% of the time, suggesting that the sentence scores produced by our models are sensitive to syntactic and semantic constraints. Our models performed better than the LSTM model although both types of model were trained on approximately the same amount of data, and were only marginally worse than OpenAI’s GPT-2 despite being trained on 80 times less data.

Our models also demonstrated sensitivity to the restrictions on passivization that are tested in BLiMP. All five models were able to identify at above chance levels that intransitive verbs are less acceptable in the English passive than transitive verbs (passive 1 and passive 2 tests in Figure 3). They also showed a preference for animate subjects in passives (animate subject passive test in Figure 3). These results suggest that our models were able to glean some information about the environments in which the passive construction is acceptable from their training regime.

Exceptions to passivization

We next tested specifically whether our models are able to make human-like judgments on passive exceptions. Figure 4 graphs the models’ average passive drop for each verb against the passive drop observed in our human experiment. A Pearson correlation coefficient was computed to assess the linear relationship between mean human acceptability scores and mean model sentence scores for each item. We found a moderate-to-strong correlation between the two variables, $r(138)=$ 0.61, $p<$ 2.2e-16.

Our models were also able to capture humans’ judgments of exceptionality within verb classes: among verbs with similar meanings, the same verbs had high passive drops in human and model scores. For instance, earned and discharged had low passive drops in both human and model judgments compared to other verbs in their respective classes. Our models also predicted high passive drops for verbs like lasted, resembled and cost, aligning with human judgments that these verbs differ from other verbs in their class.

Finally, our models also largely matched human judgments of gradience: the sentence scores obtained from our models scale well to human judgments in most cases, predicting not only low and high passive drop, but intermediate levels of passive drop in verbs such as for the verbs took and required.

4.5 Discussion

Broadly, our neural network language models captured some aspects of human judgments of passivizability — particularly, the fact that judgments were gradient depending on the verb used, and that some verbs were less passivizable than other verbs in their verb class. That being said, compared to the split-half reliability measure of 0.92 between human participants, which we use as an upper bound for the amount of variance that can be accounted for in the human judgments, the correlation coefficient of 0.61 between our neural network language models’ judgments and human judgments suggests that our models are still some distance away from being accurate models of human acceptablity judgments on passivizability. For instance, our model systematically overpredicted the passive drop of experiencer-theme verbs and underpredicted the passive drop of advantage verbs. In fact, contrary to human judgments, the model found profit and strengthen more acceptable in the passive than in the active. These findings were unexpected and suggest that the model is sensitive to other information at the verb class level that may not have influenced human acceptability judgments.

Yet, our models nonetheless captured key qualitative aspects of gradient and exceptional judgments on passivization. This finding suggests that the models were sensitive to some evidence for passive exceptions that were present in their training data.

5 Experiment 2: Intervening on training data

In the previous sections, we showed that training a neural network language model on 100 million words of English text gives it sufficient evidence to produce judgments of passive exceptions that align substantially with those of humans. In the following sections, we turn to our second research question: which parts of the linguistic input did the model use as evidence to learn these patterns?

To answer this question, we took inspiration from work that has used controlled interventions on a model’s training dataset to make causal links between a model’s input and its behavior (Misra and Mahowald, 2024; Patil et al., 2024; Wei et al., 2021). These approaches use filters to withhold subsets of a model’s training dataset, and then compare models trained on the original dataset against models trained on the filtered dataset to ascertain the importance of the content that was filtered out. Corpus filters reduce the frequency of particular lexical items, as in Wei et al. 2021, or may attempt to completely remove specific linguistic cues (Patil et al., 2024; Misra and Mahowald, 2024). Corpus modifications are particularly effective at identifying which sources of indirect evidence a learner is using: Misra and Mahowald (2024) targeted a rare adjective-noun construction, and explored what indirect evidence a language model uses to learn the construction. They trained models on corpora where they withheld the construction and/or similar constructions that they identified as potential cues for that construction. These approaches illustrate how systematically filtering a training corpus can be used as an effective ablation technique.

In the following sections, we test two hypotheses proposed in the literature for human acquisition of passive exceptions which we call the entrenchment hypothesis (Braine and Brooks, 1995; Demuth, 2011; Theakston, 2004) and the lexical semantic hypothesis (Ambridge et al., 2016; Darmasetiyawan et al., 2022; Messenger et al., 2012; Pinker, 1989). Both hypotheses posit that learners use elements of the linguistic input as indirect signals for the extent to which a verb is passivizable. Thus, we tested these hypotheses by first altering or ablating elements of the corpus that are crucial under each hypothesis for learning to occur, such as the frequency or presence of particular constructions in the dataset. We then trained models on these modified corpora using the same training procedure as was used for our initial models and compare the behavior of the two sets of models. If models trained on a modified corpus consistently differ from models trained on the original corpus in their acceptability judgments on passive exceptions, then we can attribute the change in behavior to the particular intervention that we made in the training data.

Given a model whose behavior is human-like, we interpret what information the model is relying on to reach human-like performance. To the extent that the model is a reliable cognitive model of human language learning, our interventions make a case for the feasibility of learning human-like behavior through that mechanism.

6 Experiment 2A: Frequency significantly affects our models’ acceptability judgments

The first hypothesis we tested is the entrenchment hypothesis (Braine and Brooks, 1995; Regier and Gahl, 2004; Theakston, 2004). This hypothesis argues that learners conclude that a verb cannot appear in a particular context if that verb appears with substantial frequency in other contexts but never in the context in question. This hypothesis relies on a learner’s ability to track the distributional statistics of words in various environments in order to be sensitive to the indirect negative evidence of what constructions are available to a verb. The entrenchment hypothesis may be reliant on either the absolute frequency of an item or its relative frequency in two (or more) contexts.

The entrenchment account explains English-speaking children’s slowness to acquire passives as a result of the rarity of the passive construction in English-speaking children’s input (Brooks and Tomasello, 1999). Gordon and Chafetz (1990) found that children’s comprehension of specific verbs in the passive voice was correlated with how frequently those verbs appeared in the passive in a corpus study of child-directed speech in the CHILDES database. On a broader scale, Demuth and colleagues (Demuth, 1989, 2011; Demuth et al., 2010) study children who learn Sesotho, where the passive construction is common. They find that children readily comprehend and use the passive construction at the age of 2 years and 8 months. In contrast, English-speaking children hear the passive infrequently in child-directed speech — of 86,655 utterances directed at children from 1 year and 6 months to 5 years and 1 month, they find just four passive utterances that include a by-phrase. That English-speaking children do not consistently produce full passives until 4 or 5 years of age (Brooks and Tomasello, 1999) can be explained if children rely on exposure to a verb in the passive to learn its use.

To test this hypothesis, we performed targeted interventions on pairs of verbs: one highly-passivizable verb from the agent-patient class and one highly-unpassivizable verb from the duration class. We treated the highly-passivizable verb as the mutating verb and the highly-unpassivizable verb as the target verb. We decreased the frequency of passive sentences containing the mutating verb in the corpus to match the absolute frequency of the target verb in passive sentences. This intervention decreased both the absolute and relative frequency of the mutating verb in the passive, but did not affect the absolute frequency of the mutating verb in the active.

If our models used the frequency of a verb in the passive as a cue for passivizability, our intervention should cause the originally-passivizable mutating verb to decrease in passivizability. If the only cue to a verb’s passivizability is its frequency of occurrence in the passive, then the mutating verb would become just as unpassivizable as the target verb.

6.1 Procedure

We first used spaCy’s en_core_web_trf pipeline (Honnibal et al., 2020) to obtain dependency parses for each sentence in the training corpus. We then counted the number of times our mutating and target verbs were used in transitive active sentences and passive sentences. Sentences where the mutating or target verb had a dependency edge to a passive auxilliary (auxpass), a passive nominal subject (nsubjpass) or a passive clausal subject (csubjpass) were classified as passive sentences, while sentences with a direct object (dobj) or a clausal complement (ccomp) dependency edge from the mutating or target verb were classified as active. All other sentences were classified as an other sentence type. Figure 5 shows the relative frequencies of occurrence of each mutating and target verb in the active and passive in the training corpus.

The filtering mechanism we chose prioritized precision, potentially at the expense of coverage: we exclusively measured transitive verb occurrences since these sentences were the uses of the verb that could also potentially occur in the passive voice, resulting in a high number of sentences classified as other. For example, sentences such as (\nextx) were excluded from analysis for the verb drop: \pex\aRealtors believe home resales, which dropped in September, peaked in July and August. \aI apologize for that pun, but it’s definitely not worse than the ones Arnold Schwarzenegger drops. \xeConsistent with the entrenchment hypothesis, duration verbs tended to occur relatively less frequently in the passive than the active compared to agent-patient verbs. In fact, last occurred 170.75 times more often in the active than in the passive: it occurred 683 times in the active but just four times in the passive in the original training corpus. On the other hand, agent-patient verbs appeared in passive sentences relatively consistently in the corpus: drop occurred 3279 times in the active and 1146 times in the passive.

Using the corpus statistics and sentence categorization method illustrated above, we created modified corpora for each pair of mutating and target verb. In each dataset, we let the mutating verb appear in the passive only as many times as the target verb does. For example, we intervened to make the distribution of the mutating verb drop, which occurred 3279 times in transitive active sentences and 1146 times in the passive, similar to that of the target verb last, which occurred 683 times in transitive active sentences and four times in the passive. We randomly chose four occurrences of passive drop in the training corpus to keep, and removed all other passive occurrences of drop. Doing so inflates the verb’s relative frequency of occurrence in the active compared to the passive. Figure 6 illustrates the distribution of the training corpora before and after we performed this intervention — the frequencies of all other verbs remained the same, while the frequency of the source verb decreased in the passive and remained constant in the active.

We then trained five models on each corpus following the same training procedure outlined in Experiment 1B. After training these models on the modified corpora, we obtained acceptability judgments from these models and compared them against the judgments of the models we trained on the original corpus in Experiment 1B. If the frequency of a verb in the passive significantly affects that verb’s passivizability, then the mutating verb in a model trained on an modified corpus should be less passivizable relative to models trained on the original data set, while the passive drop of all unaltered verbs should remain the same.

6.1.1 Results

Test sentence pairs containing all four targeted agent-patient mutating verbs increased in passive drop when their frequency in the passive was reduced, whereas on average sentences containing all other (non-mutating) verbs showed no significant change in passive drop (Figure 7). Although the mutating verbs displayed highly variable degrees of passive drop when scores were averaged across items and models, we attributed a large part of this variance to within-items variation in scores across specific sentence frames. When scores for each sentence frame were compared separately, the variance in scores across models was much lower (Figure 8). This suggests that reducing the frequency with which a verb is seen in the passive consistently increased passive drop across all pairs of mutating and target verbs.

We tested the significance of the frequency intervention by fitting a linear mixed-effects model that models the relationship between passive drop and training corpus, using by-model random intercepts as well as by-verb random intercepts and slopes for training corpus. We compared this model with a full model that additionally included whether the verb is mutating in the corpus as a fixed effect. A likelihood-ratio test indicated that the full model provided a better fit for the data than the reduced model, $\chi^{2}(1)=265.65,p<0.01$ . This result suggests that our models’ judgments of a verb’s passivizability was significantly affected by how frequently the verb appeared in the training data in the passive.

Although intervening on passive frequency significantly increased every mutating verb’s passive drop, none of the mutating verbs became as unpassivizable as their target verbs: the average passive drop of the sentences containing the altered mutating verb remained lower than the average passive drop of sentences containing the target verb (Figure 9).

6.2 Discussion

In this experiment, we found that the frequency with which a verb occurs in the passive in the training corpus significantly affects how passivizable the verb is to a model trained on that corpus. This suggests that a verb’s frequency of occurrence in the passive is one source of evidence by which our models learn whether a verb is or is not passivizable.

If frequency of occurrence in the passive were the only driver of unpassivizability in our models, we would expect that each mutating verb would behave exactly like its target verb after our intervention. Contrary to this prediction, although we reduced the passive frequency of mutating verbs to be exactly the frequency of the passive of a target verb, the passive drop of these mutating verbs never increased to the level of unpassivizability of their respective target verbs. This result indicates that passive frequency, at least as operationalized in this experiment, is not the sole driver of unpassivizability in our target verbs.

If either the relative or the absolute frequency of passive occurrences of a verb directly modulated passivizability, then we might expect differences in the magnitude of change in passive drop depending on choice of target verb, since the three target verbs occur with varying frequency in the passive (see Figure 5). Models trained using last as the target verb would demonstrate the highest increase in passive drop since those models were trained on datasets where the mutating verb occurs the most infrequently in the passive. Instead, unexpectedly, each mutating verb showed increases in passive drop of similar magnitudes regardless of target verb. Taken together, these results suggest that frequency alone is unlikely to be the only factor mediating a verb’s passivizability in our models.

7 Experiment 2B: Lexical semantics does not significantly affect our models’ acceptability judgments

We next test the hypothesis that the exceptionality of certain verbs in the passive arises from a verb’s affectedness: whether or not the verb denotes an event where the by-object or implied argument (e.g. the boy in The apple was eaten by the boy) causes a change of state, location, or existence to (i.e. affects) the surface subject of the passive (e.g. the apple in The apple was eaten by the boy) (Ambridge et al., 2016; Darmasetiyawan et al., 2022; Messenger et al., 2012; Pinker, 1989). Supporting this hypothesis, Ambridge et al. (2016) obtained ratings of verbs on a series of proxies for affectedness (e.g. whether A is responsible or A is doing something to B in the sentence A likes B), and found a positive correlation between a verb’s prototypical affectedness and acceptability judgment scores of sentences where it is used in the passive. These results suggest that English speakers may use semantic criteria to determine whether a verb is acceptable in the passive.

To test the affectedness hypothesis, we performed targeted interventions on pairs of verbs: one highly-unpassivizable verb from the duration class and one highly-passivizable verb from the agent-patient class. We treated the highly-unpassivizable verb as the mutating verb and the highly-passivizable verb as the target verb. Since Transformer language models use distributed word representations (embeddings) that are based on the contexts in which the word appears, and whose dimensions are not interpretable, we cannot directly modulate the degree of affectedness of a verb. Instead, we nudged the semantics in a more or less affected direction by changing the contexts in which the mutating verb appeared. Since a verb’s embedding is dependent on its neighbors, doing so changed the verb’s embedding, and thus its meaning to the model. We made this change by placing the mutating verbs in active sentences that originally contained the target verb; this allows the mutating verb, which is in general highly unpassivizable, to co-occur in the active with the agent-like subjects and patient-like objects normally associated with the target verb. Crucially, we did not manipulate the passive sentences containing the mutating verb, or add new passive sentences to the corpus. If altering a verb’s arguments when it appears in the active voice, and therefore its affectedness, made a previously unpassivizable verb more easily passivizable, then we would expect a decrease in the passive drop of sentences containing the mutating verb when a model is trained on one of the modified corpora.

7.1 Procedure

For each pair of mutating and target verbs, we converted 30% of the active sentences in our training corpus containing the target verb lemma into sentences containing the mutating verb lemma (e.g. dropping $\rightarrow$ lasting; dropped $\rightarrow$ lasted). Keeping the target verb in the majority of cases allows us to compare the mutating verb against the target verb in models trained on the modified corpus. In addition to the sentences in which the mutating verb occurred originally, the mutating verb then co-occurred with some of the subjects and objects that previously co-occurred with target verb. We randomly chose which sentences to alter; these sentences were interspersed throughout the corpus. Examples of sentences that underwent intervention using the mutating verb last and target verb drop are given in (\nextx):

\pex\a

The BBC’s Geeta Pandey recently droppedlasted in to meet him. \aIf you don’t charge the attack, the arrow won’t fly very far and will droplast toward the ground. \xeSince alterations were performed on strings, the syntax of the original sentence was not a factor in this intervention: both intransitive and transitive sentences could be altered, unlike in Experiment 2A, where only transitive sentences were altered. Phrasal verbs and idiomatic expressions were also affected by this process, as illustrated in (\lastxa). Although the semantics of this new verb was not explicitly specified, the distribution of the arguments of the mutating verb changed through our intervention: the arguments of the mutating verb in the added sentences might be ones which co-occur with other agent-patient verbs, or ones that occur more frequently in the passive than the mutating verb’s original arguments.

After obtaining modified corpora for each pair of mutating and target verbs, five models were trained on each modified corpus and acceptability judgments were obtained from the models following the procedure outlined in Experiment 1B.

7.2 Results

Figure 10 reports the change in passive drop of mutating and non-mutating verbs as a result of altering the mutating verb’s arguments in the training data. Verbs that did not undergo any intervention, which we call non-mutating verbs, did not show any significant change, as expected. The mutating verbs did not all show changes in passive drop either.

We found a small decrease in passive drop for the mutating verb last, but no consistent change in passive drop for sentence pairs with the mutating verbs require and take.

We first determined if the effect of the intervention was significant, abstracting away from the specific verb by including the identity of the verb as a random effect in a linear mixed-effects model. Specifically, we modeled the relationship between passive drop and training corpus type, using by-model and by-verb frame random intercepts as well as by-verb random slopes and intercepts for whether the verb is mutating in the training data. We compared this model against a full model that additionally included whether the verb has been mutating in the training corpus as a fixed effect. A likelihood-ratio test indicated that the full model does not provide a better fit for the data than the reduced model, $\chi^{2}(1)=1.32,p=0.34$ . Thus, we found that altering the arguments a verb is seen with in the training dataset does not significantly affect the verb’s passivizability when verb-level differences are accounted for.

We next verified that the verb last behaved differently from require and take when mutated. We fit a linear mixed-effects model to model the passive drop of the duration verbs. We used training corpus, verb, whether the verb was mutating, as well as the interaction between each verb and its mutating status as main effects. We included random intercepts for frame and participant, as well as by-frame random slopes for verb. We compared this full model to a reduced model that excluded the interaction between verb and whether the verb was mutating. A likelihood-ratio test indicated that the full model provided a better fit for the data than the reduced model, $\chi^{2}(1)=44.4,p<0.001$ , suggesting that our intervention of mutating a verb’s arguments interacted with the specific verb being mutated. Specifically, whereas modifying the training corpus with last as the mutating verb led to an average decrease in passive drop of 3.37 points ( $p<0.001$ ), no significant change was observed from our modification when we used require ( $p$ =0.12) or take ( $p=0.11$ ) as the mutating verb.

7.3 Discussion

Changing the arguments which a mutating verb co-occurred with did not consistently affect its passivizability. Instead, while the mutating verb last showed a decrease in passive drop, the mutating verbs require and take showed no significant change in passive drop after our intervention. We attribute this verb-specific difference in behavior to the original distribution of the mutating verbs in the training data. The verbs require and take have relatively commonly-used senses that are more affected and are also compatible with the passive, illustrated in these sentences from the training corpus:

\pex\a

State University was required to take 150 rooms for at least six nights. \aThe scientists said that more research was needed. \xe

Meanwhile, the verb last does not have any commonly-used alternative senses that are compatible with the passive. Our intervention may thus changed the distribution of arguments associated with last, but not to take and require, since these verbs were already compatible with the passive in another sense.

Since neural networks use the same word embedding to represent all different senses of a single word, this finding could indicate that the neural network displays a ‘spillover’ of passivizability from passivizable senses to unpassivizable senses within a given verb. For example, the passivizability of take in (\lastxa) may have increased the passivizability of take in our test sentences even though our test sentences use a different sense of the word. One implication of this account is that the passivizable senses of took and require may be subject to the reverse effect: speakers may judge these senses to be less acceptable due to the presence of other unpassivizable senses. This finding raises the question of whether homonymy and polysemy affect how speakers judgments of passivizability.

8 General Discussion

Learners must make use of evidence in their input in order to learn exceptions to otherwise general rules. What kinds of evidence do they use to learn these exceptions? This question has motivated many hypotheses about the indirect evidence that a learner may use to learn implicit constraints. We used a neural network language model trained on approximately the amount of text that a human learner sees during childhood as a test bed to explore this question. We found that neural networks can learn to make judgments that approximate human acceptability judgments on exceptions to the English passive to a large extent, suggesting that sufficient evidence for passive exceptions was present in the data. By performing targeted interventions on the model’s training data, we also showed that the relative frequency with which a verb appears in the active and passive constructions in the training data provided significant indirect evidence for the model to learn exceptionality, while the lexical semantics of the exceptional verbs did not consistently have significant effects. These findings point to the learnability of exceptions to broad generalizations from indirect evidence.

8.1 Human judgments of exceptions to passivization

We used the English passive as a case study of exceptions to syntactic generalizations. As much of the existing literature on these exceptions relies on informally-collected binary acceptability judgments, we sought to verify these intuitions through an acceptability judgment task in Experiment 1. Instead, we found that although some verbs and classes of verbs are unpassivizable as reported in the literature, this intuition is not borne out for all verb classes reported as unpassivizable — specifically, verbs in the ooze and advantage classes were not significantly different from canonically passivizable verbs. We additionally showed that English speakers’ judgments of passive exceptions are more nuanced than binary acceptability judgments might suggest. Not all verbs which were reported to be unacceptable were equally unacceptable: although last and resembled are both reported as unacceptable, we find that there is a much larger average passive drop for the former.

The exceptions to passivization that we present in this paper complicate theories of passivization that assume passivizability to be a binary feature. The gradience of participants’ judgments on passive sentences provides a window into the complex interplay between syntax, semantics, and pragmatics that is invoked through the acceptability judgment task (Sprouse, 2015; Sprouse et al., 2016). We hope that future theoretical work engages with these findings critically.

8.2 Using neural networks as models of human learners

What benefits do neural network language models bring to the study of human language acquisition? Existing studies of how children acquire passive restrictions (e.g. Fox and Grodzinsky 1998; Gordon and Chafetz 1990; Maratsos 1985) are limited by the inability to exert full control over a child’s linguistic input: natural language experiments and corpus studies are limited not only by inherent variability in speakers’ input and the intractability of comprehensively tracking what input participants are exposed to, but also by the difficulty of isolating and manipulating features of potential interest (Culbertson and Schuler, 2019). It would be difficult, for instance, to ensure that a child never hears a specific verb in the passive voice, as we implemented in Experiment 2. To some extent these limitations can be addressed in human artificial language learning experiments, which target specific hypotheses of learning through controlled experimentation on constructed languages. However, the test languages in these experiments are necessarily simpler than natural languages — often using vocabularies of under 50 items — and participants are exposed to these languages in an experimental setting where they are actively engaged in learning, unlike in first-language acquisition scenarios (Ettlinger et al., 2016). These differences in paradigm and content may lead to meaningful differences in the outcome of learning.

Using neural networks as a theory of acquisition addresses these existing methodological lacunae in acquisition studies. Neural networks, unlike human learners, are trained on data which the researcher has full control over. Unlike the symbolic models sometimes used in language acquisition research, which are often simplified proof-of-concept systems, neural networks are broad-coverage models that can be trained on a corpus that is as close to the input to human children as possible (Warstadt et al., 2023; Vong et al., 2024). Researcher control is not limited to to the ability to intervene on the data, as we do in this paper: working with neural networks allows for the ability to probe a model’s internal processes to understand which mechanisms are vital to the model’s learning process and form hypotheses about how humans may learn (Baroni, 2022; Lakretz et al., 2021).

That being said, there are a number of methodological limitations to our approach. Firstly, the value of modeling is limited by the interpretability and cognitive plausibility of the models we use (Baroni, 2022). Without a clear understanding of the inductive biases of the particular neural network chosen for comparison, we cannot make a fair comparison between these models and our theories of human cognition. Although we highlighted some similarities between the GPT-2 architecture and human language learning and processing, this architecture is clearly not a perfect model for human language learning (for example, transformers’ working memory constraints are fundamentally different from those of humans; Armeni et al. 2022; Timkey and Linzen 2023), and care should be taken to make fair comparisons between the two.

Secondly, the methodology of intervening on a corpus is highly reliant on the tools available: without the ability to precisely and accurately parse and alter a corpus containing heterogenous data, it is difficult to ensure the feasibility and reliability of any intervention. For instance, the filters we used to make our interventions may have introduced new confounds in the training data. While we were interested in transplanting verbs into agent-patient environments in Experiment 2B, our intervention also targeted sentences containing idiomatic expressions and senses of the verb that were outside our scope. Refining the filtering process would allow us to test hypotheses that are more narrowly defined, and thus to make conclusions at a more granular level.

Finally, the computational cost of training models on each modified corpus is high. We trained a total of 125 models for Experiment 2, each requiring 2 days and using approximately 3e15 floating point operations (FLOP). These training regimes are highly computationally expensive and their potential environmental impacts should be considered. These limitations notwithstanding, we hope that the use of targeted interventions on naturalistic data can be used to compare the plausibility of hypotheses in situations where such interventions are not possible in human research.

8.3 Implications for human learners

To what extent do our results extend to human learners? We have shown through our frequency intervention (Experiment 2A) that models can leverage the relative frequency of the active and passive constructions in training data to learn exceptions; this finding is consistent with usage-based approaches to human language acquisition (Tomasello, 2000; Goldberg, 2006), and can be taken as an existence proof that a learner could demonstrate a human-like pattern through some degree of reliance on frequency statistics.

At the same time, humans and language models clearly differ in their learning mechanisms, goals, and the resources they have for learning. It is possible, then, that while our models’ acceptability judgments are largely similar to those of humans, they achieve such behavior via a very different developmental pathway. Indeed, since our neural networks learn the task of next word prediction by tracking word co-occurrences across a corpus, the fact that they are reliant on frequency statistics is unsurprising. Although human learners are also highly sensitive to statistical information in their linguistic input (Saffran et al., 1996; Thompson and Newport, 2007), they do not rely solely on statistical knowledge in learning: interactive and communicative social pressures also affect how people learn and use language. When a child produces an utterance that an adult finds unacceptable, the adult may repeat that utterance but produce a change in the erroneous portion which the child may then take up in their next utterance if the correction matches their intended meaning (Chouinard and Clark, 2003). Children can learn from these reformulations in order to further their communicative goals — goals which our models lack.

In addition to learning from language, humans can make use of their experiences with objects to learn the concepts of causality and affectedness, which are key abstractions in Experiment 2B. Our models were trained without access to sensorimotor experiences in the real world, and thus may lack these conceptual primitives (e.g. Lake et al. 2017). Having sensory grounding for the physical actions that correspond to verbs could make lexical semantics a more dominant factor in how humans generalize about passivization than our models. In the same vein, the fact that our lexical semantic intervention was not significant in Experiment 2B does not show conclusively that semantics has no role to play in the learning of passive exceptions. If Experiment 2B were conducted on a model with access to the conceptual primitives of cause and effect, that model may be able to leverage its existing sensitivity to cause and effect to learn sensitivity to affectedness, and thus use the correlational evidence provided by sensitivity to affectedness to learn passive exceptions.

In sum, while we have illustrated a potential pathway by which a neural network learner might learn the passivizability of a verb, the same learning mechanisms might not be at play in human learners. Repeating these experiments using models that differ in architecture, and in particular models that have access to interaction and/or causal primitives, may help to disentangle the extent to which these capacities are required to learn about passivizability.

8.4 Alternative hypotheses

Frequency of occurrence is likely not the only way in which our models learn a verb’s passivizability. In fact, neither of the hypotheses we tested fully accounts for how the model itself judges a sentence’s passivizability: our interventions in both Experiments 2A and 2B never made any mutating verb as passivizable as the corresponding target verb.

What could account for the gap between passivizable and unpassivizable verbs that is not accounted for by our interventions? One possible explanation for this result could be an interaction between frequency and lexical semantics that boosts the (un)passivizability of a verb: while we tested entrenchment and lexical semantics independently as sources of evidence for passive exceptions, our study does not account for potential super-additive interactions between these two hypotheses.

Both humans and language models may also rely on evidence other than construction frequency and lexical semantics to learn which verbs passivize. One potential such source of evidence comes from the existence of similar constructions to the long passive — such as a sentence like (\nextxa), which differs from the passive sentence (\nextxb) by just one word: \pex\aTwo hours were required for the meeting. \aTwo hours were required by the meeting. \xe

The existence of an alternation like (\lastx) could affect the acquisition of passive exceptions in two ways. During acquisition, if learners often hear (\lastxa) in contexts where they might otherwise expect (\lastxb) to be said, they may conclude that (\lastxb) is unacceptable through the process of statistical pre-emption (Boyd and Goldberg, 2011; Clark, 1987; Goldberg, 1995). Secondly, the existence of alternatives like (\lastxa) for some but not all verbs may also help to explain the gradience in acceptability that we see across verbs within a verb class in Experiment 1A through noisy channel processing (Gibson et al., 2013). If English speakers find (\lastxb) to be unacceptable but have access to an alternative like (\lastxa) that is acceptable, they may process (\lastxb) as a corruption of the acceptable (\lastxa), and thus judge it as more acceptable than a similar passive sentence for which there is no corresponding alternative. These hypotheses and their interactions can be explored through similar interventions on training data as we implement in this paper.

9 Conclusion

How can we explore how the outcome of learning is affected by exposure to linguistic input on the scale of years of language? In this paper, we studied how exceptions to passivization in English, which are rare and must be learned through indirect evidence, can be learned by a neural network language model. We used targeted changes to the training corpus of the neural network language models to test the causal links between input and the models’ behavior. We first tested whether a model can learn to match human acceptability judgments well on which verbs can and cannot be passivized (Experiment 1). We then made targeted changes to the training data of our model to measure the effects of frequency and lexical semantics, two factors that have been argued to be implicated in the learning of passives in humans, on the models’ learning of these patterns. We found that while changing cues to a verb’s frequency of occurrence in the passive significantly affects the models’ judgments of its passivizability (Experiment 2A), changing the arguments with which it appears does not (Experiment 2B). These findings illustrate a method for testing hypotheses about how large-scale linguistic input affects learning, as well as for raising new questions that can be tested on humans.

CRediT authorship contribution statement

Cara Su-Yi Leong: Conceptualization, Data curation, Formal analysis, Methodology, Investigation, Visualization, Writing – original draft, Writing - review & editing. Tal Linzen: Conceptualization, Formal analysis, Funding acquisition, Methodology, Writing – review & editing, Supervision.

Declaration of competing interest

The authors declare that there is no any commercial or associative interest that represents a conflict of interest in connection with the work submitted.

Acknowledgments

This material is based upon work supported by the National Science Foundation (NSF) under Grant No. BCS-2114505. We thank the audience of Society for Computation in Linguistics 2023 and members of the NYU Computation and Psycholinguistics Lab for comments. This work was supported in part through the NYU IT High Performance Computing resources, services, and staff expertise.

Data availability

The materials, acceptability judgment data, and analysis scripts are available at the following website: https://github.com/craaaa/exceptions

References

Ambridge et al. (2016) Ambridge, B., Bidgood, A., Pine, J.M., Rowland, C.F., Freudenthal, D., 2016. Is Passive Syntax Semantically Constrained? Evidence From Adult Grammaticality Judgment and Comprehension Studies. Cognitive Science 40, 1435–1459. doi:10.1111/cogs.12277.
Armeni et al. (2022) Armeni, K., Honey, C., Linzen, T., 2022. Characterizing verbatim short-term memory in neural language models, in: Fokkens, A., Srikumar, V. (Eds.), Proceedings of the 26th Conference on Computational Natural Language Learning (CoNLL), Association for Computational Linguistics, Abu Dhabi, United Arab Emirates (Hybrid). pp. 405–424. doi:10.18653/v1/2022.conll-1.28.
Bach (1980) Bach, E.W., 1980. In Defense of Passive. Linguistics and Philosophy 3, 297–341. arXiv:25001027.
Baker (1979) Baker, C.L., 1979. Syntactic Theory and the Projection Problem. Linguistic Inquiry 10, 533–581. arXiv:4178133.
Baroni (2022) Baroni, M., 2022. On the proper role of linguistically-oriented deep net analysis in linguistic theorizing. arXiv:2106.08694.
Beavers (2011) Beavers, J., 2011. On affectedness. Natural Language & Linguistic Theory 29, 335–370. arXiv:41475291.
Boyd and Goldberg (2011) Boyd, J.K., Goldberg, A.E., 2011. Learning What NOT to Say: The Role of Statistical Preemption and Categorization in A-Adjective Production. Language 87, 55–83. doi:10.1353/lan.2011.0012.
Braine and Brooks (1995) Braine, M.D.S., Brooks, P.J., 1995. Verb Argument Structure and the Problem of Avoiding an Overgeneral Grammar, in: Tomasello, M., Merriman, W.E. (Eds.), Beyond Names for Things: Young Children’s Acquisition of Verbs. L. Erlbaum, Hillsdale, N.J, pp. 353–376.
Brooks and Tomasello (1999) Brooks, P.J., Tomasello, M., 1999. Young children learn to produce passives with nonce verbs. Developmental Psychology 35, 29. doi:10.1037/0012-1649.35.1.29.
Brown (1973) Brown, R., 1973. A First Language: The Early Stages. Harvard University Press., Cambridge, MA.
Chouinard and Clark (2003) Chouinard, M.M., Clark, E.V., 2003. Adult reformulations of child errors as negative evidence. Journal of Child Language 30, 637–669. doi:10.1017/S0305000903005701.
Clark (1987) Clark, E.V., 1987. The principle of contrast: A constraint on language acquisition, in: Mechanisms of Language Aquisition. Lawrence Erlbaum Associates, Inc, Hillsdale, NJ, US, pp. 1–33.
Comrie (1988) Comrie, B., 1988. Passive and voice, in: Shibatani, M. (Ed.), Passive and Voice. John Benjamins Publishing Company. Typological Studies in Language, pp. 9–24. doi:10.1075/tsl.16.04com.
Comrie et al. (1977) Comrie, B., Cole, P., Sadock, J.M., 1977. In Defense of Spontaneous Demotion: The Impersonal Passive, in: Grammatical Relations. Brill. chapter Grammatical Relations, pp. 47–58. doi:10.1163/9789004368866_004.
Culbertson and Schuler (2019) Culbertson, J., Schuler, K., 2019. Artificial Language Learning in Children. Annual Review of Linguistics 5, 353–373. doi:10.1146/annurev-linguistics-011718-012329.
Dankers et al. (2022) Dankers, V., Lucas, C., Titov, I., 2022. Can Transformer be Too Compositional? Analysing Idiom Processing in Neural Machine Translation, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland. pp. 3608–3626. doi:10.18653/v1/2022.acl-long.252.
Darmasetiyawan et al. (2022) Darmasetiyawan, I.M.S., Messenger, K., Ambridge, B., 2022. Is Passive Priming Really Impervious to Verb Semantics? A High-Powered Replication of Messenger Et al. (2012). Collabra: Psychology 8, 31055. doi:10.1525/collabra.31055.
Demuth (1989) Demuth, K., 1989. Maturation and the Acquisition of the Sesotho Passive. Language 65, 56–80. doi:10.2307/414842, arXiv:414842.
Demuth (2011) Demuth, K., 2011. The role of frequency in language acquisition, in: The Role of Frequency in Language Acquisition. De Gruyter Mouton, pp. 383–388. doi:10.1515/9783110977905.383.
Demuth et al. (2010) Demuth, K., Moloi, F., Machobane, M., 2010. 3-Year-olds’ comprehension, production, and generalization of Sesotho passives. Cognition 115, 238–251. doi:10.1016/j.cognition.2009.12.015.
Ettlinger et al. (2016) Ettlinger, M., Morgan-Short, K., Faretta-Stutenberg, M., Wong, P.C., 2016. The Relationship Between Artificial and Second Language Learning. Cognitive Science 40, 822–847. doi:10.1111/cogs.12257.
Fox and Grodzinsky (1998) Fox, D., Grodzinsky, Y., 1998. Children’s Passive: A View from the By-Phrase. Linguistic Inquiry 29, 311–332. arXiv:4179020.
Gibson et al. (2013) Gibson, E., Piantadosi, S.T., Brink, K., Bergen, L., Lim, E., Saxe, R., 2013. A Noisy-Channel Account of Crosslinguistic Word-Order Variation. Psychological Science 24, 1079–1088. doi:10.1177/0956797612463705.
Gilkerson et al. (2017) Gilkerson, J., Richards, J.A., Warren, S.F., Montgomery, J.K., Greenwood, C.R., Kimbrough, O.D., Hansen, J.H.L., Paul, T.D., 2017. Mapping the Early Language Environment Using All-Day Recordings and Automated Analysis. American Journal of Speech-Language Pathology 26, 248–265. doi:10.1044/2016_AJSLP-15-0169.
Gokaslan and Cohen (2019) Gokaslan, A., Cohen, V., 2019. OpenWebText corpus.
Goldberg (1995) Goldberg, A.E., 1995. Constructions: A Construction Grammar Approach to Argument Structure. Cognitive Theory of Language and Culture Series, University of Chicago Press, Chicago, IL.
Goldberg (2006) Goldberg, A.E., 2006. Constructions at Work: The Nature of Generalization in Language. Oxford Linguistics. 1. publ ed., Oxford University Press, Oxford.
Gordon and Chafetz (1990) Gordon, P., Chafetz, J., 1990. Verb-based versus class-based accounts of actionality effects in children’s comprehension of passives. Cognition 36, 227–254. doi:10.1016/0010-0277(90)90058-R.
Gulordava et al. (2018) Gulordava, K., Bojanowski, P., Grave, E., Linzen, T., Baroni, M., 2018. Colorless Green Recurrent Networks Dream Hierarchically, in: Walker, M., Ji, H., Stent, A. (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, New Orleans, Louisiana. pp. 1195–1205. doi:10.18653/v1/N18-1108.
Hawkins et al. (2020) Hawkins, R., Yamakoshi, T., Griffiths, T., Goldberg, A., 2020. Investigating representations of verb bias in neural language models, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online. pp. 4653–4663. doi:10.18653/v1/2020.emnlp-main.376.
Honnibal et al. (2020) Honnibal, M., Montani, I., Van Landeghem, S., Boyd, A., 2020. spaCy: Industrial-strength natural language processing in python.
Keenan and Dryer (2007) Keenan, E.L., Dryer, M.S., 2007. Passive in the world’s languages, in: Shopen, T. (Ed.), Language Typology and Syntactic Description. 2 ed.. Cambridge University Press, pp. 325–361. doi:10.1017/CBO9780511619427.006.
Kingma and Ba (2015) Kingma, D.P., Ba, J., 2015. Adam: A Method for Stochastic Optimization, in: arXiv:1412.6980 [Cs]. arXiv:1412.6980.
Lake et al. (2017) Lake, B.M., Ullman, T.D., Tenenbaum, J.B., Gershman, S.J., 2017. Building machines that learn and think like people. Behavioral and Brain Sciences 40, e253. doi:10.1017/S0140525X16001837.
Lakretz et al. (2021) Lakretz, Y., Hupkes, D., Vergallito, A., Marelli, M., Baroni, M., Dehaene, S., 2021. Mechanisms for handling nested dependencies in neural-network language models and humans. Cognition 213, 104699. doi:10.1016/j.cognition.2021.104699.
Lau et al. (2017) Lau, J.H., Clark, A., Lappin, S., 2017. Grammaticality, Acceptability, and Probability: A Probabilistic View of Linguistic Knowledge. Cognitive Science 41, 1202–1241. doi:10.1111/cogs.12414.
Leong and Linzen (2023) Leong, C.S.Y., Linzen, T., 2023. Language Models Can Learn Exceptions to Syntactic Rules, in: Proceedings of the Society for Computation in Linguistics, University of Massachusetts Amherst, Amherst. doi:10.7275/H25Z-0Y75.
Levin (1993) Levin, B., 1993. English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press, Chicago.
Linzen (2020) Linzen, T., 2020. How Can We Accelerate Progress Towards Human-like Linguistic Generalization?, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online. pp. 5210–5217. doi:10.18653/v1/2020.acl-main.465.
Linzen et al. (2016) Linzen, T., Dupoux, E., Goldberg, Y., 2016. Assessing the Ability of LSTMs to Learn Syntax-Sensitive Dependencies. Transactions of the Association for Computational Linguistics 4, 521–535. doi:10.1162/tacl_a_00115.
MacWhinney (2000) MacWhinney, B., 2000. The CHILDES Project: Tools for Analyzing Talk. Third Edition. Lawrence Erlbaum Associates, Mahwah, NJ.
Maratsos (1985) Maratsos, M., 1985. Semantic restrictions on children’s passives. Cognition 19, 167–191. doi:10.1016/0010-0277(85)90017-4.
Marvin and Linzen (2018) Marvin, R., Linzen, T., 2018. Targeted Syntactic Evaluation of Language Models, in: Riloff, E., Chiang, D., Hockenmaier, J., Tsujii, J. (Eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Brussels, Belgium. pp. 1192–1202. doi:10.18653/v1/D18-1151.
Messenger et al. (2012) Messenger, K., Branigan, H.P., McLean, J.F., Sorace, A., 2012. Is young children’s passive syntax semantically constrained? Evidence from syntactic priming. Journal of Memory and Language 66, 568–587. doi:10.1016/j.jml.2012.03.008.
Misra and Mahowald (2024) Misra, K., Mahowald, K., 2024. Language Models Learn Rare Phenomena from Less Rare Phenomena: The Case of the Missing AANNs. arXiv:2403.19827.
Patil et al. (2024) Patil, A., Jumelet, J., Chiu, Y.Y., Lapastora, A., Shen, P., Wang, L., Willrich, C., Steinert-Threlkeld, S., 2024. Filtered Corpus Training (FiCT) Shows that Language Models can Generalize from Indirect Evidence. arXiv:2405.15750.
Perfors et al. (2010) Perfors, A., Tenenbaum, J.B., Wonnacott, E., 2010. Variability, negative evidence, and the acquisition of verb argument constructions. Journal of Child Language 37, 607–642. doi:10.1017/S0305000910000012.
Pinker (1989) Pinker, S., 1989. Learnability and Cognition: The Acquisition of Argument Structure. Learning, Development, and Conceptual Change. 1. paperback ed., 4. print ed., MIT Press, Cambridge, Mass.
Pinker et al. (1987) Pinker, S., Lebeaux, D.S., Frost, L.A., 1987. Productivity and constraints in the acquisition of the passive. Cognition 26, 195–267. doi:10.1016/S0010-0277(87)80001-X.
Postal (2004) Postal, P.M., 2004. Skeptical Linguistic Essays. Oxford University Press, Oxford ; New York.
Radford et al. (2019) Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., 2019. Language Models Are Unsupervised Multitask Learners. Technical Report. OpenAI.
Regier and Gahl (2004) Regier, T., Gahl, S., 2004. Learning the unlearnable: The role of missing evidence. Cognition 93, 147–155. doi:10.1016/j.cognition.2003.12.003.
Roland et al. (2007) Roland, D., Dick, F., Elman, J.L., 2007. Frequency of basic English grammatical structures: A corpus analysis. Journal of Memory and Language 57, 348–379. doi:10.1016/j.jml.2007.03.002.
Saffran et al. (1996) Saffran, J.R., Aslin, R.N., Newport, E.L., 1996. Statistical Learning by 8-Month-Old Infants. Science 274, 1926–1928. doi:10.1126/science.274.5294.1926.
Spearman (1910) Spearman, C., 1910. Correlation Calculated from Faulty Data. British Journal of Psychology, 1904-1920 3, 271–295. doi:10.1111/j.2044-8295.1910.tb00206.x.
Sprouse (2015) Sprouse, J., 2015. Three open questions in experimental syntax. Linguistics Vanguard 1, 89–100. doi:10.1515/lingvan-2014-1012.
Sprouse and Almeida (2017) Sprouse, J., Almeida, D., 2017. Design sensitivity and statistical power in acceptability judgment experiments. Glossa: a journal of general linguistics 2. doi:10.5334/gjgl.236.
Sprouse et al. (2016) Sprouse, J., Caponigro, I., Greco, C., Cecchetto, C., 2016. Experimental syntax and the variation of island effects in English and Italian. Natural Language & Linguistic Theory 34, 307–344. doi:10.1007/s11049-015-9286-8.
Theakston (2004) Theakston, A.L., 2004. The role of entrenchment in children’s and adults’ performance on grammaticality judgment tasks. Cognitive Development 19, 15–34. doi:10.1016/j.cogdev.2003.08.001.
Thompson and Newport (2007) Thompson, S.P., Newport, E.L., 2007. Statistical Learning of Syntax: The Role of Transitional Probability. Language Learning and Development 3, 1–42. doi:10.1207/s15473341lld0301_1.
Timkey and Linzen (2023) Timkey, W., Linzen, T., 2023. A language model with limited memory capacity captures interference in human sentence processing, in: Bouamor, H., Pino, J., Bali, K. (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore. pp. 8705–8720. doi:10.18653/v1/2023.findings-emnlp.582.
Tomasello (2000) Tomasello, M., 2000. Do young children have adult syntactic competence? Cognition 74, 209–253. doi:10.1016/S0010-0277(99)00069-4.
Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I., 2017. Attention is All you Need, in: Advances in Neural Information Processing Systems (NIPS 2017), Curran Associates, Inc.
Vong et al. (2024) Vong, W.K., Wang, W., Orhan, A.E., Lake, B.M., 2024. Grounded language acquisition through the eyes and ears of a single child. Science doi:10.1126/science.adi1374.
Warstadt and Bowman (2022) Warstadt, A., Bowman, S.R., 2022. What artificial neural networks can tell us about human language acquisition, in: Algebraic Structures in Natural Language. CRC Press, pp. 17–60.
Warstadt et al. (2023) Warstadt, A., Mueller, A., Choshen, L., Wilcox, E., Zhuang, C., Ciro, J., Mosquera, R., Paranjabe, B., Williams, A., Linzen, T., Cotterell, R., 2023. Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora, in: Warstadt, A., Mueller, A., Choshen, L., Wilcox, E., Zhuang, C., Ciro, J., Mosquera, R., Paranjabe, B., Williams, A., Linzen, T., Cotterell, R. (Eds.), Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning, Association for Computational Linguistics, Singapore. pp. 1–34. doi:10.18653/v1/2023.conll-babylm.1.
Warstadt et al. (2020) Warstadt, A., Parrish, A., Liu, H., Mohananey, A., Peng, W., Wang, S.F., Bowman, S.R., 2020. BLiMP: The Benchmark of Linguistic Minimal Pairs for English. Transactions of the Association for Computational Linguistics 8, 377–392. doi:10.1162/tacl_a_00321.
Warstadt et al. (2019) Warstadt, A., Singh, A., Bowman, S.R., 2019. Neural Network Acceptability Judgments. arXiv:1805.12471.
Wei et al. (2021) Wei, J., Garrette, D., Linzen, T., Pavlick, E., 2021. Frequency Effects on Syntactic Rule Learning in Transformers, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic. pp. 932–948. doi:10.18653/v1/2021.emnlp-main.72.
Zwicky (1987) Zwicky, A.M., 1987. Slashes in the passive. Linguistics 25. doi:10.1515/ling.1987.25.4.639.

Appendix A Stimuli

This section lists the materials for acceptability judgments.

A.1 Test sentences

Verb class	Sentence frame
Advantage	My donation many communities.
	Your actions your son.
	Our friendship our relationship.
	The gift my organization.
	The treaty both countries.
Price	Your dish ninety dollars.
	The painting 2000 dollars.
	My initiative some money.
	Your book thirty dollars.
	His actions the medal.
Ooze	my friend confidence.
	The lightbulb some light.
	My machine a sound.
	The teacher wisdom.
	The trash an odor.
Estimation	The caricature an actor.
	Your friend my brother.
	The sketch my design.
	Her son her father.
	The copy the original.
Duration	The journey three days.
	My meeting two hours.
	The surgery some time.
	Her speech seventeen minutes.
	His recovery a month.
hit	My brother hit your friend.
	A boy hit my bag.
	Your dog hit the toy.
	The child hit a monkey.
	The arrow hit the target.
kicked	My brother kicked your friend.
	A boy kicked my bag.
	Your dog kicked the toy.
	The child kicked a monkey.
	My friend kicked the wall.
carried	A boy carried my bag.
	My brother carried your friend.
	The dog carried the toy.
	Your mother carried the child.
	The donkey carried the load.
pushed	A boy pushed the cup.
	My brother pushed a child.
	A child pushed the bag.
	The mother pushed my toy.
	Your sister pushed your friend.
washed	A boy washed the cup.
	My brother washed my plate.
	A child washed the bag.
	The mother washed my toy.
	Your sister washed a towel.
dropped	A boy dropped the cup.
	My brother dropped my plate.
	A child dropped the bag.
	The mother dropped my toy.
	Your sister dropped a book.

A.2 Filler sentences

Type	Sentence
Acceptable	She was worried about the problem.
	Your knife needs to be sharpened.
	Her sister failed her test.
	The bank is located across the road.
	His mother thought that your friendship was strong.
	Attention check: select ’Completely acceptable’.
	The dog bit its owner.
	The girl was unexcited about the trip.
	Her father said that your recovery was quick.
	The meeting ended quickly.
	Your sister claimed that the machine broke.
	A woman sang beautifully.
	The ship was sunk by the enemy.
	The opportunity presented itself.
	His sister slept at my house.
	Your child played the game.
	My brother sold your friend a plate.
	It rained yesterday at noon.
	Attention check: select ’Completely acceptable’.
	Your job requires concentration.
	My mother read the child a book.
	The goldfish died alone.
	The monkey wanted to eat a banana.
	Glass bottles are very fragile.
Unacceptable	A bottle breaking last night.
	The company lent the employee.
	A cat met either mouse.
	My sister said a word all night.
	Your friend is walks home.
	On a book the floor sat.
	The driver handed the keys.
	An infant asleep.
	My friend liked your car at all.
	A doctor was give the dog a toy.
	A ball hit with great force.
	Puppy my bit hand your.
	The teacher bought for the students.
	Candlesticks a picnic.
Unacceptable	A student playing piano well.
	My bottle holds.
	Sat on the floor your sister.
	Her daughter will watches a movie.
	Attention check: select ’Completely unacceptable’.
	My key a cabinet.
	The boy saw anyone.
	The chicken killed.
	Snack this delicious taste.
	That wall are green.
	The car the light.
	Your friend lifted a finger to help.
	His friend is painted his grandmother a portrait.
	The child brought to school.
	The boy looked the picture.
	The cow are grazing in the field.
	The classroom silent.
	Any girls passed the test.
	My feelings were hurting by my brother.
	The class went to on Tuesday.
	The singer are practicing a song.
	The opportunity some wallpaper.
	There is every fly in my soup.
	Attention check: select ’Completely unacceptable’.
	The car driven.
	The doctor disliked last week.
	Box a opened the boy.
	This plates has been chipped.
	The bank will lend me.
	Your backpack heavy.
	Your mother bought any cups.
	The essay was wrote by a genius.

Appendix B Human acceptability judgment task instructions

In this experiment, you will rate English sentences based on how acceptable they sound to you. Try to answer based on your gut reaction, without analyzing the sentences. There are no ‘right’ or ‘wrong’ answers. The first two questions will be practice questions to familiarize you with the task.