
N-gram Parsing for Jointly Training a Discriminative Constituency Parser
Arda Çelebi and Arzucan Özgür
Department of Computer Engineering
Boğaziçi University
Bebek, 34342 İstanbul, Turkey
{arda.celebi, arzucan.ozgur}@boun.edu.tr

Abstract—Syntactic parsers are designed to detect the complete syntactic structure of grammatically correct sentences. In this paper, we introduce the concept of n-gram parsing, which corresponds to generating the constituency parse tree of n consecutive words in a sentence. We create a stand-alone n-gram parser derived from a baseline full discriminative constituency parser and analyze the characteristics of the generated n-gram trees for various values of n. Since the produced n-gram trees are in general smaller and less complex compared to full parse trees, it is likely that n-gram parsers are more robust compared to full parsers. Therefore, we use n-gram parsing to boost the accuracy of a full discriminative constituency parser in a hierarchical joint learning setup. Our results show that the full parser jointly trained with an n-gram parser performs statistically significantly better than our baseline full parser on the English Penn Treebank test corpus.

Index Terms—Constituency Parsing, n-gram Parsing, Discriminative Learning, Hierarchical Joint Learning

I. INTRODUCTION

Parsing a natural language sentence is the process of characterizing the syntactic description of that sentence based on the syntax of its language. Over the last half-century, many techniques have been developed to improve parsing accuracy. Some studies have targeted the model that the parser relies on, such as by replacing rule-based approaches [1], [2] with statistical models, both generative [3], [4] and discriminative [5], [6]. Others introduced external ways of boosting the parser, such as using a reranker [7], [8], bootstrapping the parser with itself in a self-training setup [9], or using partial parsing in a co-training setup [10]. Another recent thread of research concerns a more specialized form of the co-training approach, where multiple models from different domains are jointly trained and help each other do better. One example is [11], which introduces the Hierarchical Joint Learning (HJL) approach to jointly train a parser and a named entity recognizer. Their HJL model achieved substantial improvement in parsing and named entity recognition compared to the non-jointly trained models.

In this paper, we aim to improve the accuracy of a discriminative constituency parser by training it together with another parser in the HJL setup. While our actual parser works on complete sentences, its accompanying parser tackles the parsing task in a less complex way, that is, by parsing n-grams instead of complete sentences. To the best of our knowledge, this is the first study that introduces the concept of n-gram parsing. Even though syntactic parsers expect grammatically correct and complete sentences, an n-gram parser is designed to parse only n consecutive words in a sentence. The resulting n-gram tree is still a complete parse tree, but it covers only n words instead of the whole sentence. We derive our n-gram parser from a discriminative parser which was implemented based on [12]. After analyzing the characteristics of n-gram parsing, we train the full parser together with the n-gram parser. Our underlying hypothesis is that the n-gram parser will help the full parser in cases where the n-gram parser does better. We performed experiments with different n-gram sizes on the English Penn Treebank corpus [13] and obtained a statistically significant increase in the accuracy of the jointly trained full parser over the original (non-jointly trained) full parser.

This paper continues with the related studies. Following that, in Section 3, we introduce the concept of n-gram parsing and the characteristics of the n-gram trees. In Sections 4 and 5, we describe how we perform discriminative constituency parsing and how we use the HJL approach, respectively. Before discussing the experiments, we introduce the data and the evaluation methods that we used in Section 6. We then present the experimental results obtained with the n-gram parser alone and with the jointly trained parser. We conclude and outline future directions for research in the last section.

II. RELATED WORK

In this paper, we tackle the problem of improving the performance of a discriminative constituency parser by training it with an n-gram parser using hierarchical joint learning. Although generative models [3], [4] still dominate the constituency parsing area due to their faster training times, a number of discriminative parsing approaches have been proposed in recent years, motivated by the success of discriminative learning algorithms for several NLP tasks such as part-of-speech tagging and relation extraction. An advantage of discriminative models is their ability to incorporate richer feature representations. There are three different approaches for applying discriminative models to the parsing task. The first and perhaps the most successful one is to use a discriminative reranker to rerank the n-best list of a generative parser [5], [7],
[6]. To our knowledge, the forest-based reranker in [8] is the best performing reranker, helping its accompanying parser achieve an F1 score of 91.7%¹. The second approach considers parsing as a sequence of independent discriminative decisions [14], [15]. By discriminative training of a neural network based statistical parser, an F1 score of 90.1% is obtained in [15]. The third approach, which we adapted for this paper from [12], is to do joint inference by using dynamic programming algorithms in order to train models and use them to predict the globally best parse tree. With this, an F1 score of 90.9% is achieved in [12]. Due to their notoriously slow training times, however, most discriminative parsers run on short sentences. That is why we use sentences that have no more than 15 words for full sentence parsing in this study.

¹ The performance scores reported in this section are for section 23 of the English Penn Treebank.

One of the aspects of our research is that it involves working with multiple parsers at the same time, namely the n-gram parser and the full sentence parser. There have been a couple of studies that experimented with multiple parsers in the literature. One example is [16], which extended the concept of rerankers by combining k-best parse outputs from multiple parsers and achieved an F1 score of 92.6%. In [17], Fossum and Knight worked on multiple constituency parsers by proposing a method of parse hybridization that recombines context-free productions instead of constituents in order to preserve the structure of the output of the individual parsers to a greater extent. With this approach, their resulting parser achieved an F1 score of 91.5%, which is 1 point better than the best individual parser in their setup.

Another thread of research related to ours is jointly training multiple models. This evolved from the concept of multi-task learning, which is basically explained as solving multiple problems simultaneously. In the language processing literature, there have been a couple of studies where this concept is adapted for multi-domain learning [18], [19], [20]. In these studies, labelled data from multiple domains are used in order to improve the performance on all of them. In [11], for example, a discriminative constituency parser is jointly trained with a named-entity recognizer, and substantial gains over the original models are reported. Like most of the prior research, a derivative of the hierarchical model, namely the Hierarchical Joint Learning (HJL) approach, is used. In this paper, we adapted this approach by replacing the named-entity recognizer with an n-gram parser, which to our knowledge has not been attempted before in the literature.

The most distinguishing contribution of our research is the introduction of n-gram parsing. To the best of our knowledge, n-gram parsing has never been considered as a stand-alone parsing task in the literature before. One reason might be that n-gram trees have no particular use on their own. However, they have been used as features for statistical models in either lexicalized or unlexicalized forms. For example, in [9] they are used to train the reranking model of the self-training pipeline. There are only a couple of studies in the literature comparable to our notion of n-gram trees. One of them is the stochastic tree substitution grammars (TSG) used in Data Oriented Parsing (DOP) models in [21]. However, unlike TSG trees, our n-gram trees always have words at the terminal nodes. Another related concept is the Tree Adjunct Grammar (TAG) and the concept of local trees proposed in [22]. As in the case of TSG trees, TAG local trees also differ from our n-gram trees by not having words at all terminal nodes but one. Another related study was performed in [23], where significant error reductions in parsing are achieved by using n-gram word sequences obtained from the Web.

III. N-GRAM PARSING

In the literature, the concept of n-grams is used in a number of contexts to represent a group of n consecutive items. These can be, for instance, characters in a string or words in a sentence. In our research, we consider an n-gram as n consecutive words selected from a sentence. n-gram parsing, then, refers to the process of predicting the syntactic structure that covers these n words. We call this structure an n-gram tree. In this paper, we study the parsing of 3- to 9-grams in order to observe how n-gram parsing differs with length and at which lengths the n-gram parser helps the full parser more. A sample 4-gram tree extracted from a complete parse tree is shown in Figure 1. Compared to complete parse trees, n-gram trees are smaller parse trees with one distinction. That is, they may include constituents that are trimmed from one or both sides in order to fit the constituent lengthwise within the borders of the n-gram. We call such constituents incomplete, and denote them with the -INC functional tag. n-gram parsing is fundamentally no different from the conventional parsing of a complete sentence. However, n-grams, especially the short ones, may have no meaning on their own and/or can be ambiguous due to the absence of the surrounding context. Even though the relatively smaller size of n-gram trees makes it easier and faster to train on them, their incomplete and ambiguous nature makes the n-gram parsing task difficult. Despite all this, n-gram parsing can still be useful for the actual full sentence parser, just like the partial parsing of a sentence used for bootstrapping [10]. In this paper, we train the full parser together with an n-gram parser and let the n-gram parser help the full parser in areas where the n-gram parser is better than the full one.

A. N-gram Tree Extraction Algorithm

In order to generate a training set for the n-gram parser, we extract n-gram trees out of the complete parse trees of the Penn Treebank, which is the standard parse tree corpus used in the literature [13]. Since we use sentences with no more than 15 words (WSJ15) for complete sentence parsing, we use the rest of the corpus (WSJOver15) for n-gram tree extraction.

The pseudocode of our n-gram tree extraction algorithm is given in Fig. 2. It takes a complete parse tree as input and returns all the extracted valid n-gram trees. It starts by traversing the sentence from the first word and preserves the minimum subtree that covers n consecutive words. While doing that, it may trim the tree from one or both sides in order to fit the constituents lengthwise within the borders of the n-gram. Hence, the extracted n-gram trees may contain incomplete and thus ungrammatical constituents, which is not something that a conventional parser expects as input.
Fig. 1. Sample 4-gram tree extracted from a complete parse tree.

Require: n, width of the n-gram trees
Require: tree, parse tree of a sentence
  len ← length of the given sentence
  i ← 0
  while i < len − n do
    subtree ← get subtree that covers the [i, i + n] span
    trimmed ← trim subtree's constituents outside the [i, i + n] span, if any
    if trimmed has any constituent with no head child then
      i++ and continue
    end if
    markedtree ← mark all trimmed constituents as incomplete
    filtered ← filter out incomplete unary rule chains from the ROOT, if any
    save filtered tree as an n-gram tree
    i++
  end while
Fig. 2. Extracting and storing n-gram trees from a parse tree.

Nevertheless, we assume that not all of the trimmed constituents are ungrammatical, according to the concept of generatively accurate constituents that we introduce in this paper. This concept stems from the head-driven constituency production process [3], where a constituent is theoretically generated starting from its head child and continuing towards the left and right until all children are generated. If the head child is the origin of the production, then it is safe to say that it defines the constituent. Therefore, our n-gram tree extraction process makes sure that the head child is still included in the trimmed constituents. Otherwise, the whole n-gram tree is considered generatively inaccurate and is thus discarded. If all the heads are preserved, the algorithm marks the trimmed constituents with the -INC functional tag. For example, the PP constituent of the 4-gram tree in Figure 1 is trimmed from the right hand side, and since the head child "IN" is still included in the constituent, it is considered generatively accurate. The corresponding constituent is marked with an -INC tag and the extracted tree is stored as a valid 4-gram tree. However, if we tried to extract the next 4-gram in the same sentence, it would fail due to not being able to keep the head of the rightmost NP. The -INC tags are later used in the features in order to let the n-gram parser better predict such incomplete cases. Following these steps, the extraction process filters out the incomplete chains of unary rules that can be reached from the ROOT constituent. The algorithm also keeps the parent of the ROOT constituent of the n-gram tree as additional context information.
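To make the procedure in Fig. 2 concrete, a minimal Python sketch of the extraction loop is given below. It is only an illustration of the idea, not the implementation used in this work: the Node class, the single toy head rule (first child as head), and the way -INC marking and rejection are handled are simplifying assumptions, whereas the actual extractor relies on proper head-finding rules and the additional ROOT-related filtering described above.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None                    # set only for terminal (word) nodes

    def leaves(self) -> List["Node"]:
        return [self] if self.word is not None else [w for c in self.children for w in c.leaves()]

# Toy head rule: take the first child as the head of every constituent.
# A real implementation would use Collins-style head-finding rules [3].
HEAD_INDEX = 0

def minimal_cover(node: Node, span_ids) -> Node:
    # Smallest constituent whose leaves include every word of the n-gram span.
    for child in node.children:
        if span_ids <= {id(w) for w in child.leaves()}:
            return minimal_cover(child, span_ids)
    return node

def trim(node: Node, span_ids) -> Optional[Node]:
    # Copy `node`, dropping material outside the span; constituents that lose
    # children are marked with -INC, and losing a head child aborts the n-gram.
    if node.word is not None:
        return Node(node.label, word=node.word) if id(node) in span_ids else None
    kept_all = [trim(c, span_ids) for c in node.children]
    if all(c is None for c in kept_all):
        return None
    kept = [c for c in kept_all if c is not None]
    label = node.label
    if len(kept) < len(node.children):
        if kept_all[HEAD_INDEX] is None:          # head child trimmed away
            raise ValueError("generatively inaccurate n-gram tree")
        label += "-INC"
    return Node(label, kept)

def extract_ngram_trees(tree: Node, n: int) -> List[Node]:
    # Slide an n-word window over the sentence, as in the loop of Fig. 2.
    leaves, results = tree.leaves(), []
    for i in range(len(leaves) - n + 1):
        span_ids = {id(w) for w in leaves[i:i + n]}
        try:
            trimmed = trim(minimal_cover(tree, span_ids), span_ids)
        except ValueError:
            continue                              # a trimmed constituent lost its head
        if trimmed is not None:
            results.append(trimmed)
    return results

# Example: 4-grams of "the sale rose in July".
sentence_tree = Node("S", [
    Node("NP", [Node("DT", word="the"), Node("NN", word="sale")]),
    Node("VP", [Node("VBD", word="rose"),
                Node("PP", [Node("IN", word="in"),
                            Node("NP", [Node("NNP", word="July")])])])])
print(extract_ngram_trees(sentence_tree, 4))

Running the example keeps the first 4-gram of the toy sentence as a valid tree containing a PP-INC constituent and rejects the second one because a head child is lost, mirroring the discussion of Figure 1 above.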
B. Characteristics of n-gram Trees

As we apply the extraction algorithm to the WSJOver15 portion of the Penn Treebank, we get hundreds of thousands of n-gram trees for each value of n in {3..9}. The analysis of these data sets reveals interesting points about the characteristics of such n-gram trees. Table I gives the percentages of the most common constituents in each n-gram training set along with the corresponding numbers obtained from the complete parse trees of the WSJ15 portion, which we use for training our full parser. The comparison indicates that the percentages of noun (NP), verb (VP), and prepositional (PP) phrases in the n-gram trees are higher than the ones in the complete parse trees. On the other hand, the percentages of long-range constituents like S are lower for the n-gram trees, which is expected as the extraction process disfavors such constituents. Nonetheless, we see a higher percentage for another long-range constituent, SBAR, which exemplifies how the extraction process may still favor some long-range constituents. Based on this analysis, we may postulate that the increasing percentage of NPs, VPs, and PPs per parse tree may help the n-gram parser do a better job, in addition to the fact that they are smaller and thus less complex phrases.
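The figures in Table I can be recomputed with a simple tally over the extracted trees. The sketch below reuses the hypothetical Node class and extract_ngram_trees function from the previous listing; whether pre-terminal POS nodes are counted and how -INC variants are folded into their base labels are assumptions, since the paper does not spell out these details.

from collections import Counter

def constituent_percentages(trees, labels=("NP", "VP", "PP", "ADJP", "ADVP", "S", "SBAR", "QP")):
    # Share of each non-terminal label among all non-terminal nodes; the -INC
    # suffix is stripped and pre-terminal (POS) nodes are not counted, which is
    # one possible reading of how Table I was computed.
    counts, total = Counter(), 0
    for tree in trees:
        stack = [tree]
        while stack:
            node = stack.pop()
            if node.word is None:                 # non-terminal constituent
                counts[node.label.replace("-INC", "")] += 1
                total += 1
                stack.extend(node.children)
    return {label: round(100.0 * counts[label] / total, 2) for label in labels}

# e.g. constituent_percentages(extract_ngram_trees(sentence_tree, 4))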
IV. DISCRIMINATIVE CONSTITUENCY PARSING

In order to parse the n-grams and the complete sentences, we implemented a feature-rich discriminative constituency parser based on the work in [12]. It employs a discriminative model based on the Conditional Random Field (CRF) approach. Discriminative models for parsing maximize the conditional likelihood of the parse tree given the sentence. The conditional probability of the parse tree is calculated as in Equation 1, where Zs is the global normalization function and φ(r|s;θ) is the local clique potential.

P(t|s;\theta) = \frac{1}{Z_s} \prod_{r \in t} \phi(r|s;\theta)    (1)
TABLE I
PERCENTAGE OF THE MOST COMMON NON-TERMINAL CONSTITUENTS IN TRAINING SETS
Model NP VP PP ADJP ADVP S SBAR QP
3-gram 21.20 10.82 6.25 1.04 1.03 4.68 1.43 0.96
4-gram 20.69 10.87 6.44 1.05 1.07 4.93 1.64 0.85
5-gram 20.39 10.87 6.45 1.06 1.10 5.11 1.75 0.80
6-gram 20.11 10.81 6.49 1.04 1.10 5.21 1.84 0.79
7-gram 19.86 10.78 6.49 1.02 1.11 5.29 1.92 0.78
8-gram 19.68 10.74 6.50 1.01 1.11 5.35 1.99 0.77
9-gram 19.54 10.68 6.50 1.00 1.10 5.39 2.04 0.76
Full (WSJ15) 18.74 9.51 4.21 1.05 1.69 6.91 1.00 0.60

The normalization function Zs and the clique potential φ(r|s;θ) are given by Equations 2 and 3.

Z_s = \sum_{t' \in \tau(s)} \prod_{r \in t'} \phi(r|s;\theta)    (2)

\phi(r|s;\theta) = \exp \sum_i \theta_i f_i(r,s)    (3)

The probability of the parse tree t given the sentence s is the product of the local clique potentials for each one-level subtree r of the parse tree t, normalized by the total local clique potential over all possible parse trees defined by τ(s). Note that the clique potentials are not probabilities. They are computed by taking the exponent of the summation of the parameter values θ for the features that are present for a given subtree r. The function f_i(r,s) returns 1 or 0 depending on the presence or absence of feature i in r, respectively. Given a set of training examples, the goal is to choose the parameter values θ such that the conditional log likelihood of these examples, i.e., the objective function L given in Equation 4, is maximized.

L(D;\theta) = \sum_{(t,s) \in D} \left( \sum_{r \in t} \langle f(r,s), \theta \rangle - \log Z_{s,\theta} \right) - \sum_i \frac{\theta_i^2}{2\sigma^2}    (4)

When the partial derivative of our objective function with respect to the model parameters is taken, the resulting gradient in Equation 5 is basically the difference between the empirical counts and the model expectations, along with the derivative of the L2 regularization term to prevent over-fitting. These partial derivatives, which are calculated with the inside-outside algorithm by traversing all possible parse trees for a given sentence, are then used to update the parameter values at each iteration. As in [12], we use stochastic gradient descent (SGD), which updates the parameters with a batch of training instances instead of all of them in order to converge to the optimum parameter values faster.

\frac{\partial L}{\partial \theta_i} = \sum_{(t,s) \in D} \left( \sum_{r \in t} f_i(r,s) - E_\theta[f_i|s] \right) - \frac{\theta_i}{\sigma^2}    (5)
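To make Equations 1-5 concrete, the sketch below scores a toy set of candidate trees with clique potentials and performs penalized gradient steps. It is purely illustrative: the candidate set τ(s) is enumerated explicitly instead of being handled with the inside-outside algorithm over a packed chart, and the feature function is a made-up stand-in for the feature templates of [12].

import math
from collections import Counter

# One-level subtrees ("rules") of a candidate tree are written as
# (parent_label, (child_label, ...), (start, end)) tuples; features are strings.
def rule_features(rule, sentence):
    parent, children, span = rule
    return [f"RULE={parent}->{'_'.join(children)}",
            f"PARENT={parent}",
            f"FIRSTWORD={parent}_{sentence[span[0]]}"]

def tree_log_potential(tree, sentence, theta):
    # log phi summed over the one-level subtrees r of t (Eq. 3).
    return sum(theta.get(f, 0.0) for r in tree for f in rule_features(r, sentence))

def log_prob(tree, candidates, sentence, theta):
    # Eq. 1-2 with the sum over tau(s) replaced by an explicit candidate list.
    log_phis = [tree_log_potential(t, sentence, theta) for t in candidates]
    log_z = math.log(sum(math.exp(lp) for lp in log_phis))
    return tree_log_potential(tree, sentence, theta) - log_z

def sgd_step(gold, candidates, sentence, theta, eta=0.1, sigma2=0.1):
    # One stochastic gradient step on the penalized objective of Eq. 4:
    # empirical counts minus model expectations minus the L2 term (Eq. 5).
    empirical = Counter(f for r in gold for f in rule_features(r, sentence))
    expected = Counter()
    for t in candidates:
        p = math.exp(log_prob(t, candidates, sentence, theta))
        for r in t:
            for f in rule_features(r, sentence):
                expected[f] += p
    for f in set(empirical) | set(expected) | set(theta):
        grad = empirical[f] - expected[f] - theta.get(f, 0.0) / sigma2
        theta[f] = theta.get(f, 0.0) + eta * grad
    return theta

# Toy usage: two candidate analyses of a 3-word n-gram.
sentence = ["in", "the", "park"]
t_gold = [("PP", ("IN", "NP"), (0, 3)), ("NP", ("DT", "NN"), (1, 3))]
t_alt = [("NP-INC", ("IN", "DT", "NN"), (0, 3))]
theta = {}
for _ in range(5):
    sgd_step(t_gold, [t_gold, t_alt], sentence, theta)
print(round(math.exp(log_prob(t_gold, [t_gold, t_alt], sentence, theta)), 3))

After a few updates the gold analysis receives most of the probability mass, which is all the example is meant to show.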
We use the same feature templates as [12] and the same tool [24] to calculate the distributional similarity clusters that are used in the feature definitions. However, we use a different combination of corpora to calculate these clusters. We gathered an unlabelled data set of over 280 million words by combining the Reuters RCV1 corpus [25], the Penn Treebank, and a large set of newswire articles downloaded over the Internet. Despite the difference, we tried to keep the size and the types of contents comparable to [12]. We use the default parameter settings for the tool provided by [24] and set the number of clusters to 200. In order to handle out-of-vocabulary (OOV) words better, we also introduce a new lexicon feature template ⟨prefix, suffix, base(tag)⟩, which makes use of the most common English prefixes and suffixes. A feature is created by putting together the prefix and suffix of a word, if available, along with the base tag of that word. If the word has no such prefix or suffix, NA is used instead. As for n-gram parsing, we did not include or exclude any features. Like [12], we also implemented chart caching and parallelization in order to save time.
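A minimal sketch of how such a ⟨prefix, suffix, base(tag)⟩ feature could be assembled is shown below. The affix inventories are small illustrative samples rather than the actual lists of common English prefixes and suffixes used by the parser, and the base_tag convention is likewise an assumption.

# Illustrative prefix/suffix inventories; the real lists of common English
# affixes used by the parser are not reproduced here.
PREFIXES = ("un", "re", "dis", "over", "anti", "pre")
SUFFIXES = ("ing", "ed", "tion", "ness", "ly", "able", "s")

def base_tag(tag: str) -> str:
    # Assumed convention: strip functional tags, e.g. "NP-INC" -> "NP".
    return tag.split("-")[0]

def oov_lexicon_feature(word: str, tag: str) -> str:
    # Build the <prefix, suffix, base(tag)> lexicon feature for a word,
    # falling back to NA when no known affix is found.
    prefix = next((p for p in PREFIXES if word.lower().startswith(p)), "NA")
    suffix = next((s for s in SUFFIXES if word.lower().endswith(s)), "NA")
    return f"PRE={prefix}|SUF={suffix}|TAG={base_tag(tag)}"

# e.g. oov_lexicon_feature("underselling", "VBG") -> "PRE=un|SUF=ing|TAG=VBG"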
V. HIERARCHICAL JOINT LEARNING

In this section, we show how we jointly train the n-gram parser and the full parser. We use an instance of the multi-task learning setup called the Hierarchical Joint Learning (HJL) approach introduced in [11]. HJL enables multiple models to learn more about their tasks due to the commonality among the tasks. By using HJL, we expect the n-gram parser to help the full parser in cases where the n-gram parser is better.

As described in [11], the HJL setup connects all the base models with a top prior, which is set to a zero-mean Gaussian in our experiments. The only requirement for HJL is that the base models need to have some common features in addition to the set of features specific to each task. As both parsers employ the same set of feature templates, they have common features through which HJL can operate. All the shared parameters between base models are connected to each other through this prior. It keeps the values of the shared parameters from each base model close to each other by keeping them close to itself. The parameter values for the shared features are updated by incorporating the top model into the parameter update function as in Equation 6. While the first term is calculated by using the update value from Equation 5, the second term ensures that the base model m does not drift away from the top model, by taking the difference between the base model parameter and the corresponding shared top model parameter value. The variance σd² is a parameter to tune this relation.

\frac{\partial L_{hier}(D;\theta)}{\partial \theta_{m,i}} = \frac{\partial L_{hier}(D_m, \theta_m)}{\partial \theta_{m,i}} - \frac{\theta_{m,i} - \theta_{*,i}}{\sigma_d^2}    (6)
As shown in Equation 7, the updates for the top model parameter values are calculated by summing the parameter value differences divided by the base model variance σm², and then subtracting the regularization term to prevent over-fitting.

\frac{\partial L_{hier}(D;\theta)}{\partial \theta_{*,i}} = \left( \sum_{m \in M} \frac{\theta_{m,i} - \theta_{*,i}}{\sigma_m^2} \right) - \frac{\theta_{*,i}}{\sigma_*^2}    (7)

As in the case of the discriminative parser described in the previous section, SGD is used for faster parameter optimization. At each epoch of SGD, a batch of training instances is selected uniformly from each model in the setup. The number of training instances coming from each set, hence, depends on the relative sizes of the training sets.
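One possible way to realize the updates of Equations 6 and 7 on top of per-model data gradients is sketched below. The data gradient is treated as a black box (it would come from Equation 5 for each parser), and the pooled batch sampling only illustrates the proportional contribution of the training sets; neither is taken from the authors' code.

import random

def hjl_step(models, top, grads, eta=0.1, sigma_d2=0.1, sigma_m2=0.1, sigma_top2=0.1):
    # models : dict name -> dict of base-model parameters (theta_m)
    # top    : dict of shared top-level parameters (theta_*)
    # grads  : dict name -> dict, the data gradient of each base model for its batch
    # Base-model update: data gradient plus the pull towards the top model (Eq. 6).
    for name, theta_m in models.items():
        for f in set(grads[name]) | set(theta_m):
            pull = (theta_m.get(f, 0.0) - top.get(f, 0.0)) / sigma_d2
            theta_m[f] = theta_m.get(f, 0.0) + eta * (grads[name].get(f, 0.0) - pull)
    # Top-model update: pull towards the base models plus its own regularizer (Eq. 7).
    shared = set(top).union(*(set(m) for m in models.values()))
    for f in shared:
        pull_sum = sum((m.get(f, 0.0) - top.get(f, 0.0)) / sigma_m2 for m in models.values())
        top[f] = top.get(f, 0.0) + eta * (pull_sum - top.get(f, 0.0) / sigma_top2)

def sample_batch(training_sets, batch_size=40):
    # One batch drawn uniformly from the pooled training sets, so each model
    # contributes in proportion to the size of its training set.
    pooled = [(name, x) for name, data in training_sets.items() for x in data]
    return random.sample(pooled, min(batch_size, len(pooled)))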
VI. EXPERIMENTAL SETUP

A. Data

We evaluate our models by using the English Penn Treebank corpus [13]. Like previous studies in this field, we use sections 02-21 for training, 22 for development, and 23 for testing. For complete sentence parsing, we use only the sentences that have no more than 15 words, that is, WSJ15. To train the n-gram parsers, on the other hand, we use the rest of the Penn Treebank, which we call WSJOver15. To test the n-gram parsers, we use the n-gram trees extracted from the development and the test sets of WSJ15 in order to make the results more comparable with the full parser. By using our n-gram tree extraction algorithm, we extract n-gram trees for n=[3,9]. Table II gives the number of parse trees in the training, development and test sets of each parser.

TABLE II
NUMBER OF PARSE TREES FOR EACH PARSER.
Model    Training Set    Dev. Set    Test Set
3-gram   384,699         1,742       2,495
4-gram   318,819         1,341       1,916
5-gram   267,155         1,050       1,506
6-gram   227,505         807         1,158
7-gram   195,229         635         891
8-gram   168,075         486         667
9-gram   145,040         349         491
Full     9,753           421         603
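The corpus split described above can be sketched as follows. How the treebank trees are loaded and how their WSJ section numbers are obtained is left open; the helper assumes only that each tree object can report its leaves.

def split_corpus(trees_by_section, max_len=15):
    # Split the Penn Treebank into the WSJ15 train/dev/test sets used for
    # full-sentence parsing and the WSJOver15 pool used for n-gram extraction.
    # trees_by_section maps a WSJ section number (int) to a list of parse trees.
    splits = {"train": [], "dev": [], "test": []}
    wsj_over15 = []
    for section, trees in trees_by_section.items():
        for tree in trees:
            if len(tree.leaves()) > max_len:
                wsj_over15.append(tree)           # used only for n-gram tree extraction
            elif 2 <= section <= 21:
                splits["train"].append(tree)
            elif section == 22:
                splits["dev"].append(tree)
            elif section == 23:
                splits["test"].append(tree)
    return splits, wsj_over15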
B. Evaluation

We use the evalb script² to get labelled precision, recall, and F1 scores. These are calculated based on the number of nonterminals in the parser's output that match those in the standard/golden parse trees. We also report the percentage of completely correct trees, the average number of brackets crossing the actual correct spans, and the percentage of guessed trees that have no crossing brackets with respect to the corresponding gold tree. In order to better understand the n-gram parser and the jointly trained parser, we also evaluate how accurately these parsers handle different types of constituents.

² The evalb script is available at http://nlp.cs.nyu.edu/evalb/
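For readers who prefer code to a metric definition, the bracket-matching computation behind labelled precision, recall, and F1 can be sketched as below. It mirrors what evalb reports only at a high level: constituents are reduced to (label, start, end) triples over the same Node objects used in the earlier sketches, and none of evalb's own normalizations (e.g., punctuation handling) are applied.

from collections import Counter

def brackets(tree, start=0):
    # Yield (label, start, end) triples for the non-terminals of a tree,
    # assuming the Node class (.label, .children, .word) of the earlier sketches.
    if tree.word is not None:                     # terminal: consumes one word
        return [], start + 1
    result, pos = [], start
    for child in tree.children:
        child_brackets, pos = brackets(child, pos)
        result.extend(child_brackets)
    return [(tree.label, start, pos)] + result, pos

def labelled_prf(gold_trees, test_trees):
    matched = gold_total = test_total = 0
    for gold, test in zip(gold_trees, test_trees):
        g = Counter(brackets(gold)[0])
        t = Counter(brackets(test)[0])
        matched += sum((g & t).values())          # multiset intersection of brackets
        gold_total += sum(g.values())
        test_total += sum(t.values())
    precision = matched / test_total
    recall = matched / gold_total
    f1 = 2 * precision * recall / (precision + recall)
    return 100 * precision, 100 * recall, 100 * f1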
VII. EXPERIMENTS AND DISCUSSION

A. Baseline Parser

Our baseline parser is a discriminative constituency parser that runs on complete sentences. In order to make it run at its best, we set the learning factor η to 0.1 and the variance σ² to 0.1. We do 20 passes over the training set and use a batch size of 15 for the purposes of SGD. Table III shows the results we obtained with these settings on the development and test sets of the WSJ15 portion of the Penn Treebank.

TABLE III
RESULTS ON THE PENN TREEBANK.
Dataset    Precision    Recall    F1 score
Dev. Set   87.5         88.1      87.8
Test Set   86.4         86.4      86.4

Our baseline parser achieves an F1 score of 87.8% on the development set and 86.4% on the test set. Compared to the results obtained in [12], our implemented version's performance is a couple of points behind. The difference might be caused by small implementation details as well as by the different corpus that we used to calculate the distributional similarity clusters, as discussed in Section 4.

B. N-gram Parser

Before training the full parser with the n-gram parser using HJL, we test the stand-alone n-gram parsers in order to understand what they are good at and where they fail, especially with respect to the full parser. We experimented with seven different n-gram sizes, i.e., with n = [3, 9]. Even though there are hundreds of thousands of training instances for each parser available from the WSJOver15 portion of the Penn Treebank, we train our models with 20,000 instances due to time and computational constraints. For a statistically more reliable evaluation, we report the averages of the scores obtained from five randomly selected versions of each training set.

We use the same experimental setup as for the baseline full parser. However, we optimize the parameters specifically for n-gram parsing. To make the n-gram parser run at its best, we set the learning factor η to 0.05 and the variance σ² to 0.1. Instead of doing 20 iterations like we did for the full parser, we observe that 10 iterations are enough. We choose a batch size of 30, instead of 15, for the SGD. Both decisions are related to the fact that the n-gram trees are relatively smaller compared to the complete parse trees. Thus, an n-gram parser requires a larger batch of training instances, but takes fewer iterations to get to its best performance.

Table IV shows the averaged F1 scores obtained with all seven n-gram parsers on the development set. The comparison of our n-gram parsers with each other reveals a couple of interesting points. Firstly, using bigger n-gram trees in general leads to slightly higher F1 scores, but the increase in precision is more apparent. Secondly, the fact that the 3-gram parser achieves an F1 score of 86.5% by guessing 73.13% of the parse trees exactly suggests that finding the exact n-gram tree is mostly an easy job, yet a small set of 3-gram trees contain most of the errors. This observation does not hold for larger n values, since the parsing task becomes more difficult for bigger trees.
TABLE IV
RESULTS OF THE n-GRAM PARSERS FOR THE DEVELOPMENT SET.
Model Precision Recall F1 score Exact Avg CB No CB TagAcc
3-gram 86.37 86.60 86.49 73.13 0.02 98.15 86.80
4-gram 85.55 85.89 85.72 66.50 0.08 94.86 87.88
5-gram 86.91 86.29 86.60 61.68 0.10 93.02 89.95
6-gram 86.68 86.41 86.54 54.76 0.16 89.33 90.67
7-gram 87.49 87.00 87.24 51.29 0.21 86.52 92.17
8-gram 87.12 86.77 86.94 49.48 0.28 83.05 92.63
9-gram 87.69 86.96 87.33 46.68 0.35 79.41 92.91

In order to do further analysis, we investigate how accurately the n-gram parsers handle the different types of constituents. Table V shows the average F1 scores of each n-gram model for the most common constituents. The first thing to notice is the degrading performance on noun (NP) and prepositional (PP) phrases and the improving performance on verb phrases (VP) and declarative clauses (S) as n increases. When n increases, longer as well as more complex NPs and PPs are introduced. This results in degrading performance for such phrases. On the other hand, as the sizes of the n-gram trees increase, it becomes easier to handle long-range constituents like VPs and Ss, since the parser sees more of them in the training set. The same argument holds for the remaining types of constituents in Table V. Another interesting point is the significantly lower accuracies of the n-gram parsers on QPs, especially with smaller n-gram trees.

Table VI shows the performances of the n-gram parsers on the incomplete constituents, as well as the percentages of constituents that are incomplete and the percentages of unidentified constituents from the golden trees that are incomplete. In most cases, as n increases, the accuracy on the incomplete constituents decreases. The contribution of the incomplete constituents to the number of unidentified constituents decreases as well. However, this is more attributed to the fact that their percentage with respect to all constituents drops as n increases. Another point to notice is that, despite its high performance, more than a quarter of the constituents unidentified by the 3-gram parser are incomplete. Considering that the 3-gram parser predicts 73.13% of the parse trees completely, it is highly likely that the performance of the 3-gram parser is affected by such constituents.

TABLE VI
ACCURACY ON THE INCOMPLETE CONSTITUENTS IN THE DEVELOPMENT SET.
Model    % of Incomplete Const. in Golden Trees    Incomplete Const. Accuracy    % of Unidentified Incomplete Const. w.r.t. All Unidentifieds
3-gram   22.0    86.65    26.2
4-gram   17.4    85.78    21.1
5-gram   14.5    84.20    16.7
6-gram   12.3    83.41    15.2
7-gram   10.8    83.32    13.9
8-gram   9.5     83.92    11.5
9-gram   8.7     84.94    10.4

C. Jointly Trained Parser

In order to boost the accuracy of the full parser, we train it along with each n-gram parser. For the full and n-gram models, we use the previously used variance settings, that is, 0.1. We set the top model variance σ*² to 0.1 as well. We set the learning factors for the n-gram models and the top model to 0.1, whereas we use 0.05 for the full parsing model. With a lower learning rate, we make sure that the full parsing model starts to learn at a slower pace than usual so that it does not directly come under the influence of the accompanying n-gram model. As in the case of the baseline full parser, we do 20 passes over the training set, and at each iteration we update the parameters with a batch of 40 training instances gathered from all training sets in the setup.

In order to evaluate the effect of the training set size for each n-gram model, we use training sets of four different sizes for the n-gram parsers. We execute each experiment three times with randomly selected training sets. Table VII shows the averaged F1 scores obtained by the jointly trained full parser on the development and test sets of the WSJ15. The rows indicate which models are trained together, whereas each column corresponds to a training set of a different size for the n-gram model. In the case of the full parser, we use the standard training set of the Penn Treebank, which contains 9,753 instances.

Scores in bold in Table VII indicate that the value is significantly³ better than the baseline value according to the t-test. When we compare the results with the baseline F1 score of 87.8% on the development set and 86.4% on the test set, we observe slight improvements in some of the configurations. In general, the jointly trained full parser outperforms the baseline parser when it is trained alongside an n-gram parser that uses a relatively smaller training set, like 5,000 instances for the development set and 1,000 instances for the test set. The best results, though, are obtained by jointly training the baseline parser with the 9-gram parser. These results are statistically significantly better than the ones of the non-jointly trained full parser both for the development and test sets. In addition to these comparisons, we also observed that, within 20 iterations, the jointly trained full parser reaches its best performance faster than the baseline parser, which shows the push of the n-gram parser over the full parser.

³ The superscript * adjacent to the F1 scores indicates a significance of p<0.01. In the case of ** and ***, it is p<0.005 and p<0.001, respectively.
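The significance comparison could be reproduced along the following lines. Since the exact test configuration is not spelled out in the paper, the use of an unpaired two-sample t-test over per-run F1 scores is an assumption made only for illustration.

from scipy import stats

def significance(joint_f1_runs, baseline_f1_runs, alpha=0.01):
    # Two-sample t-test on per-run F1 scores; returns the p-value and whether
    # the jointly trained parser is significantly better at the given level.
    t_stat, p_value = stats.ttest_ind(joint_f1_runs, baseline_f1_runs)
    better = (sum(joint_f1_runs) / len(joint_f1_runs)
              > sum(baseline_f1_runs) / len(baseline_f1_runs))
    return p_value, better and p_value < alpha

# e.g. significance([88.19, 88.05, 88.12], [87.80, 87.76, 87.84])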
TABLE V
F1 SCORES ON THE MOST COMMON CONSTITUENTS FROM THE DEVELOPMENT SET.

Model NP VP PP S SBAR ADVP ADJP QP


3-gram 89.26 89.86 92.04 84.69 56.31 73.19 46.50 75.27
4-gram 88.02 89.79 90.72 83.52 61.43 73.23 48.23 75.77
5-gram 88.35 90.42 89.81 85.77 72.39 76.31 52.22 72.96
6-gram 87.65 91.04 89.03 86.11 71.49 75.16 52.00 74.34
7-gram 87.80 91.73 88.40 87.39 77.91 76.87 50.09 84.45
8-gram 87.22 92.29 87.81 87.18 76.43 77.84 49.05 86.73
9-gram 87.79 91.76 87.33 87.46 76.37 72.21 53.66 86.43
Full 88.77 89.81 88.79 90.91 80.56 79.42 59.31 94.21

TABLE VII
AVERAGED F1 SCORES OF THE BASELINE FULL PARSER (B) JOINTLY TRAINED WITH EACH n-GRAM MODEL.

Results for Dev. Set of the WSJ15 Results for Test Set of the WSJ15
Model(s) 1K 2K 5K 10K 1K 2K 5K 10K
B+3-gram 87.60 87.69 87.97 87.91 86.37 86.07 86.23 86.21
B+4-gram 87.93 87.98 87.99∗ 87.70 86.52 86.42∗∗∗ 86.44 86.54
B+5-gram 87.72 87.67 88.00 87.72 86.36 85.84 86.35 86.33
B+6-gram 87.88 87.73 88.12∗∗ 87.66 86.55 85.95 86.24 86.31
B+7-gram 87.83 87.94 88.05 87.72 86.58 86.16 86.24 86.42
B+8-gram 87.93 87.91 87.96 87.78 86.57∗∗ 86.45 86.16 86.43
B+9-gram 88.19∗ 87.89 87.89 87.86 86.46 86.42 86.44 86.59∗∗∗

TABLE VIII
F1 SCORES OF THE JOINTLY TRAINED PARSER ON THE MOST COMMON CONSTITUENTS IN THE DEV. SET.

Model(s) NP VP PP S SBAR ADVP ADJP QP


B+3-gram 88.91 89.87 89.39 90.20 80.09 80.05 64.51 92.44
B+4-gram 88.82 90.30 89.43 90.26 81.00 79.51 61.65 92.14
B+5-gram 89.10 90.21 89.23 90.22 79.44 79.56 61.23 92.68
B+6-gram 89.19 90.15 89.30 90.26 80.55 79.75 63.18 93.07
B+7-gram 89.04 90.19 89.15 90.35 80.73 79.31 62.93 93.03
B+8-gram 88.96 90.17 88.90 90.27 80.91 79.26 61.11 92.48
B+9-gram 88.86 90.14 89.06 90.12 79.45 79.07 62.38 92.74
Baseline (B) 88.77 89.81 88.79 90.91 80.56 79.42 59.31 94.21

We also analyze how accurately the jointly trained parser⁴ handles different constituent types. Table VIII shows the averaged F1 scores for the most common constituent types in the development set. The results indicate a couple of interesting reasons behind the slight improvement of the jointly trained full parser over the baseline. The first one is the slight improvement on NPs as the n-grams get bigger, which is especially visible with the best performing configuration among them, that is, the one with the 6-gram model. PPs and VPs are also better processed with almost all jointly trained models. The biggest improvement is seen with the adjective phrases (ADJPs), especially when smaller n-grams are used. Even though the impact of ADJPs on the overall result is small compared to other phrase types like NPs and PPs, this improvement is still worth mentioning. It is interesting to note that the same analysis on the stand-alone n-gram parsers reveals that they are not that good with ADJPs. Another thing to notice is the degrading performance on QPs, as well as on SBAR and S type constituents, due to the fact that the n-gram parsers perform relatively worse on them (see Table V).

⁴ Each accompanying n-gram parser in the HJL setup uses 5,000 training instances.

CONCLUSION AND FUTURE WORK

In this paper, we introduced n-gram parsing and analyzed how it is different from full sentence parsing. We observed that the bigger the n-grams we use, the better accuracies we get, mostly due to increasing context information. We showed that the n-gram parsers are better than the full parser at parsing NPs, VPs, and PPs, but worse at parsing Ss and SBARs. After analyzing the stand-alone n-gram parsers, we used them for jointly training a full discriminative parser in the HJL setup in order to boost the accuracy of the full parser. We achieved a statistically significant improvement over the baseline scores. The analysis of the results obtained with the jointly trained parser revealed that the resulting parser is better at processing NPs, VPs, PPs, and, surprisingly, ADJPs. However, it is negatively influenced by the performance of the n-gram parser on constituents like S and SBAR. Furthermore, it achieves its best performance faster than the baseline parser, indicating yet another benefit of training alongside an n-gram parser.

As future work, we plan to improve our baseline parser in order to make the jointly trained parser more competitive with respect to its peers in the literature. We will explore new approaches for selecting better n-gram trees to improve the quality of the training data. We also plan to use multiple n-gram parsers in joint training instead of just one. In addition, we will use the n-gram trees and the HJL setup to build a
self-trained parser by expanding the n-gram parser's training data with n-gram trees extracted from the output of the full sentence parser. This will enable the full sentence parser to be indirectly trained with its own output.

ACKNOWLEDGMENTS
We thank Brian Roark and Suzan Üskudarlı for their in-
valuable feedback. This work was supported by the Boğaziçi
University Research Fund 12A01P6.

REFERENCES
[1] Kasami, T.: An efficient recognition and syntax-analysis algorithm for
context-free languages. Technical report, Air Force Cambridge Research
Lab (1965)
[2] Earley, J.: An efficient context-free parsing algorithm. Communications
of the ACM 13(2) (1970) 94–102
[3] Collins, M.: Head-driven statistical models for natural language pars-
ing. PhD thesis, Department of Computer and Information Science,
University of Pennsylvania (1999)
[4] Charniak, E.: Statistical parsing with a context-free grammar and word
statistics. Proceedings of AAAI-97 (1997) 598–603
[5] Ratnaparkhi, A.: Learning to parse natural language with maximum
entropy models. Machine Learning 34(1-3) (1999) 151–175
[6] Charniak, E.: A maximum-entropy-inspired parser. Proceedings of the
North American Association of Computational Linguistics (2000)
[7] Collins, M.: Discriminative reranking for natural language parsing.
Proceedings of ICML-00 (2000) 175–182
[8] Huang, L.: Forest reranking: Discriminative parsing with non-local
features. Proceedings of Ninth International Workshop on Parsing
Technology (2005) 53–64
[9] McClosky, D., Charniak, E., Johnson, M.: Effective self-training for
parsing. Proceedings of HLT-NAACL (2006)
[10] Abney, S.: Part-of-speech tagging and partial parsing. Corpus-Based
Methods in Language and Speech Processing, Kluwer Academic Pub-
lishers, Dordrecht (1999)
[11] Finkel, J.R., Manning, C.D.: Hierarchical joint learning: Improving
joint parsing and named entity recognition with non-jointly labeled data.
Proceedings of ACL 2010 (2010)
[12] Finkel, J.R., Kleeman, A., Manning, C.D.: Efficient, feature-based
conditional random field parsing. Proceedings of ACL/HLT-2008 (2008)
[13] Marcus, M., Santorini, B., Marcinkiewicz, M.A.: Building a large
annotated corpus of English: The Penn Treebank. Computational
Linguistics 19(2) (1993) 313–330
[14] Ratnaparkhi, A.: A linear observed time statistical parser based on
maximum entropy models. Proceedings of EMNLP (1997) 1–10
[15] Henderson, J.: Discriminative training of a neural network statistical
parser. 42nd ACL (2004) 96–103
[16] Zhang, H., Zhang, M., Tan, C.L., Li, H.: K-best combination of syntactic
parsers. Proceedings of EMNLP 2009 (2009) 1552–1560
[17] Fossum, V., Knight, K.: Combining constituent parsers. Proceedings of
NAACL 2009 (2009) 253–256
[18] Daumé III, H., Marcu, D.: Domain adaptation for statistical classifiers. Journal
of Artificial Intelligence Research (2006)
[19] Finkel, J.R., Manning, C.D.: Nested named entity recognition. Proceed-
ings of EMNLP 2009 (2009)
[20] Finkel, J.R., Manning, C.D.: Joint parsing and named entity recogni-
tion. Proceedings of the North American Association of Computational
Linguistics (2009)
[21] Bod, R., Scha, R., Sima’an, K.: Data oriented parsing. CSLI Publica-
tions, Stanford University (2003)
[22] Joshi, A., Levy, L., Takahashi, M.: Tree adjunct grammars. Journal of
the Computer and System Sciences 10:1 (1975) 136–163
[23] Bansal, M., Klein, D.: Web-scale features for full-scale parsing.
Proceedings of 49th Annual Meeting of ACL: HLT (2011) 693–702
[24] Clark, A.: Combining distributional and morphological information for
part of speech induction. Proceedings of the tenth Annual Meeting of
the European Association for Computational Linguistics (EACL) (2003)
59–66
[25] Rose, T., Stevenson, M., Whitehead, M.: The Reuters Corpus Volume
1 - from Yesterday’s News to Tomorrow’s Language Resources. Pro-
ceedings of the 3rd international conference on language resources and
evaluation. (2002)
