A topic that often comes up on the discussions forum is spaCy's Vocab object and its vectors. So let's go over a few properties of vectors currently found in medium (md) and large (lg) models.

One of the main features of the Vocab in spaCy is the vector store. This is the single place where pre-trained word embeddings can be found. Having a single place for these vectors saves a lot of memory!
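
Here's a minimal sketch of that idea, assuming spaCy v3 is installed. It uses a blank pipeline and a made-up 3-dimensional vector (not a real pre-trained embedding) to show that tokens from different docs read from the same store.

```python
import numpy as np
import spacy

nlp = spacy.blank("en")  # blank pipeline: ships without pre-trained vectors
# Made-up vector, for illustration only.
nlp.vocab.set_vector("coffee", np.asarray([1.0, 2.0, 3.0], dtype="float32"))

doc_a = nlp("coffee please")
doc_b = nlp("more coffee")

# Every Doc created by this pipeline holds a reference to the same Vocab.
shared_vocab = doc_a.vocab is nlp.vocab and doc_b.vocab is nlp.vocab
# Both "coffee" tokens therefore resolve to the same stored vector.
same_vector = np.array_equal(doc_a[0].vector, doc_b[1].vector)
matches_store = np.array_equal(doc_a[0].vector, nlp.vocab.get_vector("coffee"))
print(shared_vocab, same_vector, matches_store)
```

With a real md or lg pipeline the lookup works the same way; only the table is much bigger.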

It's *not* always the case that each word stored in the vector lookup has a unique vector though! spaCy allows for some pruning in order to save disk space and memory. When two strings have similar vectors, they may get merged together, so that both keys point to the same vector. The medium (md) models typically do this.
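
To make the effect of pruning concrete, here's a small sketch with a hand-built table, assuming spaCy v3. The words and vectors are invented, and the merge is done by hand with `Vectors.add(..., row=...)` rather than by spaCy's actual pruning routine.

```python
import numpy as np
from spacy.vectors import Vectors

# Two made-up 3-dimensional vectors; real tables are far larger.
data = np.asarray([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], dtype="float32")
vectors = Vectors(data=data, keys=["cat", "car"])

# Point a third key at the row that already belongs to "cat".
# This mimics what a pruned table looks like: more keys than rows.
vectors.add("kitten", row=0)

n_keys = vectors.n_keys     # number of keys in the lookup
n_rows = vectors.shape[0]   # number of stored vectors
print(n_keys, n_rows)       # 3 2
```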

You can inspect the meta information of the spaCy models to get an impression of how much pruning has been done. The large/medium models currently always have the same number of keys, but they differ in the number of vectors.
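
As a rough sketch of what that looks like, assuming spaCy v3: the snippet below builds a toy table by hand (made-up words and vectors) and reads the `"vectors"` entry from `nlp.meta`. On a downloaded pipeline you would read the same entry after `spacy.load(...)`.

```python
import numpy as np
import spacy
from spacy.vectors import Vectors

nlp = spacy.blank("en")

# Made-up table: two rows, three keys (one key reuses a row, as after pruning).
data = np.asarray([[1.0, 0.0], [0.0, 1.0]], dtype="float32")
table = Vectors(data=data, keys=["cat", "car"])
table.add("kitten", row=0)
nlp.vocab.vectors = table

# The meta reports the number of keys, the number of vectors and their width.
stats = nlp.meta["vectors"]
print(stats["keys"], stats["vectors"], stats["width"])  # 3 2 2
```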

If you're curious: you can actually look for pruned vectors by looping over the vectors table.
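
One way to do this, sketched on a toy table and assuming spaCy v3, is to group the keys by the row they point to. The words and vectors are invented; on a real pipeline you would loop over `nlp.vocab.vectors.key2row` and translate hashes with `nlp.vocab.strings`.

```python
from collections import defaultdict

import numpy as np
from spacy.strings import StringStore
from spacy.vectors import Vectors

# Toy setup: "kitten" shares a row with "cat", as if it had been pruned.
strings = StringStore(["cat", "car", "kitten"])
data = np.asarray([[1.0, 0.0], [0.0, 1.0]], dtype="float32")
vectors = Vectors(data=data, keys=["cat", "car"])
vectors.add("kitten", row=0)

# key2row maps each hash key to a row index; collect the words per row.
words_by_row = defaultdict(list)
for key, row in vectors.key2row.items():
    words_by_row[row].append(strings[key])

# Rows with more than one word are shared between keys.
shared = {row: sorted(words) for row, words in words_by_row.items() if len(words) > 1}
print(shared)  # {0: ['cat', 'kitten']}
```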

The small (sm) spaCy models don't ship with vectors. When you call .vector on these tokens you still get a numeric vector but it's a fallback to the internal Tok2Vec tensor. More details on the difference are discussed here.
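
The fallback can be illustrated without downloading a model. This is only a sketch, assuming spaCy v3: the `doc.tensor` below is assigned by hand with made-up numbers to stand in for what a trained tok2vec component would produce.

```python
import numpy as np
import spacy

nlp = spacy.blank("en")  # no vectors in the vocab, like the sm models
doc = nlp("hello world")
empty_table = nlp.vocab.vectors.size == 0

# Hand-written stand-in for the tensor a trained pipeline would set.
doc.tensor = np.asarray([[0.1, 0.2], [0.3, 0.4]], dtype="float32")

# With an empty vector table, .vector returns the token's row of doc.tensor.
falls_back = np.array_equal(doc[1].vector, doc.tensor[1])
print(empty_table, falls_back)
```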

It's also important to understand that these vectors do *not* carry any context. The same string may have different meanings in different sentences, and the .vector property does not capture this!
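
A quick sketch of this, assuming spaCy v3 and a made-up vector for "bank": the token gets the exact same vector in both sentences, whatever it means there.

```python
import numpy as np
import spacy

nlp = spacy.blank("en")
# Made-up values, for illustration only.
nlp.vocab.set_vector("bank", np.asarray([0.5, -0.5, 1.0], dtype="float32"))

river_doc = nlp("We sat on the river bank")
money_doc = nlp("The bank approved the loan")

river_bank = [t for t in river_doc if t.text == "bank"][0]
money_bank = [t for t in money_doc if t.text == "bank"][0]

# The lookup only depends on the string, not on the surrounding words.
context_free = np.array_equal(river_bank.vector, money_bank.vector)
print(context_free)  # True
```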

These vector tables can be used to calculate similarity statistics, but they are also used to determine if a word is "out of vocabulary" via the .is_oov property. If a string does not appear in the vector table, the .is_oov property returns True.
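
Here's a small sketch of that behaviour, assuming spaCy v3. The vector is made up and "flurbz" is an invented word that has no entry in the table.

```python
import numpy as np
import spacy

nlp = spacy.blank("en")
nlp.vocab.set_vector("coffee", np.asarray([1.0, 2.0, 3.0], dtype="float32"))

doc = nlp("coffee flurbz")
# is_oov is True exactly for the tokens without an entry in the vector table.
oov_flags = {token.text: token.is_oov for token in doc}
print(oov_flags)  # {'coffee': False, 'flurbz': True}
```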

You might think that the vectors table is a dictionary that matches strings to vectors. Practically, it does. But that's not how it's implemented internally! If you ask for the .keys() you get hash values instead of strings.
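
You can check this on a toy table. A minimal sketch, assuming spaCy v3 and made-up vectors:

```python
import numpy as np
from spacy.vectors import Vectors

data = np.asarray([[1.0, 0.0], [0.0, 1.0]], dtype="float32")  # made-up vectors
vectors = Vectors(data=data, keys=["cat", "car"])

keys = list(vectors.keys())
# The table was built from strings, but it only stores their hashes.
contains_strings = any(isinstance(key, str) for key in keys)
print(keys, contains_strings)
```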

This is where the StringStore makes an appearance. This object handles all the translation from hash to string and from string to hash. Using hashes makes everything much faster and lighter, so we need an object to handle the translation.
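
The translation in both directions can be sketched like this, assuming spaCy v3 and a blank English pipeline:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I love coffee")  # processing the text adds its strings to the store

coffee_hash = nlp.vocab.strings["coffee"]      # string -> hash
coffee_text = nlp.vocab.strings[coffee_hash]   # hash -> string

# Tokens carry the hash too: .orth is the hash, .text is the string.
token_matches = doc[2].orth == coffee_hash and doc[2].text == coffee_text
print(coffee_hash, coffee_text, token_matches)
```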

You typically won't interact with this StringStore yourself because it's more of an implementation detail, but it's good to understand that there's a mechanism that deals with the translation between hash and string.

It's worth stressing: the StringStore does *not* determine if a word is OOV! There can be strings in the StringStore that don't have vectors. The StringStore is really just an object that looks up strings by 64-bit hashes.
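
A short sketch of the difference, assuming spaCy v3. The vector is made up and "flurbz" is an invented word that we add to the StringStore by hand.

```python
import numpy as np
import spacy

nlp = spacy.blank("en")
nlp.vocab.set_vector("coffee", np.asarray([1.0, 2.0, 3.0], dtype="float32"))

# The string is now known to the StringStore, but it has no vector.
nlp.vocab.strings.add("flurbz")

in_string_store = "flurbz" in nlp.vocab.strings
has_vector = nlp.vocab.has_vector("flurbz")
is_oov = nlp("flurbz")[0].is_oov
print(in_string_store, has_vector, is_oov)  # True False True
```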

With the Vocab's StringStore and Vectors, Token objects can fetch lexical properties from a single place in memory. This helps keep things lightweight and fast.

We hope this thread helped explain some internal details! We might be interested in doing more of these long threads in the future. So if there are general topics that you'd like to see explored in more detail, let us know!

If you need help with an NLP pipeline that utilizes spaCy, we are happy to help you with our new services offering, spaCy Tailored Pipelines. The spaCy team will build you a custom natural language processing pipeline, delivered in a standardized format using spaCy’s projects system.

Philip Vollet (Head of Developer Growth at Weaviate), 2y: If you want to understand spaCy better you should check this awesome article by 👋 Vincent Warmerdam