Search code examples
nlpword-embeddingfasttext

Subword vector in fastText?


I can't figure out what a subword input vector is. I read in the newspaper that the subword is hashed, the subword is the hash code, hash code is a number, not a vector

Ex: Input vector of word eating is [0,0,0,1,0,0,0,0,0] So what is the input vector of subwords "eat", "ati", "ing",...?

Link paper: https://arxiv.org/pdf/1607.04606.pdf

enter image description here

the subword is the hash code, hash code is a number, not a vector


Solution

  • The FastText subwords are, as you've suggested, fragments of the full word. For the purposes of subword creation, FastText will also prepend/append special start-of-word and end-of-word characters. (If I recall correctly, it uses < & >.)

    So, for the full word token 'eating', it is considered as '<eating>'.

    All the 3-character subwords would be '<ea', 'eat', 'ati', 'tin', 'ing', 'ng>'.

    All the 4-character subwords would be '<eat', 'atin', 'ting', 'ing>'.

    All the 5-character subwords would be '<eati', 'ating', 'ting>'.

    I see you've written out a "one-hot" representation of the full word 'eating'[0,0,0,1,0,0,0,0,0] – as if 'eating' is the 4th word in a 9-word vocabulary. While diagrams & certain ways of thnking about the underlying model may consider such a one-hot vector, it's useful to realize that in actual code implementations, such a sparse one-hot vector for words is never actually created.

    Instead, it's just represented as a single number – the index to the non-zero number. That's used as a lookup into an array of vectors of the configured 'dense' size, returning one input word-vector of that size for the word.

    For example, imagine you have a model with a 1-million word known vocabulary, which offers 100-dimensional 'dense embedding' word-vectors. The word 'eating' is the 543,210th word.

    That model will have an array of input-vectors that's has one million slots, and each slot has a 100-dimensional vector in it. We could call it word_vectors_in. The word 'eating''s vector will be at word_vectors_in[543209] (beccause the 1st vector is at word_vectors_in[0]).

    At no point during the creation/training/use of this model will an actual 1-million-long one-hot vector for 'eating' be created. Most often, it'll just be referred-to inside the code as the word-index 543209. The model will have a helper lookup dictionary/hashmap, let's call it word_index that lets code find the right slot for a word. So word_index['eating'] will be 543209.

    OK, now to your actual question, about the subwords. I've detailed how the the single vectors per one known full word are stored, above, in order to contrast it with the different way subwords are handled.

    Subwords are also stored in a big array of vectors, but that array is treated as a collision-oblivious hashtable. That is, by design, many subwords can and do all reuse the same slot.

    Let's call that big array of subword vectors subword_vector_in. Let's also make it 1 million slots long, where each slot has a 100-dimensional vector.

    But now, there is no dictionary that remembers which subwords are in which slots - for example, remembering that subword '<eat' is in arbitrary slot 78789.

    Instead, the string '<eat' is hashed to a number, that number is restricted to the possible indexes into the subwords, and the vector at that index, let's say it's 12344, is used for the subword.

    And then when some other subword comes along, maybe '<dri', it might hash to the exact-same 12344 slot. And that same vector then gets adjusted for that other subword (during training), or returned for both those subwords (and possibly many others) during later FastText-vector synthesis from the finali model.

    Notably, now even if there are far more than 1-million unique subwords, they can all be represented inside that single 1-million slot array, albeit with collisions/interference.

    In practice, the collisions are tolerable because many collisions from very-rare subwords essentially just fuzz slots with lots of random noise that mostly cancels out. For the most-common subwords, that tend to carry any unique meaning because of the way word-roots/prefixes/suffixes hint at word meaning in English & similar langauges, those very-common examples overpower the other noise, and ensure that slot, for at least one or more of its most-common subwords, carries at least some hint of the subword's implied meaning(s).

    So when FastText assembles its final word-vector, by adding:

    word_vector_in[word_index['eating']]  # learned known-word vector
    + subword_vector_in[slot_hash('<ea')]  # 1st 3-char subword
    + subword_vector_in[slot_hash('eat')]
    + subword_vector_in[slot_hash('ati')]
    ... # other 3-char subwords
    ... # every 4-char subword
    ... # other 5-char subwords
    + subword_vector_in[slot_hash('ting>')]  # last 5-char subword
    

    …it gets something that's dominated by the (likely stronger-in-magnitude) known full-word vector, with some useful hints of meaning also contributed by the (probably lower-magnitude) many noisy subword vectors.

    And then if we were to imagine that some other word that's not part of the known 1-million word vocabulary comes along, say 'eatery', it has nothing from word_vector_in for the full word, but it can still do:

    subword_vector_in[slot_hash('<ea')]  # 1st 3-char subword
    + subword_vector_in[slot_hash('eat')]  
    + subword_vector_in[slot_hash('ate')]
    ... # other 3-char subwords
    ... # every 4-char subword
    ... # other 5-char subwords
    + subword_vector_in[slot_hash('tery>')]  # last 5-char subword
    

    Because at least a few of those subwords likely include some meaningful hints of the meaning of the word 'eatery' – especially meanings around 'eat' or even the venue/vendor aspects of the suffix -tery, this synthesized guess for an out-of-vocabulary (OOV) word will be better than a random vector, & often better than ignoring the word entirely in whatever upper-level process is using the FastText vectors.