
SciSpacy equivalent of Gensim's functions/parameters


With Gensim, there are three functions I use regularly, for example this one:

model = gensim.models.Word2Vec(corpus,size=100,min_count=5)

I understand the output from Gensim, but I cannot work out how to set the size and min_count parameters in the equivalent SciSpacy command of:

model = spacy.load('en_core_web_md')

(The output is a model of embeddings, too big to add here.)

This is another command I regularly use:

model.most_similar(positive=['car'])

and this is the output from Gensim / the expected output from SciSpacy:

[('vehicle', 0.7857330441474915),
 ('motorbike', 0.7572781443595886),
 ('train', 0.7457204461097717),
 ('honda', 0.7383008003234863),
 ('volkswagen', 0.7298516035079956),
 ('mini', 0.7158907651901245),
 ('drive', 0.7093928456306458),
 ('driving', 0.7084407806396484),
 ('road', 0.7001082897186279),
 ('traffic', 0.6991947889328003)]

This is the third command I regularly use:

print(model.wv['car'])

Output from Gensim / expected output from SciSpacy (in reality this vector has length 100):

    [ 1.0942473   2.5680697  -0.43163642 -1.171171    1.8553845  -0.3164575
  1.3645878  -0.5003705   2.912658    3.099512    2.0184739  -1.2413547
  0.9156444  -0.08406237 -2.2248871   2.0038593   0.8751471   0.8953876
  0.2207374  -0.157277   -1.4984075   0.49289042 -0.01171476 -0.57937795...]

Could someone show me the equivalent commands for SciSpacy? For example, for 'gensim.models.Word2Vec' I can't find how to specify the length of the vectors (the size parameter) or the minimum number of times a word must appear in the corpus (min_count) in SciSpacy (e.g. I looked here and here), but I'm not sure whether I'm missing them.


Solution

  • A possible way to achieve your goal would be to:

    1. parse your documents via nlp.pipe
    2. collect all the words and pairwise similarities
    3. process similarities to get the desired results

    Let's prepare some data:

    import numpy as np   # needed below for the similarity matrix
    import spacy

    nlp = spacy.load("en_core_web_md", disable=['ner', 'tagger', 'parser'])
    

    Then, to get a vector, as in model.wv['car'], one would do:

    nlp("car").vector
    

    To get the most similar words, as with model.most_similar(positive=['car']), let's process the corpus:

    corpus = ["This is a sentence about cars. This is a sentence about a train",
              "And this is a sentence about a bike"]
    docs = nlp.pipe(corpus)
    
    tokens = []
    tokens_orth = []
    
    for doc in docs:
        for tok in doc:
            if tok.orth_ not in tokens_orth:
                tokens.append(tok)
                tokens_orth.append(tok.orth_)
                
    sims = np.zeros((len(tokens),len(tokens)))
    
    for i, tok in enumerate(tokens):
        sims[i] = [tok.similarity(tok_) for tok_ in tokens]
    

    Then to retrieve top=3 most similar words:

    def most_similar(word, tokens_orth = tokens_orth, sims=sims, top=3):
        tokens_orth = np.array(tokens_orth)
        id_word = np.where(tokens_orth == word)[0][0]
        sim = sims[id_word]
        id_ms = np.argsort(sim)[:-top-1:-1]
        return list(zip(tokens_orth[id_ms], sim[id_ms]))
    
    
    most_similar("This")
    

    [('this', 1.0000001192092896), ('This', 1.0), ('is', 0.5970357656478882)]
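    As a side note, the tok.similarity double loop above makes O(n²) Python-level calls; for larger corpora it may be faster to stack the token vectors and compute all cosine similarities with one matrix product. A minimal NumPy sketch (random vectors stand in here for the spaCy tok.vector arrays, so it stays self-contained):

```python
import numpy as np

# Stand-in for the stacked token embeddings; in practice this would be
# np.array([tok.vector for tok in tokens]) from the spaCy pipeline above.
rng = np.random.default_rng(0)
vecs = rng.normal(size=(5, 300))  # 5 tokens, 300-dim vectors

# Normalize each row, guarding against all-zero (out-of-vocabulary) vectors.
norms = np.linalg.norm(vecs, axis=1, keepdims=True)
norms[norms == 0] = 1.0
unit = vecs / norms

# Full cosine-similarity matrix in a single matrix product.
sims = unit @ unit.T
```

    The resulting sims matrix can be fed to the same most_similar function as before.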
    

    PS

    I have also noticed you asked about specifying the dimension and frequency. The embedding length is fixed when the model is initialized, so it cannot be changed afterwards. You can start from a blank model if you wish and feed it embeddings you are comfortable with. As for frequency, it is doable: count all the words and throw away anything below the desired threshold. But note that the underlying embeddings will still come from unfiltered text. SpaCy differs from Gensim in that it ships with readily available embeddings, whereas Gensim trains them.
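    The frequency filtering mentioned above can be sketched with a plain collections.Counter; filter_by_frequency is a hypothetical helper (not part of spaCy) mimicking Gensim's min_count on top of spaCy's fixed embeddings:

```python
from collections import Counter

def filter_by_frequency(words, min_count=5):
    """Hypothetical helper: drop words seen fewer than min_count times,
    mimicking Gensim's min_count parameter."""
    counts = Counter(words)
    return [w for w in words if counts[w] >= min_count]

words = ["car"] * 6 + ["train"] * 5 + ["bike"] * 2
print(filter_by_frequency(words))  # "bike" occurs only twice and is dropped
```

    Applying this to the token list before building the similarity matrix would keep rare words out of the most_similar results.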