Word2Vec: Effect of window size used

I am trying to train a word2vec model on very short phrases (5 grams). Since each sentence or example is very short, I believe the window size I can use can atmost be 2. I am trying to understand what the implications of such a small window size are on the quality of the learned model, so that I can understand whether my model has learnt something meaningful or not. I tried training a word2vec model on 5-grams but it appears the learnt model does not capture semantics etc very well.

I am using the following test to evaluate the accuracy of model: https://code.google.com/p/word2vec/source/browse/trunk/questions-words.txt

I used gensim.Word2Vec to train a model and here is a snippet of my accuracy scores (using a window size of 2)

[{'correct': 2, 'incorrect': 304, 'section': 'capital-common-countries'},
 {'correct': 2, 'incorrect': 453, 'section': 'capital-world'},
 {'correct': 0, 'incorrect': 86, 'section': 'currency'},
 {'correct': 2, 'incorrect': 703, 'section': 'city-in-state'},
 {'correct': 123, 'incorrect': 183, 'section': 'family'},
 {'correct': 21, 'incorrect': 791, 'section': 'gram1-adjective-to-adverb'},
 {'correct': 8, 'incorrect': 544, 'section': 'gram2-opposite'},
 {'correct': 284, 'incorrect': 976, 'section': 'gram3-comparative'},
 {'correct': 67, 'incorrect': 863, 'section': 'gram4-superlative'},
 {'correct': 41, 'incorrect': 951, 'section': 'gram5-present-participle'},
 {'correct': 6, 'incorrect': 1089, 'section': 'gram6-nationality-adjective'},
 {'correct': 171, 'incorrect': 1389, 'section': 'gram7-past-tense'},
 {'correct': 56, 'incorrect': 936, 'section': 'gram8-plural'},
 {'correct': 52, 'incorrect': 705, 'section': 'gram9-plural-verbs'},
 {'correct': 835, 'incorrect': 9973, 'section': 'total'}]

I also tried running the demo-word-accuracy.sh script outlined here with a window size of 2 and get poor accuracy as well:

Sample output:
    capital-common-countries:
    ACCURACY TOP1: 19.37 %  (98 / 506)
    Total accuracy: 19.37 %   Semantic accuracy: 19.37 %   Syntactic accuracy: -nan % 
    capital-world:
    ACCURACY TOP1: 10.26 %  (149 / 1452)
    Total accuracy: 12.61 %   Semantic accuracy: 12.61 %   Syntactic accuracy: -nan % 
    currency:
    ACCURACY TOP1: 6.34 %  (17 / 268)
    Total accuracy: 11.86 %   Semantic accuracy: 11.86 %   Syntactic accuracy: -nan % 
    city-in-state:
    ACCURACY TOP1: 11.78 %  (185 / 1571)
    Total accuracy: 11.83 %   Semantic accuracy: 11.83 %   Syntactic accuracy: -nan % 
    family:
    ACCURACY TOP1: 57.19 %  (175 / 306)
    Total accuracy: 15.21 %   Semantic accuracy: 15.21 %   Syntactic accuracy: -nan % 
    gram1-adjective-to-adverb:
    ACCURACY TOP1: 6.48 %  (49 / 756)
    Total accuracy: 13.85 %   Semantic accuracy: 15.21 %   Syntactic accuracy: 6.48 % 
    gram2-opposite:
    ACCURACY TOP1: 17.97 %  (55 / 306)
    Total accuracy: 14.09 %   Semantic accuracy: 15.21 %   Syntactic accuracy: 9.79 % 
    gram3-comparative:
    ACCURACY TOP1: 34.68 %  (437 / 1260)
    Total accuracy: 18.13 %   Semantic accuracy: 15.21 %   Syntactic accuracy: 23.30 % 
    gram4-superlative:
    ACCURACY TOP1: 14.82 %  (75 / 506)
    Total accuracy: 17.89 %   Semantic accuracy: 15.21 %   Syntactic accuracy: 21.78 % 
    gram5-present-participle:
    ACCURACY TOP1: 19.96 %  (198 / 992)
    Total accuracy: 18.15 %   Semantic accuracy: 15.21 %   Syntactic accuracy: 21.31 % 
    gram6-nationality-adjective:
    ACCURACY TOP1: 35.81 %  (491 / 1371)
    Total accuracy: 20.76 %   Semantic accuracy: 15.21 %   Syntactic accuracy: 25.14 % 
    gram7-past-tense:
    ACCURACY TOP1: 19.67 %  (262 / 1332)
    Total accuracy: 20.62 %   Semantic accuracy: 15.21 %   Syntactic accuracy: 24.02 % 
    gram8-plural:
    ACCURACY TOP1: 35.38 %  (351 / 992)
    Total accuracy: 21.88 %   Semantic accuracy: 15.21 %   Syntactic accuracy: 25.52 % 
    gram9-plural-verbs:
    ACCURACY TOP1: 20.00 %  (130 / 650)
    Total accuracy: 21.78 %   Semantic accuracy: 15.21 %   Syntactic accuracy: 25.08 % 
    Questions seen / total: 12268 19544   62.77 %

However the word2vec site claims its possible to obtain an accuracy of ~60% on these tasks. Hence I would like to gain some insights into the effect of these hyperparameters like window size and how they affect quality of learnt models.

Solution

To your question: "I am trying to understand what the implications of such a small window size are on the quality of the learned model".

For example "stackoverflow great website for programmers" with 5 words (suppose we save the stop words great and for here) if the window size is 2 then the vector of word "stackoverflow" is directly affected by the word "great" and "website", if the window size is 5 "stackoverflow" can be directly affected by two more words "for" and "programmers". The 'affected' here means it will pull the vector of two words closer.

So it depends on the material you are using for training, if the window size of 2 can capture the context of a word, but 5 is chosen, it will decrease the quality of the learnt model, and vise versa.