How to modify word2vec code to build embeddings for tab-delimited sequences of phrases?

Given a text file with lines as follows:

Phrase foo\tPhrase bla\tPhrase blabla\t...
Phrase bar\tPhrase blabla\tPhrase blablabla\t...

where each line is a tab-delimited sequence of phrases, which can contain spaces and other special characters. We are interested in embeddings at the phrase level, NOT the word level.

The current word2vec.c supports "space", "tab", and "new line" as delimiters. How can I disable "space" and keep only "tab" and "new line" as delimiters in word2vec.c in this case?

I got word2vec.c from Tomas Mikolov's GitHub repository.

Solution

The line https://github.com/tmikolov/word2vec/blob/20c129af10659f7c50e86e3be406df663beff438/word2vec.c#L80 defines the delimiters in word2vec.c; if you're compiling that file yourself, you could edit that line and re-compile to make it behave differently.

But it'd be easier, and more robust if you ever switch to some other word2vec implementation, to simply pre-process your text into the form the tool already expects. For example, you might change all spaces ' ' inside each phrase to underscores '_' (or some other plug character, if any original underscores are important to keep distinct), so that every phrase becomes a single token.
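As a rough illustration of that pre-processing step, here is a minimal Python sketch. The function names (`phrases_to_tokens`, `convert_lines`) and the `plug` parameter are made up for this example, and it assumes you are happy to collapse runs of whitespace inside a phrase, drop empty fields, and emit the tokens separated by single spaces (which word2vec.c already treats as a delimiter).

```python
def phrases_to_tokens(line, plug="_"):
    """Turn one tab-delimited line of phrases into space-delimited single tokens."""
    # Split on tabs only, so spaces stay inside their phrase for now.
    phrases = line.rstrip("\n").split("\t")
    tokens = []
    for phrase in phrases:
        words = phrase.split()  # split on any whitespace inside the phrase
        if words:               # skip empty fields such as a trailing tab
            tokens.append(plug.join(words))
    return " ".join(tokens)


def convert_lines(lines, plug="_"):
    """Lazily convert an iterable of raw lines (e.g. an open file)."""
    for line in lines:
        yield phrases_to_tokens(line, plug)


if __name__ == "__main__":
    sample = ["Phrase foo\tPhrase bla\tPhrase blabla\n"]
    for converted in convert_lines(sample):
        print(converted)
```

To convert a whole corpus you could pass an open input file to `convert_lines` and write each result, plus a newline, to the file you then feed to word2vec. If underscores already occur in your phrases, pass a different `plug` that never appears in the data.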

When later interpreting the results, remember to apply the same space-to-underscore transform to any phrase you look up, or reverse it by replacing underscores with spaces when displaying results.
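A small sketch of that round trip, again with hypothetical helper names (`phrase_to_key`, `key_to_phrase`) and a plain dict with placeholder numbers standing in for whatever vector store your word2vec implementation gives you. Note the reverse mapping is only exact if the plug character never appeared in the original phrases.

```python
def phrase_to_key(phrase, plug="_"):
    """Apply the same transform used on the training text before a lookup."""
    return plug.join(phrase.split())


def key_to_phrase(key, plug="_"):
    """Undo the transform for display (assumes no plug chars in the original)."""
    return key.replace(plug, " ")


# Placeholder values only; a real model would supply the trained vectors.
vectors = {
    "Phrase_foo": [0.1, 0.2],
    "Phrase_bla": [0.3, 0.4],
}

if __name__ == "__main__":
    key = phrase_to_key("Phrase foo")
    if key in vectors:
        print(key_to_phrase(key), vectors[key])
```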