Tags: nlp, word-embedding, fasttext

fasttext: why do aligned vectors contain only one value per word?


I was taking a look at the fastText aligned vectors for some languages and was surprised to find that each vector seemed to consist of only one value. I was expecting a matrix with a multidimensional vector belonging to each word; instead there is only one column of numbers. I'm very new to this field and was wondering if somebody could explain how this single number belonging to each word came to be, and whether I'm looking at a semantic space as I was expecting or something different (if so, what is it, and are aligned multidimensional semantic spaces available somewhere?)


Solution

  • I think you may be misinterpreting those files.

    When I look at one of those files – for example wiki.en.align.vec – each line is a word-token, then 300 different values (to provide a 300-dimensional word-vector).

    For example, the 4th line of the file is:

    the -0.0324 -0.0462 -0.0087 0.0994 0.0147 -0.0198 -0.0811 -0.0362 0.0445 0.0402 -0.0199 -0.1173 0.0906 -0.0304 -0.0320 -0.0374 -0.0249 -0.0099 0.0017 0.0719 -0.0834 0.0382 -0.1141 -0.0288 -0.0666 -0.0365 -0.0006 0.0098 0.0282 0.0310 -0.0773 0.0755 -0.0528 0.1225 -0.0138 -0.0879 0.0036 -0.0593 0.0416 -0.0588 0.0266 -0.0011 -0.0419 0.0141 0.0388 -0.0597 -0.0203 0.0444 0.0253 -0.0316 0.0352 -0.0318 -0.0473 0.0347 -0.0250 0.0289 0.0426 0.0218 -0.0254 0.0486 -0.0252 -0.0904 0.1607 -0.0379 0.0231 -0.0988 -0.1213 -0.0926 -0.1116 0.0345 -0.1856 -0.0409 0.0306 -0.0653 -0.0377 -0.0301 0.0361 0.1212 0.0105 -0.0354 0.0552 0.0363 -0.0427 0.0555 -0.0031 -0.0830 -0.0325 0.0415 -0.0461 -0.0615 -0.0412 0.0060 0.1680 -0.1347 0.0271 -0.0438 0.0364 0.0121 0.0018 -0.0138 -0.0625 -0.0161 -0.0009 -0.0373 -0.1009 -0.0583 0.0038 0.0109 -0.0068 0.0319 -0.0043 -0.0412 -0.0506 -0.0674 0.0426 -0.0031 0.0788 0.0924 0.0559 0.0449 0.1364 0.1132 -0.0378 0.1060 0.0130 0.0349 0.0638 0.1020 0.0459 0.0634 -0.0870 0.0447 -0.0124 0.0167 -0.0603 0.0297 -0.0298 0.0691 -0.0280 0.0749 0.0474 0.0275 0.0255 0.0184 0.0085 0.1116 0.0233 0.0176 0.0327 0.0471 0.0662 -0.0353 -0.0387 -0.0336 -0.0354 -0.0348 0.0157 -0.0294 0.0710 0.0299 -0.0602 0.0732 -0.0344 0.0419 0.0773 0.0119 -0.0550 0.0377 0.0808 -0.0424 -0.0977 -0.0386 -0.0334 -0.0384 -0.0520 0.0641 0.0049 0.1226 -0.0011 -0.0131 0.0224 0.0138 -0.0243 0.0544 -0.0164 0.1194 0.0916 -0.0755 0.0565 0.0235 -0.0009 -0.0818 0.0953 0.0873 -0.0215 0.0240 -0.0271 0.0134 -0.0870 0.0597 -0.0073 -0.0230 -0.0220 0.0562 -0.0069 -0.0796 -0.0118 0.0059 0.0221 0.0509 0.1175 0.0508 -0.0044 -0.0265 0.0328 -0.0525 0.0493 -0.1309 -0.0674 0.0148 -0.0024 -0.0163 -0.0241 0.0726 -0.0165 0.0368 -0.0914 0.0197 0.0018 -0.0149 0.0654 0.0912 -0.0638 -0.0135 -0.0277 -0.0078 0.0092 -0.0477 0.0054 -0.0153 -0.0411 -0.0177 0.0874 0.0221 0.1040 0.1004 0.0595 -0.0610 0.0650 -0.0235 0.0257 0.1208 0.0129 -0.0086 -0.0846 0.1102 -0.0338 -0.0553 0.0166 -0.0602 0.0128 0.0792 -0.0181 0.0046 -0.0548 -0.0394 -0.0546 0.0425 0.0048 -0.1172 -0.0925 -0.0357 -0.0123 0.0371 -0.0142 0.0157 0.0442 0.1186 0.0834 -0.0293 0.0313 -0.0287 0.0095 0.0080 0.0566 -0.0370 0.0257 0.1032 -0.0431 0.0544 0.0323 -0.1076 -0.0187 0.0407 -0.0198 -0.0255 -0.0505 0.0827 -0.0650 0.0176
    

    Thus every one of the 2,519,370 word-tokens has a 300-dimensional vector; a quick way to check this yourself is sketched below.

    If this isn't what you're seeing, you should explain further. If this is what you're seeing and you were expecting something else, you should explain further what you were expecting.
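
    Here is a minimal sketch of how such a file can be loaded and inspected. It assumes the standard fastText .vec text format (a first header line with the vocabulary size and dimensionality, followed by one line per word); the file name and the limit parameter are just for illustration.

    import numpy as np

    def load_aligned_vectors(path, limit=None):
        """Load a fastText .vec text file: header line, then one word plus its values per line."""
        vectors = {}
        with open(path, encoding="utf-8") as f:
            # Header line, e.g. "2519370 300" -> vocabulary size and vector dimensionality
            vocab_size, dim = map(int, f.readline().split())
            for i, line in enumerate(f):
                if limit is not None and i >= limit:
                    break
                parts = line.rstrip().split(" ")
                # First token is the word, the rest are its vector components
                vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
        return vectors, vocab_size, dim

    # Hypothetical usage:
    vecs, vocab_size, dim = load_aligned_vectors("wiki.en.align.vec", limit=10000)
    print(vocab_size, dim)        # expected: 2519370 300
    print(vecs["the"].shape)      # expected: (300,) -- one 300-dimensional vector for "the"

    The same format should also load with gensim's KeyedVectors.load_word2vec_format(path, binary=False), which expects exactly this layout.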