
How to get doc2vec or sen2vec trained vectors in readable (csv or txt) format linewise?


I trained a fastText, Sen2vec, or word2vec model on my news collection, stored in a CSV file where each news item is on one line, like this:

0 Trump is a liar.....
1 Europa going for brexit.....
2 Russia is no more world power......

I now have a trained model and can get a vector for any line in my CSV file, like this (fastText):

import csv
import re

import fasttext

# Write cleaned, labelled lines to the fastText training and validation files.
# The with-block closes (and flushes) both files before training starts.
with open(r'C:\Users\123\Desktop\data\osn-9.csv', mode='r', encoding='utf-8',
          errors='ignore') as csv_file, \
        open('tweets.train3', 'w') as train, \
        open('tweets.valid3', 'w') as test:
    csv_reader = csv.DictReader(csv_file, fieldnames=['sen', 'text'])
    line = 0
    for row in csv_reader:
        # Clean the training data
        # First we lower case the text
        text = row["text"].lower()
        # Remove links
        text = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', '', text)
        # Remove usernames
        text = re.sub(r'@[^\s]+', '', text)
        # Replace punctuation with spaces
        text = ' '.join(re.sub(r"[\.\,\!\?\:\*\(\)\;\-\=]", " ", text).split())
        # Replace hashtags by just words
        text = re.sub(r'#([^\s]+)', r'\1', text)
        # Collapse multiple white spaces to a single white space
        text = re.sub(r'[\s]+', ' ', text)
        # Additional clean up: remove words of 3 chars or fewer, and remove
        # space at the beginning and the end
        text = re.sub(r'\W*\b\w{1,3}\b', '', text)
        text = text.strip()
        line = line + 1
        # Split data into train and validation
        if line > 8416:
            print(f'__label__{row["sen"]} {text}', file=test)
        else:
            print(f'__label__{row["sen"]} {text}', file=train)

hyper_params = {"lr": 0.1,
                "epoch": 500,
                "wordNgrams": 2,
                "dim": 100,
                "loss": "softmax"}

model = fasttext.train_supervised(input='tweets.train3', **hyper_params)
model.get_sentence_vector('Trump is a liar.....')
array([-0.20266785,  0.3407566 ,  ...,  0.03044436,  0.39055538],
      dtype=float32)

or like this (gensim):

In [10]:
model.infer_vector(['Trump', 'is', 'a', 'liar'])
Out[10]:
array([ 0.24116205,  0.07339828, -0.27019867, -0.19452883,  0.126193  ,
 ........................,
    0.09754166,  0.12638392, -0.09281237, -0.04791372,  0.15747668],
  dtype=float32)

But how can I get the vector for each line of my CSV file written out as plain numbers rather than as an array? Like this:

0  Trump is a liar..... -0.20266785,  0.3407566 ,  ...,  0.03044436,  
1  Europa going for brexit..... 0.24116205,  0.07339828,.... -0.27019867
2  Russia is no more world power...... 0.12638392, -0.09281237 ...-0.04791372,

Or like this:

0   -0.20266785,  0.3407566 ,  ...,  0.03044436,  
1   0.24116205,  0.07339828,.... -0.27019867
2   0.12638392, -0.09281237...-0.0479137

Solution

  • Python's built-in csv module will get you started. Its examples are very straightforward: all you should have to do is pass each of your lists to the writer as a row and make sure the writer has the correct settings.

    Loose example:

    import csv 
    
    #This should be a list of all the lists that
    #you would like to write into the csv
    master_list = []
    
    with open('mycsv.csv', 'w', newline='') as csvfile:
        writer = csv.writer(csvfile, delimiter=',')
        for item in master_list:
            writer.writerow(item)
    

    This should at least get you started. It worked for me in light testing.
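To tie this back to the output asked for in the question, here is a minimal sketch that writes one row per line: the index, optionally the text, then the vector components as plain numbers. The names `write_vectors`, `toy_vectorize`, `include_text` and the output file `vectors.csv` are made up for illustration, and `toy_vectorize` is only a stand-in so the sketch runs without a trained model.

```python
import csv


def write_vectors(texts, vectorize, out_path, include_text=True):
    """Write one CSV row per text: index, optional text, then vector components."""
    with open(out_path, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile, delimiter=',')
        for index, text in enumerate(texts):
            # Format each component as plain text so no array(...) wrapper
            # or dtype suffix ends up in the file.
            components = [format(float(x), '.8g') for x in vectorize(text)]
            row = [index, text] if include_text else [index]
            writer.writerow(row + components)


# Hypothetical stand-in for a trained model, used only so the sketch runs
# without fastText or gensim installed.
def toy_vectorize(text):
    return [len(text) / 100.0, text.count(' ') / 10.0]


news = ['Trump is a liar.....', 'Europa going for brexit.....']
write_vectors(news, toy_vectorize, 'vectors.csv')
```

With the models from the question, you would presumably pass `model.get_sentence_vector` (fastText) or something like `lambda t: model.infer_vector(t.split())` (gensim) as `vectorize`, and set `include_text=False` to get the second layout with only the index and the numbers.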