I trained a fastText (or Sen2vec, or word2vec) model on my news collection, which is stored in a CSV file where each news item occupies one line, like this:
0 Trump is a liar.....
1 Europa going for brexit.....
2 Russia is no more world power......
So I have a trained model, and now I can happily get a vector for any line in my CSV file, like this (fastText):
import csv
import re

# Open the output files in a with-block so they are flushed and closed before training
with open('tweets.train3', 'w') as train, \
     open('tweets.valid3', 'w') as test, \
     open(r'C:\Users\123\Desktop\data\osn-9.csv', mode='r',
          encoding='utf-8', errors='ignore') as csv_file:
    csv_reader = csv.DictReader(csv_file, fieldnames=['sen', 'text'])
    line = 0
    for row in csv_reader:
        # Clean the training data
        # First we lower case the text
        text = row["text"].lower()
        # Remove links
        text = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', '', text)
        # Remove usernames
        text = re.sub(r'@[^\s]+', '', text)
        # Replace punctuation with spaces
        text = ' '.join(re.sub(r"[\.\,\!\?\:\*\(\)\;\-\=]", " ", text).split())
        # Replace hashtags by just words
        text = re.sub(r'#([^\s]+)', r'\1', text)
        # Collapse multiple white spaces into a single white space
        text = re.sub(r'[\s]+', ' ', text)
        # Additional clean up: remove words of 3 characters or fewer,
        # and remove spaces at the beginning and the end
        text = re.sub(r'\W*\b\w{1,3}\b', '', text)
        text = text.strip()
        line = line + 1
        # Split data into train and validation
        if line > 8416:
            print(f'__label__{row["sen"]} {text}', file=test)
        else:
            print(f'__label__{row["sen"]} {text}', file=train)

import fasttext
hyper_params = {"lr": 0.1,
                "epoch": 500,
                "wordNgrams": 2,
                "dim": 100,
                "loss": "softmax"}
model = fasttext.train_supervised(input='tweets.train3',**hyper_params)
model.get_sentence_vector('Trump is a liar.....')
array([-0.20266785, 0.3407566 , ..., 0.03044436, 0.39055538], dtype=float32)
or like this (gensim):
In [10]:
model.infer_vector(['Trump', 'is', 'a ', 'liar'])
Out[10]:
array([ 0.24116205, 0.07339828, -0.27019867, -0.19452883, 0.126193 ,
........................,
0.09754166, 0.12638392, -0.09281237, -0.04791372, 0.15747668],
dtype=float32)
But how can I get the vectors for every line in my CSV file as plain values rather than as separate arrays? Like this:
0 Trump is a liar..... -0.20266785, 0.3407566 , ..., 0.03044436,
1 Europa going for brexit..... 0.24116205, 0.07339828,.... -0.27019867
2 Russia is no more world power...... 0.12638392, -0.09281237 ...-0.04791372,
Or like this:
0 -0.20266785, 0.3407566 , ..., 0.03044436,
1 0.24116205, 0.07339828,.... -0.27019867
2 0.12638392, -0.09281237...-0.0479137
Python's csv library will get you started. The examples in its documentation are very straightforward: all you should have to do is pass each of your lists to the writer as a row and make sure the writer has the correct settings (delimiter, newline handling).
Loose example:
import csv

# This should be a list of all the lists that
# you would like to write into the csv
master_list = []

with open('mycsv.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=',')
    for item in master_list:
        writer.writerow(item)

This should at least get you started. I tested it lightly and it worked for me.
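
To tie this back to the question, here is a rough sketch of how the rows could be built from the model output instead of from a hand-made list. It assumes the input CSV has two comma-separated columns (an id and the text), as the `DictReader` call in the question suggests. The names `write_vectors`, `embed`, and `keep_text` are made up for illustration; `embed` is any callable that takes a string and returns a sequence of numbers.

```python
import csv


def write_vectors(in_path, out_path, embed, keep_text=True):
    """Write one CSV row per input line: id, optional text, then vector components.

    `embed` is a hypothetical stand-in for whatever turns a string into a vector
    (a list or a NumPy array of numbers).
    """
    with open(in_path, newline='', encoding='utf-8') as src, \
         open(out_path, 'w', newline='', encoding='utf-8') as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst, delimiter=',')
        for row in reader:
            if len(row) < 2:
                continue  # skip blank or malformed lines
            row_id, text = row[0], row[1]
            # Convert every component to a plain Python float so each value
            # lands in its own CSV cell instead of being printed as "array([...])"
            vector = [float(x) for x in embed(text)]
            prefix = [row_id, text] if keep_text else [row_id]
            writer.writerow(prefix + vector)
```

With the fastText model from the question, the call would presumably look like `write_vectors('osn-9.csv', 'vectors.csv', model.get_sentence_vector)`; with gensim, something like `lambda t: model.infer_vector(t.split())` could be passed as `embed`. Passing `keep_text=False` gives the second layout from the question (id followed by the numbers only). I have not run this against your data, so treat it as a starting point.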