I am new to Gensim's Word2Vec. I am trying to use Word2Vec to build word vectors for some raw HTML files, so I first converted the HTML files into text files.
Training the word2vec model works fine, but when I try to test the accuracy of the model with
model.accuracy(file_name)
it produces this error:
Traceback (most recent call last):
File "build_w2v.py", line 82, in <module>
main()
File "build_w2v.py", line 77, in main
gen_w2v_model()
File "build_w2v.py", line 71, in gen_w2v_model
accuracy = model.accuracy(target)
File "/home/k/shankai/app/anaconda2/lib/python2.7/site-packages/gensim/models/word2vec.py", line 1330, in accuracy
return self.wv.accuracy(questions, restrict_vocab, most_similar, case_insensitive)
File "/home/k/shankai/app/anaconda2/lib/python2.7/site-packages/gensim/models/keyedvectors.py", line 679, in accuracy
raise ValueError("missing section header before line #%i in %s" % (line_no, questions))
ValueError: missing section header before line #0
Below is a sample of the file:
zGR='ca-about-health_js';var ziRfw=0;zobt=" Vision Ads";zOBT=" Ads";function zIpSS(u){zpu(0,u,280,375,"ssWin")}function zIlb(l,t,f){zT(l,'18/1Pp/wX')}
zWASL=1;zGRH=1
#rs{margin:0 0 10px}#rs #n5{font-weight:bold}#rs a{padding:7px;text-transform:capitalize}Poking Eyelashes - Poking Eyelashes Problem
<!--
zGOW=0;xd=0;zap="";zAth='25752';zAthG='25752';zTt='11';zir='';zBTS=0;zBT=0;zSt='';zGz=''
ch='health';gs='vision';xg="Vision";zcs=''
zFDT='0'
zFST='0'
zOr='BA15WT26OkWA0O1b';zTbO=zRQO=1;zp0=zp1=zp2=zp3=zfs=0;zDc=1;
zSm=zSu=zhc=zpb=zgs=zdn='';zFS='BA110BA0110B00101';zFD='BA110BA0110B00101'
zDO=zis=1;zpid=zi=zRf=ztp=zpo=0;zdx=20;zfx=100;zJs=0;
zi=1;zz=';336280=2-1-1299;72890=2-1-1299;336155=2-1-12-1;93048=2-1-12-1;30050=2-1-12-1';zx='100';zde=15;zdp=1440;zds=1440;zfp=0;zfs=66;zfd=100;zdd=20;zaX=new Array(11, new Array(100,1051,8192,2,'336,300'),7, new Array(100,284,8196,12,'336,400'));zDc=1;;zDO=1;;zD336=1;zhc='';;zGTH=1;
zGo=0;zG=17;zTac=2;zDot=0;
zObT="Vision";zRad=5;var tp=" primedia_"+(zBT?"":"non_")+"site_targeting";if(!this.zGCID)zGCID=tp
else zGCID+=tp;
if(zBT>0){zOBR=1}
if(!this.uy)uy='about.com';if(typeof document.domain!="undefined")document.domain=uy;//-->
function zob(p){if(!this.zOfs)return;var a=zOfs,t,i=0,l=a.length;if(l){w('<div id="oF"><b>'+(this.zobt?zobt:xg+' Ads')+'</b><ul>');while((i<l)&&i<zRad){t=a[i++].line1;w('<li><a href="/z/js/o'+(p?p:'')+'.htm?k='+zUriS(t.toLowerCase())+(this.zobr?zobr:'')+'&d='+zUriS(t)+'&r='+zUriS(zWl)+'" target="_'+(this.zOBNW?'new'+zr(9999):'top')+'">'+t+'</a></li>');}w('</ul></div>')}}function rb600(){if(gEI('bb'))gEI('bb').height=600}zJs=10
zJs=11
zJs=12
zJs=13
zc(5,'jsc',zJs,9999999,'')
zDO=0
This file actually begins with many (I don't know how many) spaces or newlines; the above is how it looks when I open it in vim.
What is the problem here?
Also, I am doing text classification of some biomedical papers. The files I was given are all raw HTML files, in either Japanese or English. After ASCII conversion and some stop-word cleaning, a lot of HTML code is still left in the files.
When I try to clean these files by restricting the characters to [a-zA-Z0-9], I find that some medical terms like "4protein" do not get cleaned properly either.
Are there any suggestions on how to clean up these files?
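For reference, here is a minimal cleaning sketch (stdlib only; the function name and the sample string are illustrative, not from the question's actual pipeline). It strips script/style blocks and remaining tags before tokenizing, so alphanumeric terms such as "4protein" survive as whole tokens:

```python
import re

def clean_html(raw):
    """Strip script/style blocks, comments, and tags, then keep only
    tokens made of letters and digits (so '4protein' stays intact)."""
    # Remove <script>...</script> and <style>...</style> blocks wholesale,
    # since their contents are code, not prose.
    text = re.sub(r'(?is)<(script|style)\b.*?</\1>', ' ', raw)
    # Remove HTML comments and any remaining tags.
    text = re.sub(r'(?s)<!--.*?-->', ' ', text)
    text = re.sub(r'<[^>]+>', ' ', text)
    # Keep purely alphanumeric tokens.
    return re.findall(r'[A-Za-z0-9]+', text)

sample = '<html><script>zJs=10</script><p>4protein binds <b>receptor</b></p></html>'
print(clean_html(sample))  # ['4protein', 'binds', 'receptor']
```

The key design point is ordering: tag contents like the `zJs=10` JavaScript in the sample file must be removed before the `[A-Za-z0-9]+` pass, otherwise those identifiers leak into the token stream. For messier real-world HTML a proper parser (e.g. BeautifulSoup's `get_text()`) is more robust than regexes.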
The argument to accuracy() should be a set of analogies to test the model against, in the format of the questions-words.txt file available from the original word2vec.c distribution. It should not be your own training-corpus file, which is why gensim fails immediately with "missing section header before line #0".
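To make the expected format concrete, here is a hedged sketch (the section names and word pairs are taken from the standard questions-words.txt conventions; the parsing loop is only illustrative): each section starts with a line beginning with ": ", and each question line holds four words (a is to b as c is to d).

```python
# Minimal example of the questions-words.txt analogy format that
# accuracy() expects. gensim requires a ": section-name" header
# before any question line; a file of raw training text fails that
# check at once with "missing section header before line #0".
questions = (
    ": capital-common-countries\n"
    "Athens Greece Berlin Germany\n"
    "Paris France Rome Italy\n"
    ": family\n"
    "boy girl brother sister\n"
)

for line in questions.splitlines():
    if line.startswith(": "):
        print("section:", line[2:])
    else:
        print("question:", line.split())
```

Once such a file exists on disk, `model.accuracy("questions.txt")` works in the gensim version shown in your traceback; in newer gensim releases the method was replaced by `model.wv.evaluate_word_analogies("questions.txt")`.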