I was hoping someone may be able to see where I am failing here. So I have scraped some data from buzzfeed and now I am trying to format a text file with which I can then send into data_convert_examples text_to_data formatter.
I thought I had the answer a couple times, but I am still running up against a brick wall when I process this as binary and then try to train against the data.
What I did was run the binary_to_text on the toy dataset and then opened the file in notepad++ under windows, showing all characters, and matched what I believed to be the format.
I appologize for the long function below, but I really am unsure as to where the issue might be and figured this was the best way to provide enough info. Anyone have any ideas or recommendations?
def processPath(self, toPath):
try:
fout = open(os.path.join(toPath, '{}-{}'.format(self.baseName, self.fileNdx)), 'a+')
for path, dirs, files in os.walk(self.fromPath):
for fn in files:
fullpath = os.path.join(path, fn)
if os.path.isfile(fullpath):
#with open(fullpath, "rb") as f:
with codecs.open(fullpath, "rb", 'ascii', "ignore") as f:
try:
finalRes = ""
content = f.readlines()
self.populateVocab(content)
sentences = sent_tokenize((content[1]).encode('ascii', "ignore").strip('\n'))
for sent in sentences:
textSumFmt = self.textsumFmt
finalRes = textSumFmt["artPref"] + textSumFmt["sentPref"] + sent.replace("=", "equals") + textSumFmt["sentPost"] + textSumFmt["postVal"]
finalRes += (('\t' + textSumFmt["absPref"] + textSumFmt["sentPref"] + (content[0]).strip('\n').replace("=", "equals") + textSumFmt["sentPost"] + textSumFmt["postVal"]) + '\t' +'publisher=BUZZ' + os.linesep)
if self.lineNdx != 0 and self.lineNdx % self.lines == 0:
fout.close()
self.fileNdx+=1
fout = open(os.path.join(toPath, '{}-{}'.format(self.baseName, self.fileNdx)), 'a+')
fout.write( ("{}").format( finalRes.encode('utf-8', "ignore") ) )
self.lineNdx+=1
except RuntimeError as e:
print "Runtime Error: {0} : {1}".format(e.errno, e.strerror)
finally:
fout.close()
After further analysis, it seems that the source of the problem is more with the source data and the way it is constructed rather than data_convert_example.py itself. I'm closing this as the heading is not in-line with the source of the issue.
I found the source of my problem was that I had a space between "Article" and the equals sign. After removing that I was able to successfully train.