I am trying to write a Python script that normalizes a dataset by dividing every value by the maximum value.
This is the script that I have come up with so far:
#!/usr/bin/python
with open("infile") as f:
    cols = [float(row.split("\t")[2]) for row in f.readlines()]
maxVal = max(cols)
#print maxVal
data = []
with open('infile') as f2:
    for line in f2:
        items = line.split() # parse the columns
        tClass, feats, values = items[:3] # parse the columns
        #print items
        normalizedData = float(values)/float(maxVal)
        #print normalizedData
        with open('outfile', 'wb') as f3:
            output = "\t".join([tClass +"\t"+ feats, str(normalizedData)])
            f3.write(output + "\n")
The goal is to take a tab-separated input file with three columns, such as:
lfr about-kind-of+n+n-the-info-n 3.743562
lfr about+n-a-j+n-a-dream-n 2.544614
lfr about+n-a-j+n-a-film-n 1.290925
lfr about+n-a-j+n-a-j-series-n 2.134124
and produce this output, with each value in the third column divided by the column maximum:
lfr about-kind-of+n+n-the-info-n 1
lfr about+n-a-j+n-a-dream-n 0.67973
lfr about+n-a-j+n-a-film-n 0.34483
lfr about+n-a-j+n-a-j-series-n 0.57007
However, the output file currently contains only a single line, which I assume is the first value in the input data divided by the max. What is going wrong in my code, and why is only one line written? Any suggestions for a fix? Thank you in advance.
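As far as I understand your intent, the following does the job (with minor program-flow corrections). The original script reopened the output file in write mode for every line, which truncates the file each time, so only one line survived.
Initially, instead of writing to the file on every iteration, I chose to store the lines to write and then dump everything to the output file at the end.
Update: it turns out that creating the list takes about as long as the excess use of the with statement, so I got rid of the list completely. The code now opens the output file once and writes to it continuously, without closing it every time.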
with open("in.txt") as f:
    cols = [float(row.split()[2]) for row in f.readlines()]
maxVal = max(cols)
#print maxVal
# open the output file once, before the loop, so it is not truncated on each line
f3 = open('out.txt', 'w')
with open('in.txt') as f2:
    for line in f2:
        items = line.split() # parse the columns
        tClass, feats, values = items[:3]
        #print items
        normalizedData = float(values)/float(maxVal)
        #print normalizedData
        # join the three columns with tabs and append the newline separately,
        # so that no trailing tab ends up before the line break
        f3.write("\t".join([tClass, feats, str(normalizedData)]) + "\n")
f3.close()
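For comparison, here is a minimal sketch of the same idea that reads the input only once and opens the output file a single time. The function names normalize_lines and normalize_file are hypothetical, chosen just for this illustration. It assumes that every non-blank line has at least three whitespace-separated columns, that the third column is numeric, and that the maximum is non-zero.

```python
def normalize_lines(lines):
    """Return output lines with the third column divided by its maximum.

    Hypothetical helper for illustration; assumes each non-blank line has
    at least three whitespace-separated columns and a numeric third column.
    """
    # Split every non-blank line into its columns once.
    rows = [line.split() for line in lines if line.strip()]
    if not rows:
        return []
    # Raises ZeroDivisionError below if the maximum is 0 (assumed not to happen).
    max_val = max(float(row[2]) for row in rows)
    return [
        "\t".join([row[0], row[1], str(float(row[2]) / max_val)]) + "\n"
        for row in rows
    ]


def normalize_file(in_path, out_path):
    """Read in_path, normalize the third column, and write the result to out_path."""
    with open(in_path) as src:
        lines = src.readlines()
    # The output file is opened exactly once, so nothing gets truncated mid-run.
    with open(out_path, "w") as dst:
        dst.writelines(normalize_lines(lines))
```

With the file names from the answer above, this would be called as normalize_file("in.txt", "out.txt"). Keeping the computation in a separate function that works on a list of lines makes it easy to check the normalization without touching any files.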