Dividing elements in a list to normalize data in Python


I am trying to write a Python script that normalizes a dataset by dividing every value by the maximum value.

This is the script that I have come up with so far:

#!/usr/bin/python

with open("infile") as f:
    cols = [float(row.split("\t")[2]) for row in f.readlines()]
    maxVal = max(cols)
    #print maxVal

    data = []
    with open('infile') as f2:
        for line in f2:                  
            items = line.split() # parse the columns
            tClass, feats, values = items[:3] # parse the columns
            #print items      
            normalizedData = float(values)/float(maxVal)
            #print normalizedData

            with open('outfile', 'wb') as f3:
                output = "\t".join([tClass +"\t"+ feats, str(normalizedData)])
                f3.write(output + "\n")

The goal is to take a tab-separated input file with three columns, such as:

lfr about-kind-of+n+n-the-info-n    3.743562
lfr about+n-a-j+n-a-dream-n 2.544614
lfr about+n-a-j+n-a-film-n  1.290925
lfr about+n-a-j+n-a-j-series-n  2.134124
  1. Look for the maxVal in the third column: in this case it would be 3.743562
  2. Divide all values in the 3rd column by maxVal
  3. Output the following desired results:
lfr   about-kind-of+n+n-the-info-n    1
lfr   about+n-a-j+n-a-dream-n 0.67973
lfr   about+n-a-j+n-a-film-n  0.34483
lfr   about+n-a-j+n-a-j-series-n  0.57007

However, the output currently contains only a single value, which I assume is the first value in the input data divided by the max. Any insight into what is going wrong in my code? Why does the output contain only one line, and how can I fix it? Thank you in advance.


Solution

  • As far as I understand your intent, the following does the job (minor program-flow corrections). In your version, 'outfile' is reopened in write mode on every pass through the loop, which truncates it each time, so only the last line written survives.

    Originally, instead of writing to the file on every iteration, I stored the lines to write and dumped them all to the output file at the end.

    Update: building the list turned out to take about as long as the redundant with statements, so I dropped it completely. The code now writes to the file line by line, without closing and reopening it every time.

    # First pass: find the largest value in the third column
    with open("in.txt") as f:
        maxVal = max(float(row.split()[2]) for row in f)

    # Second pass: open the output once, outside the loop,
    # so it is not truncated on every line
    with open("in.txt") as f2, open("out.txt", "w") as f3:
        for line in f2:
            tClass, feats, values = line.split()[:3]  # parse the columns
            normalizedData = float(values) / maxVal
            # join the three columns with tabs and end the line
            f3.write("\t".join([tClass, feats, str(normalizedData)]) + "\n")
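
If the input is small enough to hold in memory, the file can also be read just once. Here is a rough sketch of that variant, not a definitive implementation. The helper name `normalize_file` and the temporary file paths are made up for illustration, and it assumes every non-empty line has at least three whitespace-separated columns with a numeric third column.

```python
import os
import tempfile


def normalize_file(in_path, out_path):
    """Divide each third-column value by the column maximum (hypothetical helper)."""
    # Read the file once; assumes it is small enough to hold in memory
    # and that every non-empty line has at least three columns.
    with open(in_path) as src:
        rows = [line.split()[:3] for line in src if line.strip()]

    max_val = max(float(value) for _, _, value in rows)

    # Open the output once, before the loop, so no line is overwritten.
    with open(out_path, "w") as dst:
        for t_class, feats, value in rows:
            normalized = float(value) / max_val
            dst.write("\t".join([t_class, feats, str(normalized)]) + "\n")
    return max_val


# Demo on the sample rows from the question, written to a temporary directory.
sample = (
    "lfr\tabout-kind-of+n+n-the-info-n\t3.743562\n"
    "lfr\tabout+n-a-j+n-a-dream-n\t2.544614\n"
    "lfr\tabout+n-a-j+n-a-film-n\t1.290925\n"
    "lfr\tabout+n-a-j+n-a-j-series-n\t2.134124\n"
)
with tempfile.TemporaryDirectory() as tmp:
    in_path = os.path.join(tmp, "in.txt")
    out_path = os.path.join(tmp, "out.txt")
    with open(in_path, "w") as f:
        f.write(sample)
    max_val = normalize_file(in_path, out_path)
    with open(out_path) as f:
        result = f.read().splitlines()

print(max_val)
for line in result:
    print(line)
```

The values are written at full float precision. If you want shorter numbers in the output, you could format them yourself, for example with `format(normalized, ".5f")`, but note that this rounds rather than truncates.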