python line readlines writefile

Check whether the previous line has the same string as the current line and average the values of another column


I'm trying to write a script that reads the following file:

chr1,700244,714068,LOC100288069,982
chr1,1568158,1570027,MMP23A,784
chr1,1567559,1570030,MMP23A,784
chr1,1849028,1850740,TMEM52,799
chr1,2281852,2284100,LOC100129534,934
chr1,2281852,2284100,LOC100129534,800
chr1,2460183,2461684,HES5,819
chr1,2460183,2461684,HES5,850
chr1,2517898,2522908,FAM213B,834
chr1,2518188,2522908,FAM213B,834
chr1,2518188,2522908,FAM213B,834
chr1,2518188,2522908,FAM213B,834
chr1,2517898,2522908,FAM213B,834

If the gene name in column 3 (counting from 0) repeats on consecutive lines, the script should sum the values in column 4 and output the mean of those values. The output should be:

chr1,700244,714068,LOC100288069,982
chr1,1568158,1570027,MMP23A,784
chr1,1849028,1850740,TMEM52,799
chr1,2281852,2284100,LOC100129534,867
chr1,2460183,2461684,HES5,834.5
chr1,2517898,2522908,FAM213B,834

I tried this script, but it's not working. Could anyone give me a tip?

f1 = open('path', 'r')

reader1 = f1.read()

f3 = open('path/B_Media.txt','wb')

for line1 in f1:

    coluna = line1.split(',')
    chr = coluna[0]
    start = coluna[1]
    end = coluna[2]
    gene = coluna[3]
    valor_B = coluna[4]
    previous_line = current_line
    current_line = line
    gene2 = previous_line[3]
    soma_B2 = previous_line[4]
    soma_de_B = int(valor_B)+int(soma_B2)
    if gene == gene2:
            x += 1
            media_gene = soma_de_B/x
            output = chr + "," + start + "," + end + "," + gene + "," +valor_B+","+media_gene
            f3.write(output)
            f3.flush()
            print output

Solution

  • Because you need to look ahead to the next line to know whether a run of identical genes has ended, I would split the reading and the writing into two separate steps.

    Also, the csv module might come in handy: it takes care of special cases (such as commas inside a field) and makes reading and writing really easy. It is generally good practice to open files using a with statement, because the file is then closed automatically.
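    To illustrate the point about special cases, here is a minimal sketch written for Python 3 (an assumption, since the question uses Python 2 syntax). The sample text and the gene name containing a comma are made up for illustration and are not part of the question's data. It compares a plain split(',') with csv.reader on a quoted field:

```python
import csv
import io

# Hypothetical sample: the second field contains a comma inside quotes.
sample = 'chr1,"GENE,A",784\nchr1,GENEB,799\n'

# Naive splitting breaks the quoted field into two pieces.
naive = sample.splitlines()[0].split(',')

# csv.reader understands the quoting and keeps the field whole.
with io.StringIO(sample) as handle:
    rows = list(csv.reader(handle))

print(naive)    # ['chr1', '"GENE', 'A"', '784']
print(rows[0])  # ['chr1', 'GENE,A', '784']
```

    The with block closes the in-memory handle when it ends, exactly as it would for a file opened with open().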

    Now for some code :-)

    from __future__ import division
    import csv
    
    gene = 3
    valor_B = 4
    
    data = []
    with open('data.csv', 'r') as readfile:
        reader = csv.reader(readfile)
        for row in reader:
            data.append(row)
    
    values_to_add = []
    with open('B_Media.txt','wb') as writefile:
        writer = csv.writer(writefile)
    
        for i in range(len(data)):
            values_to_add.append(int(data[i][valor_B]))
            # if this is the last row, or the next row has a different gene, write it
            if i == len(data)-1 or data[i][gene] != data[i+1][gene]:
                data[i][valor_B] = sum(values_to_add)/len(values_to_add)
                writer.writerow(data[i])
                values_to_add = []
    

    Basically it first reads everything from the input file and puts it into data. Then, with the output file, it goes through every line, doing the following:

    • Add the value from column 4, which we will eventually write (maybe not in this pass, but eventually), to a list of values to average.
    • If the current line is the last line of the file, or its gene differs from the gene on the next line (we need to catch the last line too!), write to output. To do this, we take the mean of the values collected so far (at least 1, maybe 2 or more). We calculate the mean with sum()/len(), replace the corresponding column with the new value, write the row to the output file, and then empty the list.
    • If this is not the case, do nothing! The value from column 4 is already added to the list in the first step, so we can just go one step ahead to the next row.

    Result:

    chr1,700244,714068,LOC100288069,982.0
    chr1,1567559,1570030,MMP23A,784.0
    chr1,1849028,1850740,TMEM52,799.0
    chr1,2281852,2284100,LOC100129534,867.0
    chr1,2460183,2461684,HES5,834.5
    chr1,2517898,2522908,FAM213B,834.0
    

    (You might have noticed the from __future__ import division statement. In Python 2 it makes / perform true division, so we can get non-integer values like 834.5; in Python 3 this is already the default behaviour.)
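    For readers on Python 3, the same idea can be sketched with itertools.groupby, which groups consecutive rows that share a key. This is an alternative formulation, not the code from the answer above: the helper name average_consecutive and the in-memory sample are assumptions made for illustration. Like the answer, it keeps the last row of each run and replaces its value with the mean of the run.

```python
import csv
import io
from itertools import groupby
from operator import itemgetter

GENE, VALOR_B = 3, 4  # zero-based column indexes, as in the answer above


def average_consecutive(rows):
    """Yield one row per run of consecutive rows sharing the same gene.

    The last row of each run is kept and its value column is replaced
    by the mean of the run (hypothetical helper, for illustration only).
    """
    for _, group in groupby(rows, key=itemgetter(GENE)):
        group = list(group)
        values = [int(row[VALOR_B]) for row in group]
        merged = list(group[-1])
        merged[VALOR_B] = sum(values) / len(values)  # true division in Python 3
        yield merged


# A few rows from the question, used here as in-memory sample input.
sample = (
    "chr1,1849028,1850740,TMEM52,799\n"
    "chr1,2281852,2284100,LOC100129534,934\n"
    "chr1,2281852,2284100,LOC100129534,800\n"
    "chr1,2460183,2461684,HES5,819\n"
    "chr1,2460183,2461684,HES5,850\n"
)

out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(
    average_consecutive(csv.reader(io.StringIO(sample)))
)
result = out.getvalue()
print(result)
```

    With real files in Python 3, the csv documentation recommends opening them in text mode with newline='' (for example open('B_Media.txt', 'w', newline='')) rather than the 'wb' mode used in Python 2.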