
How to modify a tsv-file column with Python


I have a GFF3 file (essentially a TSV file with 9 columns) and I'm trying to change the first column and write the modified content back to the same file.

The GFF3 file looks like this:

## GFF3 file
## replicon1
## replicon2
replicon_1  prokka  gene    0   15  .   @   .   ID=some_gene_1;
replicon_1  prokka  gene    40  61  .   @   .   ID=some_gene_1;
replicon_2  prokka  gene    8   32  .   @   .   ID=some_gene_2;
replicon_2  prokka  gene    70  98  .   @   .   ID=some_gene_2;

I wrote a few lines of code that take the symbol to replace (e.g. "_") and the symbol to insert in its place (e.g. "@"):

import os
import re
import argparse
import pandas as pd

def myfunc() -> argparse.Namespace:
    ap = argparse.ArgumentParser()
    ap.add_argument("-f", "--file", help="path to file")
    ap.add_argument("-i", "--input_word", help="Symbol to delete")
    ap.add_argument("-o", "--output_word", help="Symbol to insert")
    return ap.parse_args()
args = myfunc()
my_file = args.file
in_char = args.input_word
out_char = args.output_word

with open (my_file, 'r+') as f:
    rawfl = f.read()
    rawfl = re.sub(in_char, out_char, rawfl)
    f.seek(0)
    f.write(rawfl)
    f.close()

The output is something like this:

## GFF3 file
## replicon1
## replicon2
replicon@1  prokka  gene    0   15  .   @   .   ID=some@gene@1;
replicon@1  prokka  gene    40  61  .   @   .   ID=some@gene@1;
replicon@2  prokka  gene    8   32  .   @   .   ID=some@gene@2;
replicon@2  prokka  gene    70  98  .   @   .   ID=some@gene@2;

As you can see, every "_" has been changed to "@". I tried to modify the script using pandas in order to apply the modification only to the first column (seqid, see below):

with open (my_file, 'r+') as f:
    genomic_dataframe = pd.read_csv(f, sep="\t", names=['seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes'])
    id = genomic_dataframe.seqid
    id = str(id) #this is used because re.sub expects strings, not dataframe
    id = re.sub(in_char, out_char, id)
    f.seek(0)
    f.write(id)
f.close()

I do not get the expected result. Instead, the seqid column (correctly modified) is added to the file rather than replacing the original column.

What I'd like to obtain is something like this:

## GFF3 file
## replicon1
## replicon2
replicon@1  prokka  gene    0   15  .   @   .   ID=some_gene_1;
replicon@1  prokka  gene    40  61  .   @   .   ID=some_gene_1;
replicon@2  prokka  gene    8   32  .   @   .   ID=some_gene_2;
replicon@2  prokka  gene    70  98  .   @   .   ID=some_gene_2;

That is, the "@" symbol appears only in the first column, while the "_" is kept in the 9th column.

Do you know how to fix this? Thank you all.


Solution

  • If you only want to replace the first occurrence of _ with @ on each line, you can do it without loading the file as a dataframe and without any third-party library such as pandas.

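    # 'f' stands for the path to your GFF3 file
    with open('f') as f:
        lines = [line.rstrip('\n') for line in f]
    
    for i, line in enumerate(lines):
        # Ignore comments (and blank lines)
        if not line or line[0] == '#':
            continue
        # Store the result back in the list; rebinding `line` alone changes nothing
        lines[i] = line.replace('_', '@', 1)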
    

    After the loop, lines contains

    ## GFF3 file
    ## replicon1
    ## replicon2
    replicon@1  prokka  gene    0   15  .   @   .   ID=some_gene_1;
    replicon@1  prokka  gene    40  61  .   @   .   ID=some_gene_1;
    replicon@2  prokka  gene    8   32  .   @   .   ID=some_gene_2;
    replicon@2  prokka  gene    70  98  .   @   .   ID=some_gene_2;
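
Replacing only the first "_" per line relies on the seqid containing exactly one underscore. If a seqid can contain several, or none, a safer variant is to split off the first tab-separated field and replace only inside it. The sketch below also writes the result back to the file, which the snippet above does not do. The function names replace_in_first_column and rewrite_file are made up for illustration, and it assumes the columns are separated by real tab characters. It uses str.replace rather than re.sub, so the symbol is treated literally and not as a regular expression.

```python
def replace_in_first_column(lines, old, new):
    """Return a new list of lines with `old` replaced by `new` in column 1 only."""
    result = []
    for line in lines:
        # Comment/directive lines and blank lines are passed through unchanged
        if not line.strip() or line.startswith("#"):
            result.append(line)
            continue
        # Split off the first tab-separated field; the rest of the line is untouched
        seqid, sep, rest = line.partition("\t")
        result.append(seqid.replace(old, new) + sep + rest)
    return result


def rewrite_file(path, old, new):
    """Read `path`, fix the first column, and overwrite the file."""
    with open(path) as f:
        lines = f.read().splitlines(keepends=True)
    # Opening with "w" truncates the file, so no stale content is left behind
    with open(path, "w") as f:
        f.writelines(replace_in_first_column(lines, old, new))
```

With the argparse setup from the question, this would be called as rewrite_file(my_file, in_char, out_char). Reading everything first and then reopening the file in "w" mode avoids the seek(0) approach, which leaves old bytes at the end of the file whenever the new content is shorter than the old.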