python gff

Renaming ID names in a GFF file


I have a GFF file that looks like this:

contig1 loci    gene    452050  453069  15  -   .   ID=dd_g4_1G94;
contig1 loci    mRNA    452050  453069  14  -   .   ID=dd_g4_1G94.1;Parent=dd_g4_1G94
contig1 loci    exon    452050  452543  .   -   .   ID=dd_g4_1G94.1.exon1;Parent=dd_g4_1G94.1
contig1 loci    exon    452592  453069  .   -   .   ID=dd_g4_1G94.1.exon2;Parent=dd_g4_1G94.1
contig1 loci    mRNA    452153  453069  15  -   .   ID=dd_g4_1G94.2;Parent=dd_g4_1G94
contig1 loci    exon    452153  452543  .   -   .   ID=dd_g4_1G94.2.exon1;Parent=dd_g4_1G94.2
contig1 loci    exon    452592  452691  .   -   .   ID=dd_g4_1G94.2.exon2;Parent=dd_g4_1G94.2
contig1 loci    exon    452729  453069  .   -   .   ID=dd_g4_1G94.2.exon3;Parent=dd_g4_1G94.2
### 

I want to renumber the IDs, starting from 0001, so that the entry for the gene above becomes:

contig1 loci    gene    452050  453069  15  -   .   ID=dd_0001;
contig1 loci    mRNA    452050  453069  14  -   .   ID=dd_0001.1;Parent=dd_0001
contig1 loci    exon    452050  452543  .   -   .   ID=dd_0001.1.exon1;Parent=dd_0001.1
contig1 loci    exon    452592  453069  .   -   .   ID=dd_0001.1.exon2;Parent=dd_0001.1
contig1 loci    mRNA    452153  453069  15  -   .   ID=dd_0001.2;Parent=dd_0001
contig1 loci    exon    452153  452543  .   -   .   ID=dd_0001.2.exon1;Parent=dd_0001.2
contig1 loci    exon    452592  452691  .   -   .   ID=dd_0001.2.exon2;Parent=dd_0001.2
contig1 loci    exon    452729  453069  .   -   .   ID=dd_0001.2.exon3;Parent=dd_0001.2 

The example above shows a single gene, but I want to rename all genes, together with their corresponding mRNA and exon entries, consecutively starting from ID=dd_0001. Any hints on how to do this would be much appreciated.


Solution

  • Open the file, then replace the ID line by line.
    See the Python documentation for file I/O and str.replace().

    gff_filename = 'filename.gff'
    replace_string = 'dd_g4_1G94'
    replace_with = 'dd_0001'
    
    lines = []
    with open(gff_filename, 'r') as gff_file:
        for line in gff_file:
            line = line.replace(replace_string, replace_with)
            lines.append(line)
    
    with open(gff_filename, 'w') as gff_file:
        gff_file.writelines(lines)
    

    Tested on Windows 10 with Python 3.5.1.

    To find and renumber every gene ID, rather than one hard-coded ID, use a regular expression:

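    import re
    
    gff_filename = 'filename.gff'
    replace_with = 'dd_{}'
    # Capture the gene-level part of each ID: everything after "ID="
    # up to the first ';' or '.'
    re_pattern = r'ID=(.*?)[;.]'
    
    ids = []
    lines = []
    with open(gff_filename, 'r') as gff_file:
        file_lines = gff_file.readlines()
    
    # First pass: collect the unique gene IDs in order of first appearance
    for line in file_lines:
        matches = re.findall(re_pattern, line)
        for found_id in matches:
            if found_id not in ids:
                ids.append(found_id)
    
    # Second pass: replace each old ID with its new number in every line
    for line in file_lines:
        for ID in ids:
            if ID in line:
                # +1 so that numbering starts at 0001 rather than 0000
                id_suffix = str(ids.index(ID) + 1).zfill(4)
                line = line.replace(ID, replace_with.format(id_suffix))
        lines.append(line)
    
    # Note: this overwrites the input file in place
    with open(gff_filename, 'w') as gff_file:
        gff_file.writelines(lines)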
    

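    There are other ways of doing this. Note that the replacement is a plain substring match, so it assumes that no gene ID is a prefix of another (for example dd_g4_1G9 and dd_g4_1G94).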
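
As one possible alternative, the renumbering can be done in a single pass with re.sub and a dictionary, which sidesteps substring collisions between similar IDs. This is only a sketch and not part of the original answer: the function name rename_gene_ids and its prefix and width parameters are made up for illustration, and it assumes that gene IDs contain no '.', ';' or whitespace and that each Parent attribute holds a single ID.

```python
import re

# Matches the gene-level part of an ID= or Parent= attribute value:
# everything up to the first '.', ';' or whitespace character.
ATTR_RE = re.compile(r'\b(ID|Parent)=([^.;\s]+)')


def rename_gene_ids(lines, prefix='dd_', width=4):
    """Return new lines with gene IDs renumbered in order of first appearance.

    Hypothetical helper: names and defaults are illustrative assumptions.
    """
    mapping = {}  # old gene ID -> new gene ID

    def substitute(match):
        old_id = match.group(2)
        if old_id not in mapping:
            # Numbering starts at 1, zero-padded to `width` digits
            number = str(len(mapping) + 1).zfill(width)
            mapping[old_id] = prefix + number
        return '{}={}'.format(match.group(1), mapping[old_id])

    renamed = []
    for line in lines:
        if line.startswith('#'):
            # Leave comments and directives such as '###' untouched
            renamed.append(line)
        else:
            renamed.append(ATTR_RE.sub(substitute, line))
    return renamed
```

To apply it to a file, read the lines with open(), pass them through rename_gene_ids, and write the result with writelines(), preferably to a new file so the input is not overwritten.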