Calculating gene length from coordinates


My colleagues sent me a list of thousands of genes with coordinates. It looks like this:

NPHP4   Nephronophthisis 4, 606966 (3), Autosomal recessive; Senior-Loken syndrome 4, 606996 (3), Autosomal recessive   1:6021825:6022054,1:6008105:6008352,1:6046180:6046368,1:5937125:5937385,1:6012735:6012908,1:5993185:5993432,1:5934495:5934756,1:5950905:5951117,1:5927765:5927985,1:5965330:5965582,1:5934905:5935193,1:6007135:6007317,1:5947315:5947565,1:6027325:6027445,1:5969190:5969291,1:5923920:5924129,1:5940145:5940333,1:5964645:5964898,1:5987685:5987868,1:5925130:5925361,1:6038305:6038513,1:5923300:5923503,1:5965665:5965876,1:5967145:5967318,1:5933280:5933439,1:5924375:5924620,1:5927065:5927202,1:5926410:5926553,1:6029125:6029336
ESPN    Deafness, autosomal recessive 36, 609006 (3), Autosomal recessive; Deafness, neurosensory, without vestibular involvement, autosomal dominant (3)       1:6508675:6509172,1:6510480:6510561,1:6505700:6506011,1:6504515:6504761,1:6488250:6488500,1:6500280:6500530,1:6520035:6520244,1:6500660:6500893,1:6511640:6511845,1:6517250:6517357,1:6484990:6485353,1:6517385:6517459,1:6500970:6501143,1:6511860:6512157
PLEKHG5 Charcot-Marie-Tooth disease, recessive intermediate C, 615376 (3), Autosomal recessive; Spinal muscular atrophy, distal, autosomal recessive, 4, 611067 (3), Autosomal recessive        1:6529360:6529539,1:6531525:6531730,1:6532565:6532713,1:6530270:6530441,1:6556990:6557124,1:6529070:6529187,1:6527595:6527666,1:6533285:6533532,1:6537565:6537735,1:6529210:6529330,1:6534485:6534657,1:6579480:6579584,1:6533020:6533263,1:6545355:6545534,1:6531795:6531909,1:6527860:6528675,1:6530540:6530718,1:6534050:6534264,1:6535500:6535600,1:6531025:6531194,1:6530770:6530978,1:6556530:6556669,1:6535085:6535220,1:6536025:6536128,1:6529575:6529755,1:6557350:6557420
PARK7   Parkinson disease 7, autosomal recessive early-onset, 606324 (3), Autosomal recessive   1:8025355:8025499,1:8044895:8045137,1:8022820:8022967,1:8037705:8037817,1:8021880:8021956,1:8029380:8029481,1:8030930:8031036

The third column holds the coordinates: the chromosome number, followed by the start position and the end position, separated by ":". If a gene has several regions, they are separated by ",":

1:6021825:6022054,1:6008105:6008352,1:6046180:6046368,1:5937125:5937385,1:6012735:6012908,1:5993185:5993432,1:5934495:5934756,1:5950905:5951117,1:5927765:5927985,1:5965330:5965582,1:5934905:5935193,1:6007135:6007317,1:5947315:5947565,1:6027325:6027445,1:5969190:5969291,1:5923920:5924129,1:5940145:5940333,1:5964645:5964898,1:5987685:5987868,1:5925130:5925361,1:6038305:6038513,1:5923300:5923503,1:5965665:5965876,1:5967145:5967318,1:5933280:5933439,1:5924375:5924620,1:5927065:5927202,1:5926410:5926553,1:6029125:6029336
1:6508675:6509172,1:6510480:6510561,1:6505700:6506011,1:6504515:6504761,1:6488250:6488500,1:6500280:6500530,1:6520035:6520244,1:6500660:6500893,1:6511640:6511845,1:6517250:6517357,1:6484990:6485353,1:6517385:6517459,1:6500970:6501143,1:6511860:6512157
1:6529360:6529539,1:6531525:6531730,1:6532565:6532713,1:6530270:6530441,1:6556990:6557124,1:6529070:6529187,1:6527595:6527666,1:6533285:6533532,1:6537565:6537735,1:6529210:6529330,1:6534485:6534657,1:6579480:6579584,1:6533020:6533263,1:6545355:6545534,1:6531795:6531909,1:6527860:6528675,1:6530540:6530718,1:6534050:6534264,1:6535500:6535600,1:6531025:6531194,1:6530770:6530978,1:6556530:6556669,1:6535085:6535220,1:6536025:6536128,1:6529575:6529755,1:6557350:6557420
1:8025355:8025499,1:8044895:8045137,1:8022820:8022967,1:8037705:8037817,1:8021880:8021956,1:8029380:8029481,1:8030930:8031036

I need to compute the length of each region, i.e. the difference between its end and start positions, and, when a gene has several regions, the sum of those lengths, for each gene (each line). The number of regions differs from line to line. I tried to do this in Excel, but the number of fragments is too large and in some cases they are not even displayed. Is there a way to calculate this for each line, for example with a regular expression?

I expect the result as a fourth column. For example, if the third column is:

1:1167623:1168684

I expect:

1:1167623:1168684 1061

If the coordinates column is:

1:11907145:11907520,1:11906035:11906116,1:11907590:11907770

I expect:

1:11907145:11907520,1:11906035:11906116,1:11907590:11907770 636

Thanks a lot


Solution

  • This is fairly simple to do with Python. I have provided commented code below.

    d = """\
    NPHP4   Nephronophthisis 4, 606966 (3), Autosomal recessive; Senior-Loken syndrome 4, 606996 (3), Autosomal recessive   1:6021825:6022054,1:6008105:6008352,1:6046180:6046368,1:5937125:5937385,1:6012735:6012908,1:5993185:5993432,1:5934495:5934756,1:5950905:5951117,1:5927765:5927985,1:5965330:5965582,1:5934905:5935193,1:6007135:6007317,1:5947315:5947565,1:6027325:6027445,1:5969190:5969291,1:5923920:5924129,1:5940145:5940333,1:5964645:5964898,1:5987685:5987868,1:5925130:5925361,1:6038305:6038513,1:5923300:5923503,1:5965665:5965876,1:5967145:5967318,1:5933280:5933439,1:5924375:5924620,1:5927065:5927202,1:5926410:5926553,1:6029125:6029336
    ESPN    Deafness, autosomal recessive 36, 609006 (3), Autosomal recessive; Deafness, neurosensory, without vestibular involvement, autosomal dominant (3)       1:6508675:6509172,1:6510480:6510561,1:6505700:6506011,1:6504515:6504761,1:6488250:6488500,1:6500280:6500530,1:6520035:6520244,1:6500660:6500893,1:6511640:6511845,1:6517250:6517357,1:6484990:6485353,1:6517385:6517459,1:6500970:6501143,1:6511860:6512157
    PLEKHG5 Charcot-Marie-Tooth disease, recessive intermediate C, 615376 (3), Autosomal recessive; Spinal muscular atrophy, distal, autosomal recessive, 4, 611067 (3), Autosomal recessive        1:6529360:6529539,1:6531525:6531730,1:6532565:6532713,1:6530270:6530441,1:6556990:6557124,1:6529070:6529187,1:6527595:6527666,1:6533285:6533532,1:6537565:6537735,1:6529210:6529330,1:6534485:6534657,1:6579480:6579584,1:6533020:6533263,1:6545355:6545534,1:6531795:6531909,1:6527860:6528675,1:6530540:6530718,1:6534050:6534264,1:6535500:6535600,1:6531025:6531194,1:6530770:6530978,1:6556530:6556669,1:6535085:6535220,1:6536025:6536128,1:6529575:6529755,1:6557350:6557420
    PARK7   Parkinson disease 7, autosomal recessive early-onset, 606324 (3), Autosomal recessive   1:8025355:8025499,1:8044895:8045137,1:8022820:8022967,1:8037705:8037817,1:8021880:8021956,1:8029380:8029481,1:8030930:8031036
    FOOBAR 1:11907145:11907520,1:11906035:11906116,1:11907590:11907770
    """
    
    gene_rows = d.splitlines()
    
    for gene_row in gene_rows:
        # Name like "NPHP4"
        gene_name = gene_row.split()[0]
        # List like ["1:6021825:6022054", "1:6008105:6008352", ...]
        regions = gene_row.split()[-1].split(",")
        # Counter to hold our total gene length.
        gene_length = 0
        for region in regions:
            # Split "1:6021825:6022054" into "1", "6021825", and "6022054"
            chromosome, start, end = region.split(":")
            # Update the gene length counter with this region's length.
            region_length = int(end) - int(start)
            gene_length += region_length
        print(gene_name, gene_length)
    

    The output is

    NPHP4 5984
    ESPN 3296
    PLEKHG5 4685
    PARK7 928
    FOOBAR 636
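
The question asks for the length to be appended as a fourth column rather than printed next to the gene name. A minimal sketch of that variant follows. It assumes the coordinates are always the last whitespace-separated field on the line and that a region's length is `end - start`, as in the expected output in the question. The function names and the `GENE_A`/`GENE_B` rows are hypothetical, made up for illustration.

```python
def region_length(region):
    """Length of one "chrom:start:end" region, computed as end - start."""
    _chromosome, start, end = region.split(":")
    return int(end) - int(start)


def append_length(line):
    """Return the line with the summed region length added as a new tab-separated column."""
    line = line.rstrip("\r\n")
    # Assumption: the coordinate list is the last field and contains no spaces.
    coordinates = line.split()[-1]
    total = sum(region_length(region) for region in coordinates.split(","))
    return f"{line}\t{total}"


def add_length_column(lines):
    """Apply append_length to every non-blank line and return the new lines."""
    return [append_length(line) for line in lines if line.strip()]


# Hypothetical rows built from the two coordinate examples in the question.
example = [
    "GENE_A\tsome description\t1:1167623:1168684",
    "GENE_B\tsome description\t1:11907145:11907520,1:11906035:11906116,1:11907590:11907770",
]
for row in add_length_column(example):
    print(row)
# The first row ends with 1061 and the second with 636.
```

To process a real file, pass an open file object to `add_length_column`, since it accepts any iterable of lines. Blank lines are skipped, which avoids an `IndexError` from `split()[-1]` on an empty row.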