Search code examples
dataframelanguage-agnostic

Efficiently remove duplicated values within a column from a huge file (70GB)


Hello everyone I have a huge tabularfile (76GB) and I would like to remove some row of this file.

Here is the head :

CM009916.1_1:1212271-1212415(+):Neodiprion_lecontei A0A653DM45_Callosobruchus_maculatus_Insect  100.000 45  0   0   4   48  65  109 1.563E-24   98  110
CM009916.1_1:1212271-1212415(+):Neodiprion_lecontei Q1HR02_Aedes_aegypti_Insect 100.000 45  0   0   4   48  102 146 1.563E-24   98  147
CM009916.1_1:1212271-1212415(+):Neodiprion_lecontei Q7PRY1_Anopheles_gambiae_Insect 100.000 45  0   0   4   48  102 146 1.563E-24   98  147
CM009916.1_1:1212271-1212415(+):Neodiprion_lecontei P25867_Drosophila_melanogaster_Insect   100.000 45  0   0   4   48  102 146 1.563E-24   98  147
CM009916.1_1:1212271-1212415(+):Neodiprion_lecontei J3JXP2_Dendroctonus_ponderosae_Insect   100.000 45  0   0   4   48  102 146 1.563E-24   98  147

and I would like to use an efficient method (I cannot use pandas since the fact to load the file is very very long). In order to keep only the 2 first duplicated row in the first column.

So for instance if I have a file such as :

Note: the table is tabular separated.

Example ;

CM009916.1_1:1212271-1212415(+):Neodiprion_lecontei     A0A653DM45_Callosobruchus_maculatus_Insect      100.000 45      0       0       4       48      65      109     1.563E-24       98      110
CM009916.1_1:1212271-1212415(+):Neodiprion_lecontei     Q1HR02_Aedes_aegypti_Insect     100.000 45      0       0       4       48      102     146     1.563E-24       98      147
CM009916.1_1:1212271-1212415(+):Neodiprion_lecontei     Q7PRY1_Anopheles_gambiae_Insect 100.000 45      0       0       4       48      102     146     1.563E-24       98      147
CM009916.1_1:1212271-1212415(+):Neodiprion_lecontei     P25867_Drosophila_melanogaster_Insect   100.000 45      0       0       4       48      102     146     1.563E-24       98      147
CM009916.1_1:1212271-1212415(+):Neodiprion_lecontei     J3JXP2_Dendroctonus_ponderosae_Insect   100.000 45      0       0       4       48      102     146     1.563E-24       98      147
CM009916.1_1:1212271-1212415(+):Neodiprion_lecontei     A0A0L7KYD3_Operophtera_brumata_Insect   100.000 45      0       0       4       48      102     146     1.563E-24       98      147
CM009917.1_1:1212271-1212415(+):Neodiprion_lecontei     A0A653DM45_Callosobruchus_maculatus_Insect      100.000 45      0       0       4       48      65      109     1.563E-24       98      110
CM009917.1_1:1212271-1212415(+):Neodiprion_lecontei     Q1HR02_Aedes_aegypti_Insect     100.000 45      0       0       4       48      102     146     1.563E-24       98      147
CM009917.1_1:1212271-1212415(+):Neodiprion_lecontei     Q7PRY1_Anopheles_gambiae_Insect 100.000 45      0       0       4       48      102     146     1.563E-24       98      147
CM009917.1_1:1212271-1212415(+):Neodiprion_lecontei     P25867_Drosophila_melanogaster_Insect   100.000 45      0       0       4       48      102     146     1.563E-24       98      147
CM009917.1_1:1212271-1212415(+):Neodiprion_lecontei     J3JXP2_Dendroctonus_ponderosae_Insect   100.000 45      0       0       4       48      102     146     1.563E-24       98      147

expected output :

CM009916.1_1:1212271-1212415(+):Neodiprion_lecontei     A0A653DM45_Callosobruchus_maculatus_Insect      100.000 45      0       0       4       48      65      109     1.563E-24       98      110
CM009916.1_1:1212271-1212415(+):Neodiprion_lecontei     Q1HR02_Aedes_aegypti_Insect     100.000 45      0       0       4       48      102     146     1.563E-24       98      147
CM009917.1_1:1212271-1212415(+):Neodiprion_lecontei     A0A653DM45_Callosobruchus_maculatus_Insect      100.000 45      0       0       4       48      65      109     1.563E-24       98      110
CM009917.1_1:1212271-1212415(+):Neodiprion_lecontei     Q1HR02_Aedes_aegypti_Insect     100.000 45      0       0       4       48      102     146     1.563E-24       98      147

Solution

  • awk to the rescue!

    If your file is sorted as in the example, this script doesn't take any memory.

    If the file is tab separated and first field doesn't have any spaces you don't need to specify the field separator

    $ awk 'p!=$1{c=2; p=$1} c&&c--' file
    

    after each key change capture the new key and print 2 lines.