Hello everyone, I have a huge tabular file (76 GB) and I would like to remove some rows from it.
Here is the head:
CM009916.1_1:1212271-1212415(+):Neodiprion_lecontei A0A653DM45_Callosobruchus_maculatus_Insect 100.000 45 0 0 4 48 65 109 1.563E-24 98 110
CM009916.1_1:1212271-1212415(+):Neodiprion_lecontei Q1HR02_Aedes_aegypti_Insect 100.000 45 0 0 4 48 102 146 1.563E-24 98 147
CM009916.1_1:1212271-1212415(+):Neodiprion_lecontei Q7PRY1_Anopheles_gambiae_Insect 100.000 45 0 0 4 48 102 146 1.563E-24 98 147
CM009916.1_1:1212271-1212415(+):Neodiprion_lecontei P25867_Drosophila_melanogaster_Insect 100.000 45 0 0 4 48 102 146 1.563E-24 98 147
CM009916.1_1:1212271-1212415(+):Neodiprion_lecontei J3JXP2_Dendroctonus_ponderosae_Insect 100.000 45 0 0 4 48 102 146 1.563E-24 98 147
I would like an efficient method (I cannot use pandas, since loading the file takes a very long time) to keep only the first 2 duplicated rows for each value in the first column.

Note: the table is tab-separated.

So for instance, if I have a file such as:
CM009916.1_1:1212271-1212415(+):Neodiprion_lecontei A0A653DM45_Callosobruchus_maculatus_Insect 100.000 45 0 0 4 48 65 109 1.563E-24 98 110
CM009916.1_1:1212271-1212415(+):Neodiprion_lecontei Q1HR02_Aedes_aegypti_Insect 100.000 45 0 0 4 48 102 146 1.563E-24 98 147
CM009916.1_1:1212271-1212415(+):Neodiprion_lecontei Q7PRY1_Anopheles_gambiae_Insect 100.000 45 0 0 4 48 102 146 1.563E-24 98 147
CM009916.1_1:1212271-1212415(+):Neodiprion_lecontei P25867_Drosophila_melanogaster_Insect 100.000 45 0 0 4 48 102 146 1.563E-24 98 147
CM009916.1_1:1212271-1212415(+):Neodiprion_lecontei J3JXP2_Dendroctonus_ponderosae_Insect 100.000 45 0 0 4 48 102 146 1.563E-24 98 147
CM009916.1_1:1212271-1212415(+):Neodiprion_lecontei A0A0L7KYD3_Operophtera_brumata_Insect 100.000 45 0 0 4 48 102 146 1.563E-24 98 147
CM009917.1_1:1212271-1212415(+):Neodiprion_lecontei A0A653DM45_Callosobruchus_maculatus_Insect 100.000 45 0 0 4 48 65 109 1.563E-24 98 110
CM009917.1_1:1212271-1212415(+):Neodiprion_lecontei Q1HR02_Aedes_aegypti_Insect 100.000 45 0 0 4 48 102 146 1.563E-24 98 147
CM009917.1_1:1212271-1212415(+):Neodiprion_lecontei Q7PRY1_Anopheles_gambiae_Insect 100.000 45 0 0 4 48 102 146 1.563E-24 98 147
CM009917.1_1:1212271-1212415(+):Neodiprion_lecontei P25867_Drosophila_melanogaster_Insect 100.000 45 0 0 4 48 102 146 1.563E-24 98 147
CM009917.1_1:1212271-1212415(+):Neodiprion_lecontei J3JXP2_Dendroctonus_ponderosae_Insect 100.000 45 0 0 4 48 102 146 1.563E-24 98 147
Expected output:
CM009916.1_1:1212271-1212415(+):Neodiprion_lecontei A0A653DM45_Callosobruchus_maculatus_Insect 100.000 45 0 0 4 48 65 109 1.563E-24 98 110
CM009916.1_1:1212271-1212415(+):Neodiprion_lecontei Q1HR02_Aedes_aegypti_Insect 100.000 45 0 0 4 48 102 146 1.563E-24 98 147
CM009917.1_1:1212271-1212415(+):Neodiprion_lecontei A0A653DM45_Callosobruchus_maculatus_Insect 100.000 45 0 0 4 48 65 109 1.563E-24 98 110
CM009917.1_1:1212271-1212415(+):Neodiprion_lecontei Q1HR02_Aedes_aegypti_Insect 100.000 45 0 0 4 48 102 146 1.563E-24 98 147
awk to the rescue!

If your file is sorted as in the example, this script uses essentially no memory.
If the file is tab-separated and the first field doesn't contain any spaces, you don't need to specify the field separator.
$ awk 'p!=$1{c=2; p=$1} c&&c--' file
After each key change, it captures the new key in `p` and resets the counter `c`, so the next 2 lines for that key are printed.
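As a quick sanity check, the one-liner can be tried on a small tab-separated file first (the file name `example.tsv` and the short keys below are just placeholders standing in for your real data):

```shell
# Build a tiny tab-separated sample: 3 rows per key (placeholder keys)
printf 'key1\ta\nkey1\tb\nkey1\tc\nkey2\td\nkey2\te\nkey2\tf\n' > example.tsv

# Keep only the first 2 rows for each value of column 1
awk 'p!=$1{c=2; p=$1} c&&c--' example.tsv
# key1	a
# key1	b
# key2	d
# key2	e
```

Since the script only remembers the previous key and a counter, it streams through the 76 GB file in one pass; it just relies on all rows sharing a key being adjacent, as in your example.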