macos unix terminal text-editor large-data

OSX terminal text editing trick for a big file (over 10GB)


I have text files in the range of 10-50GB. I need to edit the first several lines of these files as follows:

Original:

>Aura head -n 2  042319_S6_L001_R1_001.fastq.recovered 
==> 042319_S6_L001_R1_001.fastq.recovered <==
9�C�{a��e�T�l1�{jz7?\^tZ[1�Wvcb���]zj�\,����~
zT'zT'zT'zT'zT'zT'zT'zT'zT'zT'zTfŌȊ���@hYM�rkdt�t?��av��B�,KII9]�Hϛ�[�ada[�SY�o��|>K�H���k��%���'
                                                                                                 �LDTM&Ãd�XQ@A00165:69:HKJ3YDMXX:1:1101:4390:1266 1:N:0:CATGAACA
AGTTAGCTCACCATGATGAAACAAGACT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00165:69:HKJ3YDMXX:1:1101:4896:1266 1:N:0:CATGAACA
TATCTTGTCACGATACTCAACATGTGGA
+
FFFFFFFFFFF:FFFFFFFFFFF:FFFF
@A00165:69:HKJ3YDMXX:1:1101:6307:1266 1:N:0:CATGAACA

Desired output:

>Aura head -n 2  042319_S6_L001_R1_001.fastq.recovered 
==> 042319_S6_L001_R1_001.fastq.recovered <==
@A00165:69:HKJ3YDMXX:1:1101:4390:1266 1:N:0:CATGAACA
AGTTAGCTCACCATGATGAAACAAGACT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00165:69:HKJ3YDMXX:1:1101:4896:1266 1:N:0:CATGAACA
TATCTTGTCACGATACTCAACATGTGGA
+
FFFFFFFFFFF:FFFFFFFFFFF:FFFF
@A00165:69:HKJ3YDMXX:1:1101:6307:1266 1:N:0:CATGAACA

I tried to do this with nano, but it takes too long to load the entire multi-gigabyte file. I tried splitting the files with split, but the merged files had corrupted lines for some reason. I'd appreciate any pointers or tricks for this.

Updated:

The top two lines look as though they are part of a file header, but this file has no binary file header, so I guess I am lucky. On the other hand, the junk is not static text: the length and content of these lines differ in each file.


Solution

  • Incorporating @shellter's comments and help, the simplest script I came up with to strip the non-ASCII junk and produce the desired output was the following:

    gsed -n 'l0' testfile | gsed 's/.*@/@/' |  gsed '1,2d' | gsed 's/[$]//g'
    

    I use GNU sed (gsed), installed via Homebrew (brew install gnu-sed), instead of macOS's BSD sed.
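
A different approach, sketched here only as an alternative and not something from the original thread, is to find the byte offset of the first real FASTQ header and copy everything from that byte onward, so no line of the file is rewritten. It assumes GNU grep (where -b combined with -o reports the offset of the match itself) and assumes the first valid record begins with a known marker such as the instrument prefix @A00165: seen in the sample above. The function name strip_prefix and the GREP variable are hypothetical names made up for this sketch.

```shell
#!/bin/sh
# Sketch: drop all bytes before the first occurrence of a marker string.
# Assumption: GNU grep semantics for -b with -o (offset of the match itself).
# On macOS with Homebrew's GNU grep, set GREP=ggrep before running.
GREP=${GREP:-grep}

strip_prefix() {
    infile=$1    # large input file
    marker=$2    # literal text that starts the first valid record
    outfile=$3   # cleaned copy is written here

    # -a: treat binary data as text, -b: print byte offset, -o: match only,
    # -m 1: stop after the first matching line, -F: fixed string.
    # LC_ALL=C makes grep work on raw bytes instead of decoding characters.
    offset=$(LC_ALL=C "$GREP" -a -b -o -m 1 -F -- "$marker" "$infile" |
        head -n 1 | cut -d: -f1)

    if [ -z "$offset" ]; then
        echo "marker not found in $infile" >&2
        return 1
    fi

    # grep offsets are 0-based; tail -c +N starts at byte N counted from 1.
    tail -c "+$((offset + 1))" "$infile" > "$outfile"
}

# Example call (file name taken from the question, output name is made up):
# strip_prefix 042319_S6_L001_R1_001.fastq.recovered '@A00165:' cleaned.fastq
```

The marker has to be specific enough that it cannot occur inside the junk. A bare @ would not be safe here, since the junk in the sample already contains one before the real header.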