I have text files in the range of 10-50 GB. I need to edit the first several lines of these files as follows.
Original:
>Aura head -n 2 042319_S6_L001_R1_001.fastq.recovered
==> 042319_S6_L001_R1_001.fastq.recovered <==
9�C�{a��e�T�l1�{jz7?\^tZ[1�Wvcb���]zj�\,����~
zT'zT'zT'zT'zT'zT'zT'zT'zT'zT'zTfŌȊ���@hYM�rkdt�t?��av��B�,KII9]�Hϛ�[�ada[�SY�o��|>K�H���k��%���'
�LDTM&Ãd�XQ@A00165:69:HKJ3YDMXX:1:1101:4390:1266 1:N:0:CATGAACA
AGTTAGCTCACCATGATGAAACAAGACT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00165:69:HKJ3YDMXX:1:1101:4896:1266 1:N:0:CATGAACA
TATCTTGTCACGATACTCAACATGTGGA
+
FFFFFFFFFFF:FFFFFFFFFFF:FFFF
@A00165:69:HKJ3YDMXX:1:1101:6307:1266 1:N:0:CATGAACA
Desired output:
>Aura head -n 2 042319_S6_L001_R1_001.fastq.recovered
==> 042319_S6_L001_R1_001.fastq.recovered <==
@A00165:69:HKJ3YDMXX:1:1101:4390:1266 1:N:0:CATGAACA
AGTTAGCTCACCATGATGAAACAAGACT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00165:69:HKJ3YDMXX:1:1101:4896:1266 1:N:0:CATGAACA
TATCTTGTCACGATACTCAACATGTGGA
+
FFFFFFFFFFF:FFFFFFFFFFF:FFFF
@A00165:69:HKJ3YDMXX:1:1101:6307:1266 1:N:0:CATGAACA
I tried to do this with nano, but it takes too long to load a multi-gigabyte file. I tried splitting the files with split,
but the merged files have corrupted lines for some reason. I'd appreciate any pointers or tricks for this.
Updated:
The top two lines look as though they are part of a file header, but there are no binary file headers in this file, so I guess I am lucky. On the other hand, this is not static text: the length and content of these lines differ in each file.
Incorporating @shellter's comments and help, the easiest script I came up with to get rid of the non-ASCII junk and get the desired output was the following:
gsed -n 'l0' testfile | gsed 's/.*@/@/' | gsed '1,2d' | gsed 's/[$]//g'
I use GNU sed (gsed, installed with brew install gnu-sed) instead of OS X's BSD sed.
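For comparison, here is a rough sketch of a streaming alternative that leaves the read lines untouched: find the byte offset of the first real header, then copy everything from that byte onward. It assumes GNU grep (ggrep from Homebrew on a Mac), and that the first real header starts with the instrument ID @A00165: as in the sample above. The function name strip_leading_junk is made up for illustration. I have not tried it on a full 10-50 GB file.

```shell
#!/bin/sh
# Hypothetical helper: print FILE starting at the first occurrence of the
# literal string PATTERN, dropping whatever binary junk comes before it.
# Assumes GNU grep semantics for -b combined with -o.
strip_leading_junk() {
    file=$1
    pattern=$2
    # -a: treat binary data as text      -b: print the byte offset
    # -o: print only the matched part    -m 1: stop after the first matching line
    # -F: take PATTERN as a fixed string, not a regular expression
    offset=$(LC_ALL=C grep -a -b -o -m 1 -F -e "$pattern" "$file" \
        | head -n 1 | cut -d: -f1)
    if [ -z "$offset" ]; then
        echo "pattern not found in $file" >&2
        return 1
    fi
    # grep offsets are 0-based; tail -c +N starts at byte N counted from 1.
    tail -c "+$((offset + 1))" "$file"
}

# Example call (file names are placeholders):
#   strip_leading_junk in.fastq.recovered '@A00165:' > out.fastq
```

Matching on the full instrument prefix rather than a bare @ matters here, because the junk in the sample also contains an @ (the "@hYM" run on the second line). Since only the leading bytes are skipped, quality lines that happen to contain @ or $ are copied through unchanged.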