Tags: sed, grep, line, gff

Replace multiple lines in one file with the same lines at the same line numbers in another file?


I have a modified gff file, and it is missing some lines that are present in the original gff file. I want to add those back in.

For example, the original gff file has the extra lines "# Fasta ..." and "##sequence-region" before each new contig:

1     # Fasta definition line: >contig00047
2     ##sequence-region
3     contig00047 AUGUSTUS old annotation
4     contig00047 AUGUSTUS old annotation
5     contig00047 AUGUSTUS old annotation
6     contig00047 AUGUSTUS old annotation
7     contig00047 AUGUSTUS old annotation
8     contig00047 AUGUSTUS old annotation
9     # Fasta definition line: >contig00048
10   ##sequence-region
11   contig00048 AUGUSTUS old annotation
12   contig00048 AUGUSTUS old annotation
13   contig00048 AUGUSTUS old annotation
14   contig00048 AUGUSTUS old annotation

And here is the new modified gff file format, missing those extra lines:

1     contig00047 AUGUSTUS new annotation
2     contig00047 AUGUSTUS new annotation
3     contig00047 AUGUSTUS new annotation
4     contig00047 AUGUSTUS new annotation
5     contig00047 AUGUSTUS new annotation
6     contig00047 AUGUSTUS new annotation
7     contig00048 AUGUSTUS new annotation
8     contig00048 AUGUSTUS new annotation
9     contig00048 AUGUSTUS new annotation
10   contig00048 AUGUSTUS new annotation

And this is what I want:

1     # Fasta definition line: >contig00047
2     ##sequence-region
3     contig00047 AUGUSTUS new annotation
4     contig00047 AUGUSTUS new annotation
5     contig00047 AUGUSTUS new annotation
6     contig00047 AUGUSTUS new annotation
7     contig00047 AUGUSTUS new annotation
8     contig00047 AUGUSTUS new annotation
9     # Fasta definition line: >contig00048
10   ##sequence-region
11   contig00048 AUGUSTUS new annotation
12   contig00048 AUGUSTUS new annotation
13   contig00048 AUGUSTUS new annotation
14   contig00048 AUGUSTUS new annotation

I had read the original file into R and updated the annotations, but that dropped the lines that began with '#'. I need those back for my gff to be valid. I tried using grep to get the line numbers of all the lines in the original gff that began with #:

$ grep -n '^#' Renamed_Blast2GO_gff_without_contig.gff | cut -f1 -d: > line.txt

Then I opened line.txt in gedit and replaced every '\n' with 'G;' to get one long string on line 1. Then I used sed to add an empty line to the modified gff file after each line number listed in that string:

$ sed '<paste line 1 string here>' mod2_gff.gff
e.g.,
$ sed '1G;2G;9G;10G' mod2_gff.gff # My file is actually really big, so this gets quite long, but it still works.

Now I want to replace the empty lines in the modified file with the corresponding lines from the original file. I have tried various things but haven't been able to get it to work. The string "##sequence-region" is not unique, so a key-value set-up won't work in this case. Would it be possible to go through the file line by line, detect when the next line has a new contig number, and then insert two lines above it: a matching "# Fasta definition line" and the "##sequence-region" line?

Thank you all for any help you can provide!


Solution

  • awk to the rescue!

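    Just add the missing headers to the new file: whenever the first field (the contig name) differs from the one on the previous line, print the two header lines, then print the line itself. Here `file` is the modified gff.

    $ awk 'p!=$1 {print "# Fasta definition line: >" $1;
                  print "##sequence-region";
                  p=$1}1' file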
    
    # Fasta definition line: >contig00047
    ##sequence-region
    contig00047 AUGUSTUS new annotation
    contig00047 AUGUSTUS new annotation
    contig00047 AUGUSTUS new annotation
    contig00047 AUGUSTUS new annotation
    contig00047 AUGUSTUS new annotation
    contig00047 AUGUSTUS new annotation
    # Fasta definition line: >contig00048
    ##sequence-region
    contig00048 AUGUSTUS new annotation
    contig00048 AUGUSTUS new annotation
    contig00048 AUGUSTUS new annotation
    contig00048 AUGUSTUS new annotation
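
If the header lines cannot be regenerated from the contig name (for instance, if the real "##sequence-region" lines carry extra fields), the approach in the question title, copying the '#' lines from the original file back in at their original line numbers, can also be done in awk. The following is only a sketch: the file names original.gff, modified.gff and restored.gff and the shortened sample data are made up for illustration, and it assumes that the modified file differs from the original only by the missing '#' lines.

```shell
# Hypothetical, shortened sample data standing in for the real gff files.
cat > original.gff <<'EOF'
# Fasta definition line: >contig00047
##sequence-region
contig00047 AUGUSTUS old annotation
contig00047 AUGUSTUS old annotation
# Fasta definition line: >contig00048
##sequence-region
contig00048 AUGUSTUS old annotation
EOF

cat > modified.gff <<'EOF'
contig00047 AUGUSTUS new annotation
contig00047 AUGUSTUS new annotation
contig00048 AUGUSTUS new annotation
EOF

# First file (original.gff): remember each '#' line under its line number.
# Second file (modified.gff): n tracks the next output line number; before
# each data line, emit any stored header lines that belong at that position.
# END: emit header lines that follow the last data line, if there are any.
awk 'NR == FNR { if (/^#/) hdr[FNR] = $0; next }
     { while ((++n) in hdr) print hdr[n]; print }
     END { while ((++n) in hdr) print hdr[n] }' original.gff modified.gff > restored.gff

cat restored.gff
```

Because the header lines are taken verbatim from the original, this keeps whatever they actually contain. The trade-off is that it relies on line positions, so it would misplace headers if annotation rows were added or removed in R.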