
Convert row-based entries to column-based format in shell


I need help converting a multi-row entry into columns on a single line, and doing the same for every entry in the file.

File example (showing only 2 entries; there are many more like these):

>ABC
*
AGA-AUUCUC-CGGUUCAAUCU
|||
UCUAUAACCGCGCCGAGUUAGU

>ABC
*
AGAUAU-GCUGCAGGCUCAAUUG
||||||
UCUAUAACCGCG-CCGAGUUAGU

File format required:

>ABC AGA-AUUCUC-CGGUUCAAUCU UCUAUAACCGCGCCGAGUUAGU
>ABC AGAUAU-GCUGCAGGCUCAAUUG UCUAUAACCGCG-CCGAGUUAGU

I am able to convert a single entry into the required format with:

tr '\n' '\t' <test3 | awk '{print $1,$3,$5}'

But how do I do this for all entries when reading the whole file?


Solution

  • You were on the right track with your original awk approach. Try this, which I think is both readable and effective:

    awk 'BEGIN { RS="\n\n" } ; { print $1, $3, $5 }' < myfile
    

    The idea is to tell awk to treat blank lines (two consecutive newlines) as record separators. Each stanza then becomes a single record, and whitespace (here, the single newlines within a stanza) separates its fields: $1 is the header, $3 and $5 are the two sequences, and $2 and $4 (the alignment marker lines) are skipped. This is much like what you were doing with tr, except that awk now runs through the whole file, processing one stanza at a time.
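
A possible portable variant, offered as a sketch rather than a replacement: POSIX leaves a multi-character RS unspecified (gawk and mawk treat it as a regular expression, but other awks may use only its first character). Setting RS to the empty string selects awk's "paragraph mode" instead, where runs of blank lines separate records and a newline always acts as a field separator. The file name sample.txt below is a placeholder, not from the question, and the data is the two-entry example from the question.

```shell
# Build a small test file from the two stanzas shown in the question.
# (sample.txt is a placeholder name chosen for this sketch.)
cat > sample.txt <<'EOF'
>ABC
*
AGA-AUUCUC-CGGUUCAAUCU
|||
UCUAUAACCGCGCCGAGUUAGU

>ABC
*
AGAUAU-GCUGCAGGCUCAAUUG
||||||
UCUAUAACCGCG-CCGAGUUAGU
EOF

# RS="" enables paragraph mode: blank lines end a record, and newlines
# split fields in addition to spaces and tabs.
# Fields per stanza: $1 header, $2 marker line, $3 first sequence,
# $4 alignment bars, $5 second sequence.
result=$(awk 'BEGIN { RS = "" } { print $1, $3, $5 }' sample.txt)
printf '%s\n' "$result"
```

Paragraph mode also tolerates more than one blank line between stanzas and ignores leading or trailing blank lines, which a literal "\n\n" separator does not guarantee.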