
How can I merge multiple lines to create exactly two records based on field separators?


I need help writing a Unix script loop to process the following data:

200250|Wk50|200212|January|20024|Quarter4|2002|2002
|2003-01-12
|2003-01-18
|2003-01-05
|2003-02-01
|2002-11-03
|2003-02-01|
|2003-02-01|||||||
200239|Wk39|200209|October|20023|Quarter3|2002|2002
|2002-10-27
|2002-11-02
|2002-10-06
|2002-11-02
|2002-08-04
|2002-11-02|
|2003-02-01|||||||

I have data in the above format in a text file. I need to remove the newline at the end of every line whose following line begins with |. The output I need is:

200250|Wk50|200212|January|20024|Quarter4|2002|2002|2003-01-12|2003-01-18|2003-01-05|2003-02-01|2002-11-03|2003-02-01||2003-02-01|||||||
200239|Wk39|200209|October|20023|Quarter3|2002|2002|2002-10-27|2002-11-02|2002-10-06|2002-11-02|2002-08-04|2002-11-02||2003-02-01|||||||

I need some help to achieve this. These shell commands are giving me nightmares!


Solution

  • Here is an awk solution:

    $ awk 'substr($0,1,1)=="|"{printf "%s", $0; next} {printf "\n%s", $0} END{print ""}' data
    
    200250|Wk50|200212|January|20024|Quarter4|2002|2002|2003-01-12|2003-01-18|2003-01-05|2003-02-01|2002-11-03|2003-02-01||2003-02-01|||||||
    200239|Wk39|200209|October|20023|Quarter3|2002|2002|2002-10-27|2002-11-02|2002-10-06|2002-11-02|2002-08-04|2002-11-02||2003-02-01|||||||
    

    Explanation:

    Awk implicitly loops through every line in the file.

    • substr($0,1,1)=="|"{printf "%s", $0; next}

      If this line begins with a vertical bar, print it without a trailing newline and skip to the next line. We use printf here rather than the more common print because print appends a newline by default, whereas printf emits one only when we explicitly ask for it. The explicit "%s" format keeps any % in the data from being interpreted as a format specifier.

    • {printf "\n%s", $0}

      If the line does not begin with a vertical bar, print a newline followed by this line (again without a trailing newline).

    • END{print""}

      At the end of the file, print a newline.
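The explanation above relies on the difference between print and printf. A minimal sketch of that difference, using a made-up two-line input (the variable names with_print and with_printf are only for illustration):

```shell
# Hypothetical two-line input, used only for illustration.
# print appends a newline (the default output record separator) after each line.
with_print=$(printf 'a\n|b\n' | awk '{print $0}')

# printf with "%s" writes the line as-is and adds no newline of its own.
with_printf=$(printf 'a\n|b\n' | awk '{printf "%s", $0}')

echo "$with_print"    # two lines: "a", then "|b"
echo "$with_printf"   # one line: "a|b"
```

Because printf leaves the line unterminated, the script itself decides where each newline goes, which is what lets the continuation lines be glued onto the record before them.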

    Refinement

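    The above prints an extra newline at the beginning of the output. If that is a problem, it can be eliminated with a minor change: the variable new is empty when the first record is printed and is set to a newline afterwards, so a newline is emitted only between records.

    $ awk 'substr($0,1,1)=="|"{printf "%s", $0; next} {printf "%s%s", new, $0; new="\n"} END{print ""}' data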
    200250|Wk50|200212|January|20024|Quarter4|2002|2002|2003-01-12|2003-01-18|2003-01-05|2003-02-01|2002-11-03|2003-02-01||2003-02-01|||||||
    200239|Wk39|200209|October|20023|Quarter3|2002|2002|2002-10-27|2002-11-02|2002-10-06|2002-11-02|2002-08-04|2002-11-02||2003-02-01|||||||
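
To try the approach without touching real data, the following sketch writes a shortened, made-up version of the sample to a scratch file and checks that exactly two records come out. The file contents and the variable names data and merged are assumptions for illustration, not part of the original question.

```shell
# Hypothetical scratch file holding a shortened version of the sample data.
data=$(mktemp)
cat > "$data" <<'EOF'
200250|Wk50|200212
|2003-01-12
|2003-01-18|
|2003-02-01|||
200239|Wk39|200209
|2002-10-27
|2002-11-02|
|2003-02-01|||
EOF

# Lines starting with "|" are appended to the current record;
# any other line starts a new record, preceded by a newline
# except for the very first one (new is still empty then).
merged=$(awk 'substr($0,1,1)=="|"{printf "%s", $0; next}
              {printf "%s%s", new, $0; new="\n"}
              END{print ""}' "$data")

printf '%s\n' "$merged"            # the merged records
printf '%s\n' "$merged" | wc -l    # number of records produced
rm -f "$data"
```

Counting the output lines with wc -l is a quick sanity check that every continuation line was attached to a record rather than left on its own.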