regex awk sed grep

Merge multi-line cell in double quotes


I have a TSV (tab-separated) file with 2 columns. The first column is a single word (or a group of words) and the second column is its meaning.

test file

test    try
test    "a short exam to measure somebody's knowledge 
or skill in something."
testing examine

I am trying to merge the second and third lines because together they form one cell enclosed in double quotes. For example:

Expected Output

test    try
test    "a short exam to measure somebody's knowledge or skill in something."
testing examine

I tried this:

awk -v FS='\t' -v OFS='\t' '{print $1, $2}' test.tsv
test    try
test    "a short exam to measure somebody's knowledge
or skill in something."
testing examine

But it does not merge lines 2 and 3. I also tried patsplit, and that merged all the lines together.

awk 'BEGIN { FS=OFS="\t"}
{
    if (patsplit($0,a,/"[^"]+"/,s)) {
        gsub(/\n/,"",a[1])
        printf "%s%s%s", s[0],a[1],s[1]
    }
    else
        printf "%s", $0
    printf ";"
}' test.tsv

I need to keep the tab-separated format of the original file. The only change required is to merge text enclosed in a pair of double quotes onto one line.


Solution

  • You can set the output record separator to an empty string when the second field begins with a double quote, and set it to a newline again when the record ends with a double quote:

    awk -F'\t' '$2~/^"/{ORS=""}/"$/{ORS="\n"}1' test.tsv
    

    Demo: https://awk.js.org/?snippet=nEx499

    To generalize this so that a multi-line cell enclosed in double quotes is merged in any column, set the output record separator to an empty string when a line ends inside an unterminated double-quoted string, and set it back to a newline when a line contains a terminating double quote (one followed by a tab or by the end of the line):

    awk '/"($|\t)/{ORS="\n"}/(^|\t)"[^\t"]*$/{ORS=""}1' test.tsv
    

    Demo: https://awk.js.org/?snippet=LzpEGA
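
As a variation on the same ORS idea, here is a minimal sketch that tracks quote parity instead of matching cell boundaries with a regex. It is not from the answer above. `merge_quoted` and `in_quote` are hypothetical names, and it assumes that a literal double quote inside a cell is either absent or doubled (`""`), so every complete cell contributes an even number of quotes.

```shell
# Hypothetical helper (not part of the original answer): join physical lines
# while a double-quoted cell is still open, by tracking quote parity.
merge_quoted() {
    awk '{
        n = gsub(/"/, "&")                  # count the double quotes on this line
        if (n % 2) in_quote = !in_quote     # an odd count opens or closes a cell
        ORS = in_quote ? "" : "\n"          # no newline while a cell is open
        print
    }' "$@"
}

# Sample input from the question, rebuilt with explicit tabs.
printf 'test\ttry\ntest\t"a short exam to measure somebody'\''s knowledge \nor skill in something."\ntesting\texamine\n' | merge_quoted
```

On the sample this should print the three lines of the expected output. Like the one-liners above, it joins the pieces with nothing in between, so it relies on the trailing space that is already at the end of the broken line. It reads from a file as well: `merge_quoted test.tsv`.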