regex awk sed grep

Merge multi-line cell in double quotes


I have a TSV (tab-separated) file with 2 columns. The first column is a word (or group of words) and the second column is its meaning.

test file

test    try
test    "a short exam to measure somebody's knowledge 
or skill in something."
testing examine

I am trying to merge the second and third lines because together they form a single cell enclosed in double quotes, e.g.:

Expected Output

test    try
test    "a short exam to measure somebody's knowledge or skill in something."
testing examine

I tried this:

awk -v FS='\t' -v OFS='\t' '{print $1, $2}' test.tsv
test    try
test    "a short exam to measure somebody's knowledge
or skill in something."
testing examine

But it does not merge lines 2 and 3. I also tried patsplit, and that merged all of the lines together.

awk 'BEGIN { FS=OFS="\t"}
{
    if (patsplit($0,a,/"[^"]+"/,s)) {
        gsub(/\n/,"",a[1])
        printf "%s%s%s", s[0],a[1],s[1]
    }
    else
        printf "%s", $0
    printf ";"
}' test.tsv

I need to keep the tab-separated format of the original file. The only change required is to merge text that sits between a pair of double quotes.


Solution

  • To just replace each newline within quoted fields with a blank character, using GNU awk for multi-char RS and RT:

    $ awk -v RS='"[^"]*"' '{gsub(/\n/," ",RT); ORS=RT} 1' file
    test    try
    test    "a short exam to measure somebody's knowledge or skill in something."
    testing examine
    

    The above will work no matter where the double quotes appear in your input, even if your quoted string includes double quotes that have been escaped by doubling them, as is common in quoted fields in CSVs, TSVs, etc., e.g.:

    $ cat file
    test    try
    test    a short exam to measure somebody's "knowledge
    or skill" in something.
    testing examine
    
    test    try
    test    "a short exam to measure somebody's ""knowledge""
    or skill in something."
    testing examine
    

    $ awk -v RS='"[^"]*"' '{gsub(/\n/," ",RT); ORS=RT} 1' file
    test    try
    test    a short exam to measure somebody's "knowledge or skill" in something.
    testing examine
    
    test    try
    test    "a short exam to measure somebody's ""knowledge"" or skill in something."
    testing examine
    

    See "What's the most robust way to efficiently parse CSV using awk?" for more info on parsing CSVs (which can also be applied to TSVs) with awk.

    In response to the comments below - the awk command is doing exactly 1 thing every time: replacing each \n with a blank. That's what gsub(/\n/," ",...) does. If that's not what you want, then don't do that and do whatever you want instead, but you never said in your question how you want to merge the lines, so I had to guess at something.

    I'd recommend you don't just remove newlines as the other solutions do, as that will concatenate words whenever there aren't spaces around the \ns. Maybe you want gsub(/[[:space:]]*\n[[:space:]]*/," ",...) or similar; I don't know.
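
To make the moving parts of that one-liner easier to see, here is the same idea written out with comments, using the white-space-squeezing gsub() suggested above. It is only a sketch: the function name join_quoted_newlines is made up for illustration, and it needs GNU awk (invoked here as gawk), since a regexp RS and the RT variable are gawk extensions.

```shell
# Replace each newline inside a double-quoted string with one blank.
# Assumption: GNU awk is installed as "gawk".
join_quoted_newlines() {
    gawk -v RS='"[^"]*"' '
    {
        # RT holds the text that matched RS, i.e. the quoted string that
        # ended this record. Squeeze each newline in it, together with any
        # surrounding white space, into a single blank.
        gsub(/[[:space:]]*\n[[:space:]]*/, " ", RT)
        # Print the record followed by the cleaned-up quoted string.
        # After the last record RT is empty, so nothing extra is appended.
        ORS = RT
        print
    }' "$@"
}
```

It can be called as `join_quoted_newlines file`, or at the end of a pipeline with no arguments.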

    Here's some other input to consider:

    $ cat file
    test    new first
    test    "a short exam to measure somebody's knowledge
    or skill in something."
    testing examine
    
    test    new second
    test    "a short exam to measure somebody's knowledge
            or skill in something."
    testing examine
    

    The first one "new first" does not have a blank at the end of the line after knowledge and the second one "new second" has a tab at the start of the line before or. Now let's put all of the above test cases into one file (long spaces are tabs):

    $ cat file
    test    try
    test    "a short exam to measure somebody's knowledge
    or skill in something."
    testing examine
    
    test    try
    test    a short exam to measure somebody's "knowledge
    or skill" in something.
    testing examine
    
    test    try
    test    "a short exam to measure somebody's ""knowledge""
    or skill in something."
    testing examine
    
    test    new first
    test    "a short exam to measure somebody's knowledge
    or skill in something."
    testing examine
    
    test    new second
    test    "a short exam to measure somebody's knowledge
            or skill in something."
    testing examine
    

    and then test that with all of the current answers:

    1. Mine tweaked to use gsub(/[[:space:]]*\n[[:space:]]*/," ",...) instead of gsub(/\n/," ",...):
    $ awk -v RS='"[^"]*"' '{gsub(/[[:space:]]*\n[[:space:]]*/," ",RT); ORS=RT} 1' file
    test    try
    test    "a short exam to measure somebody's knowledge or skill in something."
    testing examine
    
    test    try
    test    a short exam to measure somebody's "knowledge or skill" in something.
    testing examine
    
    test    try
    test    "a short exam to measure somebody's ""knowledge"" or skill in something."
    testing examine
    
    test    new first
    test    "a short exam to measure somebody's knowledge or skill in something."
    testing examine
    
    test    new second
    test    "a short exam to measure somebody's knowledge or skill in something."
    testing examine
    
    2. @potong's sed:
    $ sed ':a;N;/\n[^\t]*$/s/\n//;ta;P;D' file
    test    try
    test    "a short exam to measure somebody's knowledge or skill in something."testing examine
    test    try
    test    a short exam to measure somebody's "knowledgeor skill" in something.testing examine
    test    try
    test    "a short exam to measure somebody's ""knowledge""or skill in something."testing examinetest    new firsttest    "a short exam to measure somebody's knowledgeor skill in something."testing examinetest    new secondtest    "a short exam to measure somebody's knowledge
            or skill in something."testing examine
    
    3. @blhsing's awk:
    
    $ awk -F'\t' '$2~/^"/{ORS=""}/"$/{ORS="\n"}1' file
    test    try
    test    "a short exam to measure somebody's knowledge or skill in something."
    testing examine
    
    test    try
    test    a short exam to measure somebody's "knowledge
    or skill" in something.
    testing examine
    
    test    try
    test    "a short exam to measure somebody's ""knowledge""
    or skill in something."
    testing examine
    
    test    new first
    test    "a short exam to measure somebody's knowledge
    or skill in something."
    testing examine
    
    test    new second
    test    "a short exam to measure somebody's knowledge
            or skill in something."
    testing examine
    

    so you can decide which is producing the behavior you want.
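
If GNU awk isn't available, a different technique (not one of the answers compared above) is to count double quotes and keep appending input lines until the running count is even. The sketch below should run in any POSIX awk. It assumes quotes are always balanced and that embedded quotes are escaped by doubling them, and the function name merge_quoted is made up for illustration.

```shell
# Join physical lines while a double-quoted string is still open.
# Assumption: every opening quote has a matching closing quote.
merge_quoted() {
    awk '
    {
        # gsub() on a copy returns how many double quotes this line has.
        line = $0
        n += gsub(/"/, "", line)
        if (pending) {
            # Continuation line: trim blanks and tabs around the join
            # point and glue the two pieces together with one blank.
            sub(/[ \t]+$/, "", buf)
            sub(/^[ \t]+/, "")
            buf = buf " " $0
        } else {
            buf = $0
        }
        if (n % 2 == 0) {
            # Even count: all quoted strings are closed, so emit the row.
            print buf
            pending = 0
            n = 0
        } else {
            pending = 1
        }
    }
    END { if (pending) print buf }   # unbalanced input: emit what is left
    ' "$@"
}
```

A doubled quote adds 2 to the count, so it never changes whether the count is odd or even, and escaped quotes inside a field do not end the merge early.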