linux awk command-line grep cut

How to extract specific key value pairs from a grep output


I ran grep in a folder and got the output below:

./Data1/TEST_Data1.xml:<def-query collection="FT_R1Event" count="-1" desc="" durationEnd="1" durationStart="0" durationType="CAL" fromWS="Data1" id="_q1" timeUnit="D">

./Data2/TEST_Data2.xml:<def-query collection="FT_R2Event" count="-1" desc="" durationEnd="2" durationStart="0" durationType="ABS" fromWS="Data2" id="_q1" timeUnit="M">

From each line I want to extract the following values, separated by a delimiter such as ',':

Data1/TEST_Data1, durationEnd="1", timeUnit="D"

Data2/TEST_Data2, durationEnd="2", timeUnit="M"

Please help me achieve this using basic Linux commands.


Solution

  • I would do it with GNU AWK in the following way. Let the content of file.txt be

    ./Data1/TEST_Data1.xml:<def-query collection="FT_R1Event" count="-1" desc="" durationEnd="1" durationStart="0" durationType="CAL" fromWS="Data1" id="_q1" timeUnit="D">
    
    ./Data2/TEST_Data2.xml:<def-query collection="FT_R2Event" count="-1" desc="" durationEnd="2" durationStart="0" durationType="ABS" fromWS="Data2" id="_q1" timeUnit="M">
    

    then

    awk 'BEGIN{OFS=", ";FPAT="(^[^ ]+xml)|((durationEnd|timeUnit)=\"[^\"]+\")"}{gsub(/\.([/]|xml)/, "", $1);print}' file.txt
    

    output

    Data1/TEST_Data1, durationEnd="1", timeUnit="D"
    
    Data2/TEST_Data2, durationEnd="2", timeUnit="M"
    

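    Explanation: I used FPAT to describe the fields I am interested in, rather than the separators between them. A field is either a run of non-space characters at the start of the line that ends in xml, or durationEnd or timeUnit followed by =, a double quote, one or more non-quote characters and a closing double quote. Then I remove from the first field every . that is followed by / or xml (the . has to be a literal dot, so it is escaped). Finally I print the record. Because $1 was modified, the fields are rejoined with a comma and a space, which I set as the output field separator (OFS).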

    Disclaimer: I tested it only with shown samples.

    (tested in gawk 4.2.1)
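
FPAT is specific to gawk. As a minimal sketch for other awk implementations, assuming every relevant line has the same shape as the two samples above, a similar result can be obtained with the POSIX functions match() and substr(). The names extract_pairs and pair are hypothetical helpers introduced here for illustration. They are not part of the answer above.

```shell
#!/bin/sh
# Sketch only: extract_pairs and pair are illustrative names, not from the original answer.
# Assumes each relevant line looks like: ./<path>.xml:<tag ... name="value" ...>
extract_pairs() {
  awk '
    # Return the first name="value" pair found in line, or "" if there is none.
    function pair(line, name) {
      if (match(line, name "=\"[^\"]*\"")) return substr(line, RSTART, RLENGTH)
      return ""
    }
    # Only handle lines that carry a grep file prefix ending in ".xml:".
    /\.xml:/ {
      path = $0
      sub(/\.xml:.*/, "", path)   # cut ".xml:" and everything after it
      sub(/^\.\//, "", path)      # cut the leading "./"
      print path ", " pair($0, "durationEnd") ", " pair($0, "timeUnit")
    }
  ' "$@"
}
```

Calling extract_pairs file.txt reads the file. With no argument the function reads standard input, so grep output can be piped straight into it. Unlike the FPAT version, this sketch skips lines that do not contain .xml: (blank lines included) and also accepts an empty value such as desc="".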