Tags: json, bash, export-to-csv, jq, gsub

JQ replace "\n" from inside an encoded field


I have tried a lot of solutions by now, but I am afraid my understanding of jq is lacking; I only started using it two days ago.

I have a pretty nice solution to convert my files from JSON to CSV, except for one little pitfall.

The JSON contains one field that is base64-encoded (.data), which is itself an encoded JSON document. Inside this sub-JSON there is a text field (.text) that contains "\n", and when converting, the row gets corrupted because the "\n" is translated into actual line breaks in the final file.
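
For illustration, this is a minimal sketch (using the same binary and file names as the command below) of how the nested value can be inspected; it prints the decoded .text with its embedded newlines, which is exactly what later breaks the CSV rows:

jq-linux64 -r '.receivedMessages[].message.data | @base64d | fromjson | .text' <inputfile.txt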

This is the command I have now:

jq-linux64 -r '["ackId","data","senderPhoneNumber","eventType","eventId","messageId2","postbackData","text","sendTime","project_number","type","event_type","product","messageId","publishTime"], (.receivedMessages[] | [.ackId, .message.data, (.message.data | @base64d | fromjson | .senderPhoneNumber, .eventType, .eventId, .messageId, .postbackData, .text, .sendTime), .message.attributes.project_number, .message.attributes.type, .message.attributes.event_type, .message.attributes.product, .message.messageId,.message.publishTime]) | @csv' <inputfile.txt >outputfile.txt

In this command I decode the field, use fromjson, and then grab the fields I need from inside it.
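
For readability, here is the same filter laid out across several lines (whitespace only; the behaviour is unchanged):

jq-linux64 -r '
  ["ackId","data","senderPhoneNumber","eventType","eventId","messageId2","postbackData","text","sendTime","project_number","type","event_type","product","messageId","publishTime"],
  (.receivedMessages[]
   | [ .ackId,
       .message.data,
       (.message.data | @base64d | fromjson
        | .senderPhoneNumber, .eventType, .eventId, .messageId, .postbackData, .text, .sendTime),
       .message.attributes.project_number,
       .message.attributes.type,
       .message.attributes.event_type,
       .message.attributes.product,
       .message.messageId,
       .message.publishTime ])
  | @csv' <inputfile.txt >outputfile.txt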

The field ".text" is the one that has the linebreaks I wish removed/replaced.

I tried sub and gsub, but was unable to create a valid command line or one that yields the desired output.

Any tips on how to replace "\n" inside the field ".text"?

Sample Input File:

{
  "receivedMessages": [
    {
      "ackId": "xxxxxx",
      "message": {
        "data": "eyJzZW5kZXJQaG9uZU51bWJlciI6ICIrOTk5OTk5OTk5OTk5OSIsIm1lc3NhZ2VJZCI6ICIyM2QyM2QyM2QzMmQiLCJzZW5kVGltZSI6ICIyMDIxLTAyLTAyVDIwOjAwOjAwLjAwMDAxIiwidGV4dCI6ICJYWFhYWCBYWFhYWFhYRSwgWFhYWFhYWC8gWHh4eHh4LlxuXG5YeHh4eHh4eHg6XG5cbjEgeHh4eCB4eCB4eHh4ICgyMDAgeHgpXG4yLDUgeHh4eHhcbjEgeHh4eCBcblxuWHh4eHh4IHh4eHh4eCJ9Cg==",
        "attributes": {
          "product": "XXXX",
          "project_number": "XXXXXX",
          "message_type": "TEXT",
          "type": "message"
        },
        "messageId": "234234234234234234",
        "publishTime": "2021-02-02T20:15:22.888Z"
      }
    }
  ]
}

When I process this file with the above command, it gives:

"ackId","data","senderPhoneNumber","eventType","eventId","messageId2","postbackData","text","sendTime","project_number","type","event_type","product","messageId","publishTime"
"xxxxxx","eyJzZW5kZXJQaG9uZU51bWJlciI6ICIrOTk5OTk5OTk5OTk5OSIsIm1lc3NhZ2VJZCI6ICIyM2QyM2QyM2QzMmQiLCJzZW5kVGltZSI6ICIyMDIxLTAyLTAyVDIwOjAwOjAwLjAwMDAxIiwidGV4dCI6ICJYWFhYWCBYWFhYWFhYRSwgWFhYWFhYWC8gWHh4eHh4LlxuXG5YeHh4eHh4eHg6XG5cbjEgeHh4eCB4eCB4eHh4ICgyMDAgeHgpXG4yLDUgeHh4eHhcbjEgeHh4eCBcblxuWHh4eHh4IHh4eHh4eCJ9Cg==","+9999999999999",,,"23d23d23d32d",,"XXXXX XXXXXXXE, XXXXXXX/ Xxxxxx.

Xxxxxxxxx:

1 xxxx xx xxxx (200 xx)
2,5 xxxxx
1 xxxx

Xxxxxx xxxxxx","2021-02-02T20:00:00.00001","XXXXXX","message",,"XXXX","234234234234234234","2021-02-02T20:15:22.888Z"

The field ".text" generates line breaks from the encoded "\n", that make 1 row become multiple rows, thus corrupting this row for latter processing.

--- Full answer code, as answered by @peak:

jq-linux64 -r '["ackId","data","senderPhoneNumber","eventType","eventId","messageId2","postbackData","text","sendTime","project_number","type","event_type","product","messageId","publishTime"], (.receivedMessages[] | [.ackId, .message.data, (.message.data|@base64d|fromjson|.senderPhoneNumber,.eventType,.eventId,.messageId,.postbackData,(.text | tostring | gsub("\n"; "|")),.sendTime),.message.attributes.project_number,.message.attributes.type,.message.attributes.event_type,.message.attributes.product,.message.messageId,.message.publishTime]) | @csv' <input.json >output.csv

Solution

  • Using the -r option with the @csv filter is intended to (and indeed in this case does) produce valid CSV according to the various CSV standards which allow embedded newlines within quoted string values.

    Thus the problem you seem to be facing is that the program which you are using to ingest the CSV does not accept embedded newlines within double-quoted fields.

    The simplest workaround may therefore be to "encode" the newline characters in a way that will be acceptable to both you and the other program.

    For example, you could encode the newlines as pipe ("|") symbols using the jq filter:

    gsub("\n"; "|")
    
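    As a quick sanity check (a sketch; -n runs the filter without reading any input, so the test string stands in for a decoded .text value):

    jq-linux64 -n '"a\nb" | gsub("\n"; "|")'

    which prints "a|b".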

    Assuming "text" is a string-valued field which might contain embedded newlines, you would replace .text with

    (.text | gsub("\n"; "|"))
    

    If the value of .text might be null, then you will have to decide how you want to handle nulls. (One possibility would be: .text | tostring | gsub("\n";"|").)
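
    An alternative sketch for the null case, if you would rather emit an empty string than the literal text "null" that tostring produces:

    (.text // "" | gsub("\n"; "|"))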

    If you want the transformation to apply to all of the string fields, then replace | @csv by something like:

    | map(if type == "string" then gsub("\n"; "|") else . end)
    | @csv
    
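    A hedged sketch of how that map step might slot into the tail of the original pipeline (illustrative only; the column list is shortened here for brevity):

    jq-linux64 -r '
      .receivedMessages[]
      | [ .ackId,
          (.message.data | @base64d | fromjson | .text),
          .message.messageId ]
      | map(if type == "string" then gsub("\n"; "|") else . end)
      | @csv' <input.json >output.csv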

    As RFC 4180 (https://www.rfc-editor.org/rfc/rfc4180) puts it:

    Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes.