json regex csv apache-nifi

Regex: Remove Commas within quotes


I'm using NiFi and I have a series of JSONs that look like this:

{
  "url": "RETURNED URL",
  "repository_url": "RETURNED URL",
  "labels_url": "RETURNED URL",
  "comments_url": "RETURNED URL",
  "events_url": "RETURNED URL",
  "html_url": "RETURNED URL",
  "id": "RETURNED_ID",
  "node_id": "RETURNED id",
  "number": 10,
    ...
  "author_association": "xxxx",
  "active_lock_reason": null,
  "body": "text text text, text text, text text text, text, text text",
  "performed_via_github_app": null
}

My focus is on the "body" attribute. Because I'm merging these into one large JSON to convert to CSV, I need the commas inside the "body" text removed (which should also help with possible NLP work later). I know I can use the ReplaceText processor for this, but writing a regex that captures only those commas is the part I'm struggling with. So far I have the following:

((?<="body"\s:\s").*(?=",))

None of the guides I've looked at match only the commas within the quotes. Any suggestions?


Solution

  • You can use

    (\G(?!^)|\"body\"\s*:\s*\")([^\",]*),
    

    If the string can contain escape sequences (such as \"), use

    (\G(?!^)|\"body\"\s*:\s*\")([^\",\\]*(?:\\.[^\",\\]*)*),
    

    See the regex demo (and regex demo #2). Replace with $1$2, which keeps both groups and drops the matched comma.

    Details:

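    • (\G(?!^)|\"body\"\s*:\s*\") - Group 1 ($1): either the end of the previous match (\G, with (?!^) excluding the start of the string), or "body", zero or more whitespaces, :, zero or more whitespaces, and the opening "
    • ([^\",]*) - Group 2 ($2): any zero or more chars other than " and ,
    • , - a comma (to be removed/replaced). Since it sits outside both groups, replacing with $1$2 deletes it.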
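NiFi's ReplaceText uses Java regular expressions, so the pattern can be checked outside NiFi with plain java.util.regex. The following is a minimal sketch, not a NiFi configuration: the class name BodyCommaStripper, the method stripBodyCommas, and the sample JSON are made up for illustration. It applies the escape-aware pattern above with the $1$2 replacement.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of NiFi.
class BodyCommaStripper {

    // The escape-aware pattern from the answer, written as a Java string literal.
    // Group 1: end of the previous match (\G, but not at the start of the input),
    //          or "body", optional whitespace, a colon, optional whitespace, and the opening quote.
    // Group 2: a run of escape sequences and characters other than quote, comma, and backslash.
    // The trailing comma is outside both groups, so "$1$2" drops it.
    private static final Pattern BODY_COMMA = Pattern.compile(
        "(\\G(?!^)|\"body\"\\s*:\\s*\")([^\",\\\\]*(?:\\\\.[^\",\\\\]*)*),");

    static String stripBodyCommas(String json) {
        return BODY_COMMA.matcher(json).replaceAll("$1$2");
    }

    public static void main(String[] args) {
        // Sample input (assumed): commas in "id" and "tail" must survive.
        String json = "{\"id\": \"a,b\", \"body\": \"x, y, z\", \"tail\": \"p,q\"}";
        System.out.println(stripBodyCommas(json));
        // prints {"id": "a,b", "body": "x y z", "tail": "p,q"}
    }
}
```

Because \G only matches where the previous match ended, the chain of matches stops at the closing quote of "body", and commas in other fields or between fields are left alone.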