json python-3.x regex large-data

JSON: Regex to identify a pattern in JSON


I'm new to Python 3 and am working with large JSON data. The data has extra characters between consecutive JSON objects, that is, between one object's closing brace and the next object's opening brace.

For example:

{"id":"121324343", "name":"foobar"}3$£_$£rvcfddkgga£($(>..bu&^783 { "id":"343554353", "name":"ABCXYZ"}'

These extra characters can be anything (alphanumerics, special characters, or other ASCII). They appear many times in the data and can be of any length. I'm trying to use a regex to find and remove them, but my regex doesn't match. Here is the regex I used:

(^}\n[a-zA-Z0-9]+{$)

Is there a way to identify such a pattern using regex in Python?


Solution

  • Instead of matching the extra characters, you can extract each record's fields with named capture groups. As a bonus, this also ignores any { or } within the extra characters.

    The following pattern, written as a Python raw string so that the backslashes reach the regex engine unchanged, works on the provided data:

    r'"id":"(?P<id>\d+?)"[,\s]+"name":"(?P<name>[ \w]+)"'
    
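
To make the pieces of the pattern easier to read, here is a minimal sketch of the same expression written with re.VERBOSE, so each part can carry a comment. The name PATTERN and the sample input are illustrative only and not part of the original answer.

```python
import re

# The same pattern, split into commented parts. In verbose mode, whitespace
# outside character classes is ignored and "#" starts a comment.
PATTERN = re.compile(
    r'''
    "id":"(?P<id>\d+?)"          # id key; group "id" captures the digits
    [,\s]+                       # separator: commas and/or whitespace
    "name":"(?P<name>[ \w]+)"    # name key; group "name" captures word chars and spaces
    ''',
    re.VERBOSE,
)

match = PATTERN.search('junk{"id":"42", "name":"foo bar"}junk')
print(match.groupdict())  # {'id': '42', 'name': 'foo bar'}
```

Note that the pattern, as written, does not allow whitespace around the colons, which matches the sample data in the question.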

    Example

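    import re
    from pprint import pprint
    
    # Sample input: two records separated by junk (including a stray "{}").
    string = """
    {"id":"121324343", "name":"foobar"}3$£_$£rvcfdd{}kgga£($(>..bu&^783 { "id":"343554353", "name":"ABC XYZ"}'
    """
    
    # Raw string: \d, \s and \w are passed to the regex engine as-is,
    # which avoids Python's invalid-escape-sequence warnings.
    pattern = re.compile(r'"id":"(?P<id>\d+?)"[,\s]+"name":"(?P<name>[ \w]+)"')
    
    # Each match yields a dict built from the named groups.
    pprint([match.groupdict() for match in pattern.finditer(string)])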
    
    • Output
    [{'id': '121324343', 'name': 'foobar'}, {'id': '343554353', 'name': 'ABC XYZ'}]
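
Since the question asks for the data without the extra characters, the extracted dictionaries can be serialized back into valid JSON. The following is a minimal sketch under the same assumptions as the pattern above; the helper name clean_records and the shortened sample input are hypothetical.

```python
import json
import re

PATTERN = re.compile(r'"id":"(?P<id>\d+?)"[,\s]+"name":"(?P<name>[ \w]+)"')

def clean_records(text):
    """Return the id/name pairs found in text as a list of dicts."""
    return [m.groupdict() for m in PATTERN.finditer(text)]

# Shortened stand-in for the real data: two records with junk in between.
raw = '{"id":"1", "name":"foo"}3$_x{}y { "id":"2", "name":"A B"}'

# Re-serialize the extracted records as one valid JSON array.
cleaned = json.dumps(clean_records(raw))
print(cleaned)  # [{"id": "1", "name": "foo"}, {"id": "2", "name": "A B"}]
```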
    

    Notes

    For this example I assume the following:

    • id contains only digits.
    • name contains only word characters and spaces, i.e. [a-zA-Z0-9_ ]. Note that for str patterns in Python 3, \w also matches non-ASCII letters and digits unless the re.ASCII flag is set.
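
If these assumptions about id and name do not hold for the real data, a different technique may be worth trying: instead of a regex, let the standard library's json.JSONDecoder.raw_decode parse an object starting at each opening brace and skip whatever fails to parse. This is only a sketch of that alternative, not the approach of the answer above; the function name iter_json_objects is hypothetical.

```python
import json

def iter_json_objects(text):
    """Yield every JSON object that can be decoded starting at a '{' in text."""
    decoder = json.JSONDecoder()
    pos = text.find("{")
    while pos != -1:
        try:
            # raw_decode returns the decoded value and the index where it ended.
            obj, end = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            # This brace does not start valid JSON; try the next one.
            pos = text.find("{", pos + 1)
            continue
        yield obj
        pos = text.find("{", end)

sample = (
    '{"id":"121324343", "name":"foobar"}3$£_$£rvcfdd{}kgga£($(>..bu&^783 '
    '{ "id":"343554353", "name":"ABC XYZ"}\''
)

# Junk that happens to be valid JSON (the stray "{}") is decoded too,
# so keep only the objects that carry an "id" key.
records = [obj for obj in iter_json_objects(sample) if "id" in obj]
print(records)
```

The trade-off is that this version makes no assumption about which keys or characters the records contain, but it needs a filter like the one shown because junk such as {} is itself valid JSON.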