Search code examples
pythonregexregular-language

use regular expression to replace extra quote inside json


I got an unexpected quote in my json string that make json.loads(jstr) fails.

json_str = '''{"id":"9","ctime":"2018-02-13","content":"abcd: "efg.","hots":"103b","date_sms":"2017-11-22"}'''

So I'd like to use the regular expression to match and delete the quote inside the value of "content". I tried something in other solution:

import re
json_str = '''{"id":"9","ctime":"2018-02-13","content":"abcd: "efg.","hots":"103b","date_sms":"2017-11-22"}'''
pa = re.compile(r'(:\s+"[^"]*)"(?=[^"]*",)')
pa.findall(json_str)

[out]: []

Is there any way to fix the string?


Solution

  • As noted by @jonrsharpe, you'd be far better off cleaning the source.
    That said, if you do not have control over where the extra quote is coming from, you could use (*SKIP)(*FAIL) using the newer regex module and neg. lookarounds like so:

    "[^"]+":\s*"[^"]+"[,}]\s*(*SKIP)(*FAIL)|(?<![,:])"(?![:,]\s*["}])
    

    See a demo on regex101.com.


    In Python:

    import json, regex as re
    
    json_str = '''{"id":"9","ctime":"2018-02-13","content":"abcd: "efg.","hots":"103b","date_sms":"2017-11-22"}'''
    
    # clean the json
    rx = re.compile('''"[^"]+":\s*"[^"]+"[,}]\s*(*SKIP)(*FAIL)|(?<![,:])"(?![:,]\s*["}])''')
    json_str = rx.sub('', json_str)
    
    # load it
    
    json = json.loads(json_str)
    print(json['id'])
    # 9