I've got some JSON that looks like this:
{"name": "John",
"description": "I'm just \"A BOY\" okay? He said \"Hello, World!\" to everyone.",
"remark": "\"This is a test\" he mentioned."}
And the \" instances are breaking json.loads().
import json
json_string = '''{"name": "John",
"description": "I'm just \"A BOY\" okay? He said \"Hello, World!\" to everyone.",
"remark": "\"This is a test\" he mentioned."}'''
data = json.loads(json_string)
print(data)
raises:
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 2 column 27 (char 43)
I feel like I've tried every regex under the sun to target these instances (but leave all the other double quotes, not preceded by a backslash) and replace them with an empty string (functionally just strip them). If anyone has tips I'd appreciate it.
My implementation right now is something like:
import re
# Define a regular expression pattern to match \" within a string
pattern = r'\\"'
# Use re.sub to replace all occurrences of the pattern with an empty string
cleaned_string = re.sub(pattern, '', json_string)
print(cleaned_string)
But when I run this in a REPL, nothing changes.
For reference, I'd just like the output to be:
{"name": "John",
"description": "I'm just A BOY okay? He said Hello, World! to everyone.",
"remark": "This is a test he mentioned."}
Edit: for clarity, this is just an example of the nature of the input data I'm working with. It's coming from AWS CloudWatch logs, so I don't have an easy way to manipulate the input before dragging it into Python. For example, part of the payload is something like:
"\"Girl Let's Talk\" Virtual 90s Kickback"
In context:
{"search_ads": [ {"event_id": "4838383", "ad_id": "1112", "budget_amount": 5.0, "currency": "USD", "marketplace": "Online_US", "score": 18.205433, "p_click": 0.0, "p_order": 0.0, "goal": 2, "category_id": 113, "subcategory_id": 13999, "format": null, "is_paid": false, "online_event": true, "event_start_date": "2024-06-28T00:00:00Z", "latitude": null, "longitude": null, "name": "\"Girl Let's Talk\" Virtual 90s Kickback", "vip_status": false, "is_participant": true}]}
So the \" characters are really the only problem. If I copy all that input into VS Code and just search for and delete that pattern, json.loads() works great as is.
As one commenter mentioned, I think what I'm looking for is a regex that will match and strip the pattern \", but I've had no luck with that so far! I've only been able to either strip the \ characters, which leaves me with double quotes that break json.loads() (expecting delimiter, i.e. it thinks this is another JSON key/value pair), or strip all the double quotes, which of course breaks it completely.
You do not need to remove \". It's part of the data.*
What you're having a problem with is Python's interpretation of string literals. In a regular (non-raw) string literal, the sequence \" is an escape sequence that turns into just ", so the backslash never makes it into the string.
>>> '\"'
'"'
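This also explains why the regex in the question appeared to do nothing. Here is a minimal sketch that compares a plain literal with a raw one; the names plain and raw are made up for illustration, and the pattern is the one from the question.

```python
import re

# The same characters typed in the source, without and with the r prefix.
plain = '{"remark": "\"This is a test\" he mentioned."}'
raw = r'{"remark": "\"This is a test\" he mentioned."}'

# The plain literal has already lost its backslashes when Python parsed it.
print('\\' in plain)  # False
print('\\' in raw)    # True

# The pattern from the question: a backslash followed by a double quote.
pattern = r'\\"'
_, n_plain = re.subn(pattern, '', plain)
_, n_raw = re.subn(pattern, '', raw)
print(n_plain, n_raw)  # 0 2
```

With no backslashes left in the plain literal, the pattern has nothing to match, so re.sub returns the string unchanged.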
This can be solved with a raw string (r prefix).
import json
json_string = r'''
{"name": "John",
"description": "I'm just \"A BOY\" okay? He said \"Hello, World!\" to everyone.",
"remark": "\"This is a test\" he mentioned."}
'''
data = json.loads(json_string)
print(data['description'])
Output:
I'm just "A BOY" okay? He said "Hello, World!" to everyone.
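If the quotes really are unwanted in the final values, one option is to remove them after parsing rather than before. The sketch below assumes the flat structure from the question; the cleaned dictionary is a hypothetical cleanup step, not something json provides.

```python
import json

json_string = r'''
{"name": "John",
"description": "I'm just \"A BOY\" okay? He said \"Hello, World!\" to everyone.",
"remark": "\"This is a test\" he mentioned."}
'''

data = json.loads(json_string)

# Drop double quotes from every top-level string value; leave other types alone.
cleaned = {
    key: value.replace('"', '') if isinstance(value, str) else value
    for key, value in data.items()
}

print(cleaned['description'])
print(cleaned['remark'])
```

This only touches top-level values. A nested payload such as the search_ads example would need a recursive walk over lists and dictionaries.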
However, you might prefer to put the JSON in a separate file and use json.load(), to avoid having to muck around with string literals at all.
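A rough sketch of the file-based approach follows. The file name payload.json and the temporary directory are assumptions made so the example is self-contained; in practice the file would be whatever the logs were exported to.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for the exported log payload (raw literal, so the backslashes survive).
payload = r'''{"name": "\"Girl Let's Talk\" Virtual 90s Kickback"}'''

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name, used only for this illustration.
    path = Path(tmp) / "payload.json"
    path.write_text(payload, encoding="utf-8")

    # json.load() reads from the file object; no string literal is involved here.
    with path.open(encoding="utf-8") as f:
        data = json.load(f)

print(data["name"])
```

Text read from a file is never processed as a Python string literal, so the \" sequences reach the JSON parser intact.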
* To be more precise, it's part of the JSON. In a JSON string, \" represents ", which is the raw data.
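The relationship between the two forms can be checked with a small round trip. This is only an illustration; the sample value is taken from the question.

```python
import json

# The raw data: a Python string containing literal double quotes.
value = 'He said "Hello, World!" to everyone.'

# Encoding to JSON escapes each double quote as \" and wraps the string in quotes.
encoded = json.dumps(value)
print(encoded)

# Decoding restores the original data, quotes included.
decoded = json.loads(encoded)
print(decoded == value)  # True
```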