I would like to parse an IRC message from Twitch into a list of dictionaries, accounting for emotes.
Here is a sample of what I can get from Twitch:
"Testing. :) Confirmed!"
{"emotes": [(1, (9, 10))]}
This says that the emote with ID 1 spans characters 9 through 10 (zero-indexed, both ends inclusive).
I would like to have my data in the following format:
[
    {
        "type": "text",
        "text": "Testing. "
    },
    {
        "type": "emote",
        "text": ":)",
        "id": 1
    },
    {
        "type": "text",
        "text": " Confirmed!"
    }
]
Is there a relatively clean way to accomplish this?
I'm not sure if your incoming message looks like this:
message = '''\
"Testing. :) Confirmed!"
{"emotes": [(1, (9, 10))]}'''
or like this:
text = "Testing. :) Confirmed!"
meta = '{"emotes": [(1, (9, 10))]}'
I'm going to assume it's the latter, because it's easy to convert from the former to the latter. It could also be that those are the Python representations of the data; the question doesn't make that clear.
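For completeness, here is roughly how the first form could be turned into the second. This is only a sketch, and it assumes the text really does arrive wrapped in double quotes on its own line, followed by the metadata; `text_line` is just a name I picked.

```python
import ast

# Assumed combined form: quoted text on line one, metadata on line two.
message = '''\
"Testing. :) Confirmed!"
{"emotes": [(1, (9, 10))]}'''

# Split on the first newline only, in case the metadata spans several lines.
text_line, meta = message.split('\n', 1)

# literal_eval turns the quoted line back into a plain str.
text = ast.literal_eval(text_line)

print(text)  # Testing. :) Confirmed!
print(meta)  # {"emotes": [(1, (9, 10))]}
```

If the metadata is in fact a Python literal, `ast.literal_eval(meta)` would also parse it directly, tuples included.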
Either way, you don't need regexes for this; plain string slicing is a much cleaner approach:
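import json
import pprint

text = 'Testing. :) Confirmed! :P'
# Emote ranges are inclusive and zero-indexed: ':)' is 9-10, ':P' is 23-24.
meta = '{"emotes": [(1, (9, 10)), (2, (23, 24))]}'
# JSON has no tuples, so turn the parentheses into brackets before parsing.
meta = json.loads(meta.replace('(', '[').replace(')', ']'))

results = []
cur_index = 0
for emote_id, (start, end) in meta['emotes']:
    # Plain text between the previous emote (or the start) and this one.
    if start > cur_index:
        results.append({'type': 'text', 'text': text[cur_index:start]})
    # The end position is inclusive, hence the + 1.
    results.append({'type': 'emote', 'text': text[start:end + 1],
                    'id': emote_id})
    cur_index = end + 1
# Whatever is left after the last emote.
if text[cur_index:]:
    results.append({'type': 'text', 'text': text[cur_index:]})

pprint.pprint(results)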
From your comment, the data comes in a custom format. I copied a couple of characters from the comment that I'm not sure actually appear in the incoming data, so I hope I got that part right. The message also contained only one type of emote, so I'm not entirely sure how multiple emote types are denoted. I'm hoping there's a separator (the code below assumes '/') rather than multiple emotes= sections; otherwise this approach needs some heavy modification. Either way, this should handle the parsing without any need for regex.
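from collections import namedtuple

Emote = namedtuple('Emote', ('id', 'start', 'end'))

def parse_emotes(raw):
    """Parse an emotes tag such as '25:6-10,12-16' into Emote tuples."""
    emotes = []
    if not raw:  # an empty tag means the message has no emotes
        return emotes
    # Different emotes are assumed to be separated by '/'.
    for raw_emote in raw.split('/'):
        emote_id, locations = raw_emote.split(':')
        emote_id = int(emote_id)
        # Each location is an inclusive 'start-end' pair.
        for location in locations.split(','):
            start, end = location.split('-')
            emotes.append(Emote(id=emote_id, start=int(start), end=int(end)))
    return emotes
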
data = r'@badges=moderator/1;color=#0000FF;display-name=2Cubed;emotes=25:6-10,12-16;id=05aada01-f8c1-4b2e-a5be-2534096057b9;mod=1;room-id=82607708;subscriber=0;turbo=0;user-id=54561464;user-type=mod:2cubed!2cubed@2cubed.tmi.twitch.tv PRIVMSG #innectic :Hiya! Kappa Kappa'
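# Split off the tags, the message type and the channel; the rest is the message.
meta, msgtype, channel, message = data.split(' ', maxsplit=3)
# Drop the leading '@', then split each tag on its first '=' only.
meta = dict(tag.split('=', 1) for tag in meta.lstrip('@').split(';'))
meta['emotes'] = parse_emotes(meta['emotes'])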
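To get from those Emote tuples to the list of dictionaries asked for in the question, something along these lines might work. It's a sketch rather than a definitive implementation: `build_segments` is a name I made up, and I'm assuming the positions are relative to the message body with its leading ':' stripped, which matches the sample above (Kappa at 6-10 and 12-16).

```python
from collections import namedtuple

# Same shape as the Emote tuple above; redefined so this sketch runs on its own.
Emote = namedtuple('Emote', ('id', 'start', 'end'))


def build_segments(body, emotes):
    """Turn a message body plus Emote tuples into text/emote dictionaries."""
    segments = []
    cursor = 0
    # Sort by start position in case the ranges arrive out of order.
    for emote in sorted(emotes, key=lambda e: e.start):
        if emote.start > cursor:  # skip empty text between adjacent emotes
            segments.append({'type': 'text', 'text': body[cursor:emote.start]})
        # The end position is inclusive, hence the + 1.
        segments.append({'type': 'emote',
                         'text': body[emote.start:emote.end + 1],
                         'id': emote.id})
        cursor = emote.end + 1
    if cursor < len(body):  # trailing text after the last emote
        segments.append({'type': 'text', 'text': body[cursor:]})
    return segments


# Assumption: positions are relative to the body without its leading ':'.
raw_message = ':Hiya! Kappa Kappa'
body = raw_message[1:]
emotes = [Emote(25, 6, 10), Emote(25, 12, 16)]
segments = build_segments(body, emotes)
for segment in segments:
    print(segment)
```

With the parsing code above you would pass `message[1:]` and `meta['emotes']` instead of the hard-coded values.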