I am trying to scrape Reddit posts from subreddits where a lot of questions are in the form:
s1 = "I [22M] and my partner (21F) are foo and bar"
s2 = "My (22m) and my partner (21m) are bar and foo"
I want to make a function that can parse each string and then return age and gender pairs. So:
def parse(s1):
    ...
    return [(22, "male"), (21, "female")]
Essentially, each age/gender tag is a two-digit number followed by one of f, F, m, or M.
You could try to extract the matches using this Regex:
(?:[\[\(])(\d{1,2})([MF])(?:[\]\)]) /i
For the Python part of things, I would recommend re's findall method:
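import re

def parse(title):
    # Capture the age and the gender letter from tags like [22M] or (21F).
    return re.findall(r'(?:\[|\()(\d{1,2})([MF])(?:\]|\))', title, re.IGNORECASE)

title = 'I [22M] and my partner (21F) are foo and bar'
matches = parse(title)
print(matches)  # [('22', 'M'), ('21', 'F')]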
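Note that findall returns tuples of strings, such as ('22', 'M'), rather than the (22, "male") pairs asked for in the question. A minimal sketch of the conversion step could look like this; the names TAG, GENDERS and parse_pairs are my own placeholders, and only the re calls come from the standard library:

```python
import re

# Hypothetical names for illustration; the pattern is the same idea as above,
# written with character classes for the opening and closing brackets.
TAG = re.compile(r'[\[(](\d{1,2})([MF])[\])]', re.IGNORECASE)
GENDERS = {'m': 'male', 'f': 'female'}

def parse_pairs(title):
    """Return (age, gender) tuples such as (22, 'male')."""
    # findall yields (age_string, letter) tuples, one per tag found.
    return [(int(age), GENDERS[letter.lower()])
            for age, letter in TAG.findall(title)]

print(parse_pairs('I [22M] and my partner (21F) are foo and bar'))
# [(22, 'male'), (21, 'female')]
print(parse_pairs('My (22m) and my partner (21m) are bar and foo'))
# [(22, 'male'), (21, 'male')]
```

Lowercasing the letter before the dictionary lookup keeps the mapping small even though the pattern matches case-insensitively.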
EDIT:
You could try to modify your Regex to this, in order to fit the new requirement you mentioned in your comment:
(?:[\[\(])(\d{1,2})\s?([MF]|male|female)(?:[\]\)]) /i
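In Python, the extended pattern could be used roughly as follows. This is only a sketch, assuming the new requirement is tags with an optional space and a spelled-out gender, such as (22 male); the names TAG_EXT and parse_ext are hypothetical:

```python
import re

# Assumed tag shapes: [22M], (21f), (22 male), [21 Female].
# The full words are listed before the single letters for readability;
# either order works because the closing bracket must follow.
TAG_EXT = re.compile(r'[\[(](\d{1,2})\s?(male|female|[MF])[\])]', re.IGNORECASE)

def parse_ext(title):
    """Return (age, gender) tuples, normalising gender to 'male'/'female'."""
    pairs = []
    for age, gender in TAG_EXT.findall(title):
        # The first letter is enough to tell the two genders apart.
        normalised = 'male' if gender[0].lower() == 'm' else 'female'
        pairs.append((int(age), normalised))
    return pairs

print(parse_ext('I (22 male) and my partner [21F] are foo and bar'))
# [(22, 'male'), (21, 'female')]
```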