How to extract age and gender from Reddit post titles?

I am trying to scrape posts from subreddits where many titles take the form:

s1 = "I [22M] and my partner (21F) are foo and bar"

s2 = "My (22m) and my partner (21m) are bar and foo"

I want to write a function that parses each string and returns the age/gender pairs, for example:

def parse(s1):
    ...
    return [(22, "male"), (21, "female")]

Essentially, each age/gender tag is a two-digit number followed by one of f, F, m, or M, enclosed in square brackets or parentheses.

Solution

You could extract the matches with this regex (the trailing /i marks it as case-insensitive):

(?:[\[\(])(\d{1,2})([MF])(?:[\]\)]) /i
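To make the pieces of that pattern easier to follow, here is a minimal sketch of the same regex written with Python's re.VERBOSE flag, where the /i suffix becomes re.IGNORECASE. The name TAG_RE is only illustrative and not part of the original answer.

```python
import re

# Annotated equivalent of the pattern above (TAG_RE is a hypothetical name).
TAG_RE = re.compile(r"""
    [\[(]        # opening delimiter: square bracket or parenthesis
    (\d{1,2})    # group 1: the age, one or two digits
    ([MF])       # group 2: the gender letter (any case, thanks to IGNORECASE)
    [\])]        # closing delimiter: square bracket or parenthesis
""", re.VERBOSE | re.IGNORECASE)

print(TAG_RE.findall('I [22M] and my partner (21F) are foo and bar'))
```

Note that the opening and closing delimiters are matched independently, so a mixed tag such as [22M) would also match.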


On the Python side, I would recommend re.findall, which returns one (age, gender) tuple of strings per match:

import re

def parse(title):
    return re.findall(r'(?:\[|\()(\d{1,2})([MF])(?:\]|\))', title, re.IGNORECASE)

title = 'I [22M] and my partner (21F) are foo and bar'
matches = parse(title)

print(matches)  # [('22', 'M'), ('21', 'F')]
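findall returns the captured groups as strings, so getting the (22, "male") shape asked for in the question needs one more step. A minimal sketch, assuming a small lookup table; the names GENDERS and parse_pairs are hypothetical and not part of the original answer:

```python
import re

# Same pattern as in parse() above.
TAG_RE = re.compile(r'(?:\[|\()(\d{1,2})([MF])(?:\]|\))', re.IGNORECASE)

# Hypothetical lookup table mapping the matched letter to a full word.
GENDERS = {'m': 'male', 'f': 'female'}

def parse_pairs(title):
    """Return a list of (age, gender) pairs, e.g. (22, 'male')."""
    return [(int(age), GENDERS[letter.lower()])
            for age, letter in TAG_RE.findall(title)]

print(parse_pairs('I [22M] and my partner (21F) are foo and bar'))
print(parse_pairs('My (22m) and my partner (21m) are bar and foo'))
```

Lowercasing the letter before the lookup keeps the table to two entries while still handling both 22M and 22m.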


EDIT:

To fit the new requirement you mentioned in your comment, you could modify the regex as follows; it additionally accepts an optional space after the number and the full words male or female:

(?:[\[\(])(\d{1,2})\s?([MF]|male|female)(?:[\]\)]) /i
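In Python, the extended pattern could be used along these lines. This is only a sketch: the names EXTENDED_RE and parse_extended are hypothetical, and normalising on the first letter of the match is an assumption about the output you want.

```python
import re

# Extended pattern: optional space, then a letter or the full word.
EXTENDED_RE = re.compile(
    r'(?:\[|\()(\d{1,2})\s?([MF]|male|female)(?:\]|\))',
    re.IGNORECASE,
)

def parse_extended(title):
    """Return (age, gender) pairs, accepting both 'M'/'F' and 'male'/'female'."""
    pairs = []
    for age, gender in EXTENDED_RE.findall(title):
        # Every accepted token starts with m or f, so the first letter decides.
        word = 'male' if gender[0].lower() == 'm' else 'female'
        pairs.append((int(age), word))
    return pairs

print(parse_extended('I (22 male) and my partner [21F] are foo and bar'))
```

Although [MF] is tried first, the regex engine backtracks to the male or female alternative when the single letter is not followed by a closing bracket, so the full words still match.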
