regex python-3.x nlp reddit

How to extract age and gender from Reddit post titles?


I am trying to scrape posts from subreddits where many of the titles take this form:

s1 = "I [22M] and my partner (21F) are foo and bar"

s2 = "My (22m) and my partner (21m) are bar and foo"

I want to make a function that can parse each string and then return age and gender pairs. So:

def parse(s1):
    ...
    return [(22, "male"), (21, "female")]

Essentially, each age/gender tag is a two-digit number followed by one of f, F, m, or M, enclosed in square brackets or parentheses.


Solution

  • You could extract the matches with this regex (the trailing /i is the case-insensitive flag):

    (?:[\[\(])(\d{1,2})([MF])(?:[\]\)]) /i
    


    On the Python side, I would recommend re.findall, which returns one (age, gender) tuple of strings per match:

    import re
    
    def parse(title):
        # Capture the age (1-2 digits) and the M/F letter inside [...] or (...)
        return re.findall(r'(?:\[|\()(\d{1,2})([MF])(?:\]|\))', title, re.IGNORECASE)
    
    title = 'I [22M] and my partner (21F) are foo and bar'
    matches = parse(title)
    
    print(matches)  # [('22', 'M'), ('21', 'F')]
    


    EDIT:

    To cover the new requirement from your comment, you could change the regex to this, which also allows an optional space after the age and the words male or female:

    (?:[\[\(])(\d{1,2})\s?([MF]|male|female)(?:[\]\)]) /i
    

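The findall call shown earlier returns string tuples such as ('22', 'M'), while the question asks for pairs like (22, "male"). Here is a minimal sketch of one way to combine the extended regex with that conversion. The name TAG_PATTERN and the first-letter check are my own assumptions for illustration, not something from the question.

```python
import re

# Extended pattern from the EDIT above: optional whitespace after the age,
# then either a single letter (M/F) or the full word (male/female).
# TAG_PATTERN is a hypothetical name chosen for this sketch.
TAG_PATTERN = re.compile(
    r'(?:[\[\(])(\d{1,2})\s?([MF]|male|female)(?:[\]\)])',
    re.IGNORECASE,
)

def parse(title):
    """Return (age, gender) pairs such as (22, "male") found in a title."""
    pairs = []
    for age, gender in TAG_PATTERN.findall(title):
        # Every alternative starts with m or f, so the first letter decides.
        label = 'male' if gender[0].lower() == 'm' else 'female'
        pairs.append((int(age), label))
    return pairs

print(parse('I [22M] and my partner (21F) are foo and bar'))
# [(22, 'male'), (21, 'female')]
print(parse('My (22m) and my partner (21 female) are bar and foo'))
# [(22, 'male'), (21, 'female')]
```

Note that the pattern does not check that the opening and closing brackets are of the same kind, so a tag like [22M) would also match. If that matters for your data, you would need two alternatives, one per bracket type.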