Tags: python, python-3.x, web-scraping, praw

How to filter out specific strings from a string


Python beginner here. I'm stumped on part of this code for a bot I'm writing.

I am making a Reddit bot using PRAW to comb through posts and pull out a specific set of characters (Steam CD keys).

I made a test post here: https://www.reddit.com/r/pythonforengineers/comments/91m4l0/testing_my_reddit_scraping_bot/

This should have all the formats of keys.

Currently, my bot is able to find the post using a regex. I have these variables:

steamKey15 = (r'\w\w\w\w\w.\w\w\w\w\w.\w\w\w\w\w')
steamKey25 = (r'\w\w\w\w\w.\w\w\w\w\w.\w\w\w\w\w.\w\w\w\w\w.\w\w\w\w\w.')
steamKey17 = (r'\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\s\w\w')

I am finding the text using this:

subreddit = reddit.subreddit('pythonforengineers')
for submission in subreddit.new(limit=20):
    if submission.id not in steamKeyPostID:
        if re.search(steamKey15, submission.selftext, re.IGNORECASE):
            searchLogic()
            saveSteamKey()

So this is just to show that the things I should be using in a filter function are a combination of steamKey15/25/17 and submission.selftext.

So here is the part where I am confused. I can't find a function that works or does what I want. My goal is to remove all the text from submission.selftext (the body of the post) EXCEPT the keys, which will eventually be saved in a .txt file.

Any advice on a good way to go about this? I've looked into re.sub and .translate, but I don't understand how the parts fit together.

I am using Python 3.7 if it helps.


Solution

  • Can't you just use the regexp match results?

    m = re.search(steamKey15, submission.selftext, re.IGNORECASE)
    if m:
        print(m.group(0))
    

    Also note that an unescaped dot (.) in a regexp matches any character (except a newline, by default). To match only a literal dot, use \. instead. You can probably write your regexp like this:

    r'\w{5}[-.]\w{5}[-.]\w{5}' 
    

    This will match the key when its groups are separated by either . or -.

    Note that this will also match anything that begins or ends with a key, or has a key in the middle. That can cause problems, because your 15-character key regexp is contained in the 25-character one! To fix that, use a negative lookbehind and a negative lookahead:

    r'(?<![\w.-])\w{5}[-.]\w{5}[-.]\w{5}(?![\w.-])'
    

    That will only find the keys if there are no extra word characters, dots, or hyphens directly before or after them.

    Another hint is to use re.findall instead of re.search, since some posts contain more than one Steam key. findall returns all matches, while search returns only the first one.
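
Putting the hints above together, here is a minimal sketch of how the extraction could look. The names sample_text, KEY_15, KEY_25 and extract_keys are made up for illustration, and sample_text is invented data standing in for submission.selftext. The 25-character pattern is an assumption: it simply repeats the 5-character group from the 15-character pattern, and it drops the trailing dot from the original steamKey25.

```python
import re

# Hypothetical sample text standing in for submission.selftext.
sample_text = (
    "Giving away some keys!\n"
    "AAAAA-BBBBB-CCCCC\n"
    "DDDDD-EEEEE-FFFFF-GGGGG-HHHHH\n"
    "Another one: 11111.22222.33333 enjoy"
)

# The lookbehind/lookahead reject matches that are glued to other
# word characters, dots, or hyphens, so the 15-character pattern
# cannot match a fragment of a 25-character key.
# (?:...) is a non-capturing group, so findall returns whole matches.
KEY_15 = re.compile(r'(?<![\w.-])\w{5}(?:[-.]\w{5}){2}(?![\w.-])')
KEY_25 = re.compile(r'(?<![\w.-])\w{5}(?:[-.]\w{5}){4}(?![\w.-])')


def extract_keys(text):
    """Return every key-like string found in text, 25-char keys first."""
    keys = []
    for pattern in (KEY_25, KEY_15):
        keys.extend(pattern.findall(text))
    return keys


for key in extract_keys(sample_text):
    print(key)
```

In the bot, you would pass submission.selftext to extract_keys instead of sample_text and write the returned list to your .txt file. One limitation to be aware of: because the lookahead rejects a following dot, a key that ends a sentence (with a period right after it) will not be matched by these patterns.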