
How do I check if a Reddit post contains only an image and nothing else?


Background: I'm currently making a Reddit bot using the praw library with Python 3.7. One of the things my bot needs to do is check the latest posts on some subreddit to see if they contain just an image and nothing else.

Given that there are different types of posts on Reddit (posts that are just an uploaded image and normal text posts with an image in them), I first decided to differentiate between these two possibilities. As far as I'm aware, praw doesn't provide any functionality to get the type of Reddit post.

To handle posts which are just images and nothing else, I just check the URL of the returned praw submission with a specific regex:

^http(s)?://i\.redd\.it/\w+\.(png|gif|jpg|jpeg)$

If the URL matches, I just download the image. This works. For text posts that happen to contain just an image, on the other hand, I check the selftext property, which looks something like this:

&#x200B;\n\nhttps://i.redd.it/xxxxxxxxxx.png

Using the regex above (with the beginning and end markers removed), I can extract the URL and make sure only one is there through re.findall. However, how can I make sure that there is absolutely no text at all in the post (except whitespace and that weird escape sequence &#x200B;, whose purpose I don't understand)?


Solution

  • As far as I'm aware, praw doesn't provide any functionality to get the type of Reddit post.

    PRAW loads attributes dynamically from Reddit's response. To find out what's available on any given object, see the PRAW documentation section "Determine Available Attributes of an Object". For a Submission, it recommends the following snippet:

    import pprint
    
    # assume you have a Reddit instance bound to variable `reddit`
    submission = reddit.submission(id='39zje0')
    print(submission.title) # to make it non-lazy
    pprint.pprint(vars(submission))
    

    This will print out a dict of the available attributes. Using this, you will discover the attributes .is_self and .is_reddit_media_domain. The first will tell you (as a boolean) whether or not a post is a self post, and the second will tell you (also as a boolean) whether a post is "reddit media," which also includes videos. Rather than matching the URL to a regex, just check that .is_reddit_media_domain is true and .domain == 'i.redd.it'.

    For example:

    In [5]: reddit.submission('anr0l2').is_self
    Out[5]: True
    
    In [6]: reddit.submission('anspgf').domain == 'i.redd.it'
    Out[6]: True
    
    In [7]: reddit.submission('antg2x').domain == 'i.redd.it'
    Out[7]: False
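
    Put together, the check described above might look like the sketch below. The function name classify_submission and the stand-in objects are hypothetical and only there so the snippet runs without a Reddit connection; with PRAW you would pass a real Submission instead. The attribute values on the stand-ins are made up for illustration.

```python
from types import SimpleNamespace


def classify_submission(submission):
    """Return 'self', 'reddit_image', or 'other' for a submission-like object."""
    # Self (text) posts are flagged directly by Reddit.
    if submission.is_self:
        return "self"
    # Reddit-hosted media also covers videos, so the domain is checked as well.
    if submission.is_reddit_media_domain and submission.domain == "i.redd.it":
        return "reddit_image"
    return "other"


# Hypothetical stand-ins for PRAW submissions (attribute values are invented).
fake_image = SimpleNamespace(is_self=False, is_reddit_media_domain=True, domain="i.redd.it")
fake_text = SimpleNamespace(is_self=True, is_reddit_media_domain=False, domain="self.example")
fake_video = SimpleNamespace(is_self=False, is_reddit_media_domain=True, domain="v.redd.it")

for fake in (fake_image, fake_text, fake_video):
    print(classify_submission(fake))
```

    Only the three attributes named above are read, so the same function should work on anything that exposes them.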
    

  • how can I make sure that there is absolutely no text at all in the image

    What do you mean by "no text in the image"? What does it mean to you for an image to contain text?
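
    If what you actually want is to know whether a self post's body holds a single i.redd.it image link and nothing else, one possible approach is to remove the link, the zero-width space (either as the literal text &#x200B; or as the decoded character), and all whitespace, then check that nothing is left. This is only a sketch under that assumption; the function name is_only_one_image is made up, and it does not handle Markdown links such as [text](url).

```python
import re

# Same idea as the regex in the question, without the ^ and $ anchors.
# "jpe?g" is used so that a ".jpeg" URL is not cut short at ".jpg".
IMAGE_URL = re.compile(r"https?://i\.redd\.it/\w+\.(?:png|gif|jpe?g)")

# The zero-width space, as an HTML entity or as the decoded character.
ZERO_WIDTH_SPACE = re.compile(r"&#x200B;|\u200b", re.IGNORECASE)


def is_only_one_image(selftext):
    """True if selftext is exactly one image URL plus ignorable filler."""
    if len(IMAGE_URL.findall(selftext)) != 1:
        return False
    leftover = IMAGE_URL.sub("", selftext)
    leftover = ZERO_WIDTH_SPACE.sub("", leftover)
    # Anything that survives stripping is real text in the post.
    return leftover.strip() == ""


print(is_only_one_image("&#x200B;\n\nhttps://i.redd.it/xxxxxxxxxx.png"))
print(is_only_one_image("Look at this: https://i.redd.it/xxxxxxxxxx.png"))
```

    With PRAW you would call it as is_only_one_image(submission.selftext) after confirming that submission.is_self is true.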