Tags: php, python, automation, bots, reddit

What would be the best way to collect the titles (in bulk) of a subreddit


I want to collect the titles of all of the posts on a subreddit. What would be the best way of going about this?

I've looked around and found some stuff talking about Python and bots. I've also had a brief look at the API and am unsure in which direction to go.

I don't want to commit to one approach only to find out 90% of the way through that it won't work, so could someone point me in the right direction regarding the language and any extras needed, such as software like pip for Python?

My own experience is in web languages such as PHP, so I initially thought a web app would do the trick, but I am unsure whether this would be the best way or how to go about it.

So, my question as it stands:

What would be the best way to collect the titles (in bulk) of a subreddit?

Or, if that is too subjective:

How do I retrieve and store all the post titles of a subreddit?

Preferably, it needs to:

  • handle more than one page (25) of results
  • save to a .txt file

Thanks in advance.


Solution

  • PHP; in 25 lines:

    $subreddit = 'pokemon';
    $max_pages = 10;
    
    // Set variables with default data
    $page = 0;
    $after = '';
    $titles = '';
    do {
        $url = 'https://www.reddit.com/r/' . $subreddit . '/new.json?limit=25&after=' . $after;
    
        // Initialise curl with the URL to fetch
        $ch = curl_init($url);

        // Don't include the response headers in the output (not needed)
        curl_setopt($ch, CURLOPT_HEADER, 0);

        // Keep NOBODY off, as we need the response body
        curl_setopt($ch, CURLOPT_NOBODY, 0);

        // Set curl timeout of 5 seconds
        curl_setopt($ch, CURLOPT_TIMEOUT, 5);

        // Send a descriptive User-Agent, as reddit's API rules ask for one
        // (this string is only an example; use your own)
        curl_setopt($ch, CURLOPT_USERAGENT, 'subreddit-title-collector/0.1');

        // Set curl to return output as string
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    
        // Execute curl
        $output = curl_exec($ch);
    
        // Get HTTP code of request
        $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    
        // Close curl
        curl_close($ch);
    
        // If HTTP code is 200 (success)
        if ($status == 200) {
            // Decode JSON into PHP object
            $json = json_decode($output);
            // Set after for next curl iteration (reddit's pagination);
            // it is null once there are no further pages
            $after = $json->data->after;
            // Loop through each post and collect its title
            foreach ($json->data->children as $k => $v) {
                $titles .= $v->data->title . "\n";
            }
        }
        // Increment page number
        $page++;
    // Loop while under the maximum pages and reddit reports a next page
    // (an empty $after would otherwise restart from the first page)
    } while ($page < $max_pages && $after);
    
    // Save titles to text file
    file_put_contents(dirname(__FILE__) . '/' . $subreddit . '.txt', $titles);
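
Since the question also mentions Python, here is a rough Python counterpart of the same loop, as a sketch rather than a definitive implementation. It assumes the same `/new.json` endpoint and `after` cursor used in the PHP above and needs only the standard library (no pip packages). The function names, the `fetch` parameter and the User-Agent string are made up for illustration.

```python
import json
import urllib.request


def fetch_listing(subreddit, after=None, limit=25):
    """Fetch one page of /new.json and return the decoded listing (network call)."""
    url = f"https://www.reddit.com/r/{subreddit}/new.json?limit={limit}"
    if after:
        url += f"&after={after}"
    # Example User-Agent only (an assumption); replace it with your own
    request = urllib.request.Request(
        url, headers={"User-Agent": "subreddit-title-collector/0.1"}
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return json.load(response)


def collect_titles(subreddit, max_pages=10, fetch=fetch_listing):
    """Follow the `after` cursor for up to max_pages pages and return the titles."""
    titles = []
    after = None
    for _ in range(max_pages):
        listing = fetch(subreddit, after)
        titles.extend(child["data"]["title"] for child in listing["data"]["children"])
        after = listing["data"]["after"]
        if not after:  # no cursor means there are no further pages
            break
    return titles


def save_titles(titles, path):
    """Write one title per line to a UTF-8 text file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(titles) + "\n")


# Example use (performs real HTTP requests):
# save_titles(collect_titles("pokemon"), "pokemon.txt")
```

Passing the page fetcher in as `fetch` keeps the pagination logic separate from the HTTP call, so the loop can be tried against canned data before pointing it at reddit.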