
How to save progress from scraping reddit data as database file using python


I wrote a script to fetch some posts from Reddit.

import praw
import pandas as pd
reddit = praw.Reddit(client_id='*******',
                     client_secret='*******',
                     user_agent='**********',
                     username='********',
                     password='*******8')
subreddit1 = reddit.subreddit("Tea")
subreddit2 = reddit.subreddit("Biophysics")
top_subreddit1 = subreddit1.top(limit=500)
top_subreddit2 = subreddit2.top(limit=500)
topics_dict = { "title":[],
                "score":[],
                "id":[], "url":[], 
                "comms_num": [],
                "created": [],
                "body":[]}
for submission1 in top_subreddit1:
    topics_dict["title"].append(submission1.title)
    topics_dict["score"].append(submission1.score)
    topics_dict["id"].append(submission1.id)
    topics_dict["url"].append(submission1.url)
    topics_dict["comms_num"].append(submission1.num_comments)
    topics_dict["created"].append(submission1.created)
    topics_dict["body"].append(submission1.selftext)
for submission2 in top_subreddit2:
    topics_dict["title"].append(submission2.title)
    topics_dict["score"].append(submission2.score)
    topics_dict["id"].append(submission2.id)
    topics_dict["url"].append(submission2.url)
    topics_dict["comms_num"].append(submission2.num_comments)
    topics_dict["created"].append(submission2.created)
    topics_dict["body"].append(submission2.selftext)
topics_data = pd.DataFrame(topics_dict)
topics_data

But the results only display in my Jupyter notebook. Now I want to save my progress as a database file. Any advice is appreciated.


Solution

  • You have a couple of options. I'll present two, each with their pros and cons:

    1. CSV

    Simply save your DataFrame to a .csv file using DataFrame.to_csv:

    # index=False omits the DataFrame's row index column from the file
    topics_data.to_csv('path_to_file.csv', index=False)
    

    You can then proceed to parse this csv file in your client application, i.e., whatever application is going to use your scraped data.
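    As a quick sketch of what that client-side parsing might look like (the sample rows here are hypothetical stand-ins for your scraped data):

```python
import pandas as pd

# A tiny stand-in for topics_data (hypothetical sample rows)
topics_data = pd.DataFrame({
    'title': ['A post'],
    'score': [42],
    'id': ['abc123'],
    'url': ['https://example.com'],
    'comms_num': [7],
    'created': [1600000000.0],
    'body': ['some text'],
})
topics_data.to_csv('path_to_file.csv', index=False)

# The client application reads the file back into a DataFrame
loaded = pd.read_csv('path_to_file.csv')

# PRAW's 'created' field is a raw Unix timestamp;
# convert it to a readable datetime if needed
loaded['created'] = pd.to_datetime(loaded['created'], unit='s')
```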

    Pros

    • Simple to save
    • Very barebones; further processing would be simple

    Cons

    • Very barebones; if you need any sort of structure, you won't have any flexibility

    2. SQLite

    You can also opt to store the DataFrame in SQLite using DataFrame.to_sql:

    import sqlite3
    
    db_file = 'my.db'
    # This creates a new database file if it doesn't exist
    db_conn = sqlite3.connect(db_file)
    # This creates the table 'topics_data'; if_exists='replace' lets you
    # re-run the script without a "table already exists" error
    topics_data.to_sql('topics_data', con=db_conn, if_exists='replace', index=False)
    db_conn.close()
    
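    Reading the data back out is then straightforward with pandas.read_sql_query. A minimal round-trip sketch, with hypothetical sample rows and an example filter query:

```python
import sqlite3
import pandas as pd

# A tiny stand-in for topics_data (hypothetical sample rows)
topics_data = pd.DataFrame({
    'title': ['A post', 'Another post'],
    'score': [42, 3],
    'id': ['abc123', 'def456'],
})

db_conn = sqlite3.connect('my.db')
topics_data.to_sql('topics_data', con=db_conn, if_exists='replace', index=False)

# The filtering happens in the database via SQL, not in Python
popular = pd.read_sql_query(
    'SELECT title, score FROM topics_data WHERE score > 10', db_conn)
db_conn.close()
```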

    Pros

    • Might be easier for your client to parse
    • SQL is a very strong query language. You can take advantage of this

    Cons

    • Could be overkill if all you need is basic data transfer

    Find out more about SQLite here: sqlite tutorial