Tags: python, performance, web-scraping, praw, coding-efficiency

Scraping data with PRAW - How can I improve my code?


I have this code:

import datetime

import pandas as pd
# `reddit` is assumed to be an already-authenticated praw.Reddit instance

posts = []

subs = list(set(['Futurology', 'wallstreetbets', 'DataIsBeautiful','RenewableEnergy', 'Bitcoin', 'Android', 'programming',
'gaming','tech', 'google','hardware', 'oculus', 'software', 'startups', 'linus', 'microsoft',  'AskTechnology', 'realtech', 
'homeautomation', 'HomeKit','singularity', 'technews','Entrepreneur', 'investing', 'BusinessHub', 'CareerSuccess', 
'growmybusiness','venturecapital', 'ladybusiness', 'productivity', 'NFT', 'CryptoCurrency']))

targeted_date = '01-09-19 12:00:00'
targeted_date = datetime.datetime.strptime(targeted_date, '%d-%m-%y %H:%M:%S')


for sub_name in subs:
    for submission in reddit.subreddit(sub_name).hot(limit = 1):
        date = submission.created
        date = datetime.datetime.fromtimestamp(date)
        if date >= targeted_date and reddit.subreddit(sub_name).subscribers >= 35000:
            posts.append([date, submission.subreddit, reddit.subreddit(sub_name).subscribers, 
                      submission.title, submission.selftext])
        
df = pd.DataFrame(posts, columns = ['date', 'subreddit','subscribers', 'title', 'text'])
df

Runtime with limit = 16 (~500 rows): 905.9099962711334 s

Which gives me these results:

date    subreddit   subscribers title   text
0   2021-11-08 09:18:22 Bitcoin 3546142 Please upgrade your node to enable Taproot. 
1   2021-09-19 17:01:03 homeautomation  1333753 Looking for developers interested in helping t...   A while back I opened sourced all of my source...
2   2021-11-11 11:00:17 Entrepreneur    1036934 Thank you Thursday! - November 11, 2021 **Your opportunity to thank the** /r/Entrepren...
3   2021-11-08 01:36:05 oculus  396752  [Weekly] What VR games have you been enjoying ...   Welcome to the weekly recommendation thread! :...
4   2021-06-17 19:25:01 microsoft   141810  Microsoft: Official Support Thread  Microsoft: Official Support Thread\n\nMicrosof...
5   2021-11-12 11:02:14 investing   1946917 Daily General Discussion and spitballin thread...   Have a general question? Want to offer some c...
6   2021-11-12 04:16:13 tech    413040  Mars rover scrapes at rock to 'look at somethi...   
7   2021-11-12 12:00:15 wallstreetbets  11143628    Daily Discussion Thread for November 12, 2021   Your daily trading discussion thread. Please k...
8   2021-04-17 14:50:02 singularity 134940  Re: The Discord Link Expired, so here's a new ...   
9   2021-11-12 11:40:04 programming 3682438 It's probably time to stop recommending Clean ...   
10  2021-09-10 10:26:07 software    149655  What I do/install on every Windows PC - Softwa...   Hello, I have to spend a lot of time finding s...
11  2021-11-12 13:00:18 Android 2315799 Daily Superthread (Nov 12 2021) - Your daily t...   Note 1. Check [MoronicMondayAndroid](https://o...
12  2021-11-11 23:32:33 CryptoCurrency  3871810 Live Recording: Kevin O’Leary Talks About Cryp...   
13  2021-11-02 20:53:21 productivity    874076  Self-promotion/shout out thread This is the place to share your personal blogs...
14  2021-11-12 14:57:19 RenewableEnergy 97364   Northvolt produces first fully recycled batter...   
15  2021-11-12 08:00:16 gaming  30936297    Free Talk Friday!   Use this post to discuss life, post memes, or ...
16  2021-11-01 05:01:23 startups    884574  Share Your Startup - November 2021 - Upvote Th...   [r/startups](https://www.reddit.com/r/startups...
17  2021-11-01 09:00:11 HomeKit 107076  Monthly Buying Megathread - Ask which accessor...   Looking for lights, a thermostat, a plug, or a...
18  2021-11-01 13:00:13 dataisbeautiful 16467198    [Topic][Open] Open Discussion Thread — Anybody...   Anybody can post a question related to data vi...
19  2021-11-12 12:29:47 technews    339611  Peter Jackson sells visual effects firm for $1...   
20  2021-10-07 19:15:14 NFT 221897  Join our official —and the #1 NFT— Discord Ser...   
21  2020-12-01 12:11:36 google  1622449 Monthly Discussion and Support Thread - Decemb...   Have a question you need answered? A new Googl...

The issue is that it takes far too much time. As you can see, I set limit = 1 and it takes approximately 1 minute to run. Yesterday I set the limit to 300 in order to analyze the data, and it ran for about 2 hours.

My question: Is there a way to reorganize the code in order to reduce the runtime?

The code below used to run much faster, but I wanted to add a subscriber-count column and had to add a second for loop:

posts = []
subs = reddit.subreddit('Futurology+wallstreetbets+DataIsBeautiful+RenewableEnergy+Bitcoin+Android+programming+gaming+tech+google+hardware+oculus+software+startups+linus+microsoft+AskTechnology+realtech+homeautomation+HomeKit+singularity+technews+Entrepreneur+investing+BusinessHub+CareerSuccess+growmybusiness+venturecapital+ladybusiness+productivity+NFT+CryptoCurrency')
targeted_date = '01-09-19 12:00:00'
targeted_date = datetime.datetime.strptime(targeted_date, '%d-%m-%y %H:%M:%S')   

for submission in subs.new(limit = 500):
    date = submission.created
    date = datetime.datetime.fromtimestamp(date)
    posts.append([date, submission.subreddit, submission.title, submission.selftext])

df = pd.DataFrame(posts, columns = ['date', 'subreddit', 'title', 'text'])
df

Runtime with limit = 500 (500 rows): 7.630232095718384 s

I know they aren't doing exactly the same thing, but the only reason I tried this new version was to add the 'subscribers' column, which has to be fetched with a separate call, unlike the other fields.

Any suggestions/improvement to suggest?

One last thing: does anyone know a way to retrieve a list of all subreddits on a specific subject (such as technology)? I found this page that lists subreddits: https://www.reddit.com/r/ListOfSubreddits/wiki/listofsubreddits/#wiki_technology

Thanks :)


Solution

  • Improving your existing code by reducing conversions and server calls (with explanations at the end):

    posts = []
    
    subs = list(set(['Futurology', 'wallstreetbets', 'DataIsBeautiful','RenewableEnergy', 'Bitcoin', 'Android', 'programming',
    'gaming','tech', 'google','hardware', 'oculus', 'software', 'startups', 'linus', 'microsoft',  'AskTechnology', 'realtech', 
    'homeautomation', 'HomeKit','singularity', 'technews','Entrepreneur', 'investing', 'BusinessHub', 'CareerSuccess', 
    'growmybusiness','venturecapital', 'ladybusiness', 'productivity', 'NFT', 'CryptoCurrency']))
    
    # convert target date into epoch format
    targeted_date = '01-09-19 12:00:00'
    targeted_date = datetime.datetime.strptime(targeted_date, '%d-%m-%y %H:%M:%S').timestamp()
    
    
    for sub_name in subs:
        subscriber_number = reddit.subreddit(sub_name).subscribers
        if subscriber_number < 35000: # skip the whole subreddit up front; the original combined check would have been False for every one of its posts
            continue
    
        for submission in reddit.subreddit(sub_name).hot(limit = 1):
            date = submission.created # reddit uses epoch time timestamps
            if date >= targeted_date:
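                # note: date is still an epoch timestamp here; wrap it in
                # datetime.datetime.fromtimestamp(date) if the DataFrame's
                # 'date' column should hold readable datetimes again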
                posts.append([date, submission.subreddit, subscriber_number, 
                          submission.title, submission.selftext])
    
    df = pd.DataFrame(posts, columns = ['date', 'subreddit','subscribers', 'title', 'text'])
    df
    

    By splitting up the AND condition, you can skip a whole subreddit (with continue) as soon as its subscriber count fails the check, instead of still fetching and looping over its submissions.

    Instead of converting every submission's timestamp to a human-readable date inside the for loop, convert the target date once into the epoch format Reddit uses. That removes the per-post conversion work and turns the date filter into a plain numeric comparison.

    By storing the subscriber count in a variable, you avoid making repeated server calls for the same value and just look the number up in memory.
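
  • If you would rather keep the faster single-stream version from your second snippet, the subscribers column does not need a second loop over the subreddits: the count only has to be fetched once per subreddit, not once per post. A minimal sketch of that idea (assuming reddit is your authenticated PRAW instance, subs is the combined subreddit object from your second snippet, and targeted_date has been converted to an epoch timestamp as above; the cache dictionary is just an illustration, not a PRAW feature):

    posts = []
    subscriber_cache = {}  # subreddit name -> subscriber count, filled lazily

    for submission in subs.new(limit = 500):
        name = submission.subreddit.display_name
        if name not in subscriber_cache:
            # one extra API call per subreddit (32 in total), not one per post
            subscriber_cache[name] = submission.subreddit.subscribers
        if subscriber_cache[name] < 35000:
            continue
        if submission.created >= targeted_date:
            # convert only the posts that are actually kept
            posts.append([datetime.datetime.fromtimestamp(submission.created),
                          submission.subreddit, subscriber_cache[name],
                          submission.title, submission.selftext])

    df = pd.DataFrame(posts, columns = ['date', 'subreddit', 'subscribers', 'title', 'text'])

    This keeps the single listing request stream that made your second version fast and adds at most one extra call per subreddit for the subscriber count.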
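
  • For the last question: PRAW can also search for subreddits by topic, which might complement the r/ListOfSubreddits wiki page you found. A small sketch (the query string and limit are only examples):

    tech_subs = [sub.display_name for sub in reddit.subreddits.search('technology', limit = 50)]
    print(tech_subs)

    Each result is a full Subreddit object, so you could also filter on sub.subscribers here before adding a name to your list.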