I have this code:
posts = []
subs = list(set(['Futurology', 'wallstreetbets', 'DataIsBeautiful','RenewableEnergy', 'Bitcoin', 'Android', 'programming',
'gaming','tech', 'google','hardware', 'oculus', 'software', 'startups', 'linus', 'microsoft', 'AskTechnology', 'realtech',
'homeautomation', 'HomeKit','singularity', 'technews','Entrepreneur', 'investing', 'BusinessHub', 'CareerSuccess',
'growmybusiness','venturecapital', 'ladybusiness', 'productivity', 'NFT', 'CryptoCurrency']))
targeted_date = '01-09-19 12:00:00'
targeted_date = datetime.datetime.strptime(targeted_date, '%d-%m-%y %H:%M:%S')
for sub_name in subs:
    for submission in reddit.subreddit(sub_name).hot(limit = 1):
        date = submission.created
        date = datetime.datetime.fromtimestamp(date)
        if date >= targeted_date and reddit.subreddit(sub_name).subscribers >= 35000:
            posts.append([date, submission.subreddit, reddit.subreddit(sub_name).subscribers,
                          submission.title, submission.selftext])
df = pd.DataFrame(posts, columns = ['date', 'subreddit','subscribers', 'title', 'text'])
df
Runtime with limit = 16 (~500 rows): 905.9099962711334 s
Which gives me these results:
date subreddit subscribers title text
0 2021-11-08 09:18:22 Bitcoin 3546142 Please upgrade your node to enable Taproot.
1 2021-09-19 17:01:03 homeautomation 1333753 Looking for developers interested in helping t... A while back I opened sourced all of my source...
2 2021-11-11 11:00:17 Entrepreneur 1036934 Thank you Thursday! - November 11, 2021 **Your opportunity to thank the** /r/Entrepren...
3 2021-11-08 01:36:05 oculus 396752 [Weekly] What VR games have you been enjoying ... Welcome to the weekly recommendation thread! :...
4 2021-06-17 19:25:01 microsoft 141810 Microsoft: Official Support Thread Microsoft: Official Support Thread\n\nMicrosof...
5 2021-11-12 11:02:14 investing 1946917 Daily General Discussion and spitballin thread... Have a general question? Want to offer some c...
6 2021-11-12 04:16:13 tech 413040 Mars rover scrapes at rock to 'look at somethi...
7 2021-11-12 12:00:15 wallstreetbets 11143628 Daily Discussion Thread for November 12, 2021 Your daily trading discussion thread. Please k...
8 2021-04-17 14:50:02 singularity 134940 Re: The Discord Link Expired, so here's a new ...
9 2021-11-12 11:40:04 programming 3682438 It's probably time to stop recommending Clean ...
10 2021-09-10 10:26:07 software 149655 What I do/install on every Windows PC - Softwa... Hello, I have to spend a lot of time finding s...
11 2021-11-12 13:00:18 Android 2315799 Daily Superthread (Nov 12 2021) - Your daily t... Note 1. Check [MoronicMondayAndroid](https://o...
12 2021-11-11 23:32:33 CryptoCurrency 3871810 Live Recording: Kevin O’Leary Talks About Cryp...
13 2021-11-02 20:53:21 productivity 874076 Self-promotion/shout out thread This is the place to share your personal blogs...
14 2021-11-12 14:57:19 RenewableEnergy 97364 Northvolt produces first fully recycled batter...
15 2021-11-12 08:00:16 gaming 30936297 Free Talk Friday! Use this post to discuss life, post memes, or ...
16 2021-11-01 05:01:23 startups 884574 Share Your Startup - November 2021 - Upvote Th... [r/startups](https://www.reddit.com/r/startups...
17 2021-11-01 09:00:11 HomeKit 107076 Monthly Buying Megathread - Ask which accessor... Looking for lights, a thermostat, a plug, or a...
18 2021-11-01 13:00:13 dataisbeautiful 16467198 [Topic][Open] Open Discussion Thread — Anybody... Anybody can post a question related to data vi...
19 2021-11-12 12:29:47 technews 339611 Peter Jackson sells visual effects firm for $1...
20 2021-10-07 19:15:14 NFT 221897 Join our official —and the #1 NFT— Discord Ser...
21 2020-12-01 12:11:36 google 1622449 Monthly Discussion and Support Thread - Decemb... Have a question you need answered? A new Googl...
The issue is that it takes far too long. As you can see, with limit = 1 it takes approximately 1 minute to run. Yesterday I set the limit to 300 in order to analyze the data, and it ran for about 2 hours.
My question: is there a way to reorganize the code to reduce the run time?
The code below used to run much faster, but I wanted to add a column with the subscriber count and had to add a second for loop:
posts = []
subs = reddit.subreddit('Futurology+wallstreetbets+DataIsBeautiful+RenewableEnergy+Bitcoin+Android+programming+gaming+tech+google+hardware+oculus+software+startups+linus+microsoft+AskTechnology+realtech+homeautomation+HomeKit+singularity+technews+Entrepreneur+investing+BusinessHub+CareerSuccess+growmybusiness+venturecapital+ladybusiness+productivity+NFT+CryptoCurrency')
targeted_date = '01-09-19 12:00:00'
targeted_date = datetime.datetime.strptime(targeted_date, '%d-%m-%y %H:%M:%S')
for subreddit in subs.new(limit = 500):
    date = subreddit.created
    date = datetime.datetime.fromtimestamp(date)
    posts.append([date, subreddit.subreddit, subreddit.title, subreddit.selftext])
df = pd.DataFrame(posts, columns = ['date', 'subreddit', 'title', 'text'])
df
Runtime with limit = 500 (500 rows): 7.630232095718384 s
I know they aren't doing exactly the same thing, but the only reason I wrote the new code was to add the 'subscribers' column, which seems to behave differently from the other calls.
Any suggestions or improvements?
One last thing: does anyone know a way to retrieve a list of all subreddits on a specific subject (such as technology)? I found this page that lists subreddits: https://www.reddit.com/r/ListOfSubreddits/wiki/listofsubreddits/#wiki_technology
Thanks :)
Here is your existing code, improved by reducing conversions and server calls (explanations at the end):
posts = []
subs = list(set(['Futurology', 'wallstreetbets', 'DataIsBeautiful','RenewableEnergy', 'Bitcoin', 'Android', 'programming',
'gaming','tech', 'google','hardware', 'oculus', 'software', 'startups', 'linus', 'microsoft', 'AskTechnology', 'realtech',
'homeautomation', 'HomeKit','singularity', 'technews','Entrepreneur', 'investing', 'BusinessHub', 'CareerSuccess',
'growmybusiness','venturecapital', 'ladybusiness', 'productivity', 'NFT', 'CryptoCurrency']))
# convert target date into epoch format
targeted_date = '01-09-19 12:00:00'
targeted_date = datetime.datetime.strptime(targeted_date, '%d-%m-%y %H:%M:%S').timestamp()
for sub_name in subs:
    subreddit = reddit.subreddit(sub_name)
    # look up the subscriber count once per subreddit and reuse it below
    subscriber_number = subreddit.subscribers
    # skip small subreddits before requesting any posts; the original
    # condition would have been false for every one of their submissions
    if subscriber_number < 35000:
        continue
    for submission in subreddit.hot(limit = 1):
        date = submission.created  # reddit uses epoch timestamps
        if date >= targeted_date:
            posts.append([date, submission.subreddit, subscriber_number,
                          submission.title, submission.selftext])
df = pd.DataFrame(posts, columns = ['date', 'subreddit','subscribers', 'title', 'text'])
df
By splitting the AND condition into two separate checks, you can skip the inner loop entirely for subreddits where the subscriber test would have evaluated to false.
Instead of converting each submission's date to a human-readable datetime inside the for loop, convert the target date once into the epoch format that Reddit uses. This removes the per-submission conversion and leaves only a comparison between two numbers. As a consequence, the date column now holds epoch seconds, so convert it afterwards if you need readable dates.
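As a minimal sketch of the date handling described here: the cutoff is parsed once and kept as epoch seconds, so each check inside the loop is a plain number comparison. The names TARGET_EPOCH, is_recent and to_datetime are made up for illustration and are not part of PRAW.

```python
import datetime

# Parse the cutoff once, outside any loop, and keep it as epoch seconds.
TARGET_EPOCH = datetime.datetime.strptime(
    '01-09-19 12:00:00', '%d-%m-%y %H:%M:%S'
).timestamp()


def is_recent(created_epoch, target_epoch=TARGET_EPOCH):
    """Return True if an epoch timestamp is at or after the cutoff."""
    return created_epoch >= target_epoch


def to_datetime(created_epoch):
    """Convert an epoch timestamp to a local datetime, for display only."""
    return datetime.datetime.fromtimestamp(created_epoch)
```

Inside the loop you would call is_recent(submission.created), and apply to_datetime only to the rows you keep if you want the date column to stay human-readable.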
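By storing the subscriber count in a variable, you reduce the number of calls made to retrieve that information; later uses read the number from memory instead. In the original loop, every reddit.subreddit(sub_name).subscribers expression builds a new subreddit object whose data has to be fetched again, which is most likely where the time went.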
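Going one step further, the cached counts could be combined with the faster multireddit listing from your second snippet. The following is only a sketch under some assumptions: reddit is an already authenticated praw.Reddit instance, the helper names fetch_subscriber_counts and collect_posts are invented for this example, and it uses .new() on the combined listing, so it returns the newest posts across all subreddits rather than the hot posts of each one.

```python
def fetch_subscriber_counts(reddit, sub_names):
    """Look up each subreddit's subscriber count exactly once.

    Keys are lower-cased because listings report the canonical display
    name (for example 'dataisbeautiful'), which may differ in case from
    the name that was typed into the list.
    """
    return {name.lower(): reddit.subreddit(name).subscribers
            for name in sub_names}


def collect_posts(reddit, sub_names, target_epoch,
                  min_subscribers=35000, limit=500):
    """Return rows of [date, subreddit, subscribers, title, text]."""
    counts = fetch_subscriber_counts(reddit, sub_names)
    # Drop small subreddits before requesting any posts.
    kept = [n for n in sub_names if counts[n.lower()] >= min_subscribers]
    if not kept:
        return []
    rows = []
    # 'a+b+c' asks for one combined listing instead of one per subreddit.
    for submission in reddit.subreddit('+'.join(kept)).new(limit=limit):
        if submission.created < target_epoch:
            continue
        name = str(submission.subreddit)
        # Subscriber count comes from the dictionary, not from the server.
        rows.append([submission.created, name, counts[name.lower()],
                     submission.title, submission.selftext])
    return rows
```

The rows can then be passed to pd.DataFrame with the same five column names as before. This costs one lookup per subreddit plus the combined listing, instead of extra lookups for every submission.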