python rss-reader feedparser google-groups google-groups-api

Getting public Google Group topic titles using the RSS feed


I am trying to get the titles of all the topics in a public Google Group from its RSS feed. The group has almost 8000 topics. I am using the following code to read the feed with feedparser.

import feedparser
url = 'https://groups.google.com/forum/feed/caffe-users/topics/rss_v2_0.xml?num=50'
feed = feedparser.parse(url)
for entry in feed['entries']:
    content = entry['title']
    print(content)

I notice that when I use num=50 I get all 50 titles. But when I change num=50 to num=8000, or even num=500, I see only 15 titles. The output looks like this:

15
"Invalid integer constant expression" Error during Installation
Can't complete make pycaffe (Python.h not found)
Kernels not compiling with Vienna-CL for openCL Intel build on Centos 7
"import caffe" failed
Frozen training model -  Reading dangerously large protocol message ?
Specifying the solver file parameters
Intel MKL FATAL ERROR: Cannot load mkl_intel_thread.dll.
Making the network shorter, adding dropout and augmenting the dataset produce overfitting, why?
Fwd: [Scala.js] Fwd: Us congress hearing of maan alsaan Money laundry قضية الكونغجرس لغسيل الأموال للمليادير معن الصانع
Feature maps from network for multiple images all the same
How to interpret the result of Ristretto?
how do I train DB with 3~10 features per image ?
Recompile with -fPIC
scaling the pixels  in deployment.prototxt in [0,1]
hi im installing caffe and i have this error

Any idea why this is happening? I get 50 titles with num=50, so why does the number of fetched titles drop to a fixed 15 when I increase num? Any help or suggestion will be appreciated. Thanks.

With the gggd library I am facing the following problem:

atan-115b-02:src mislam$ ./gggd.py -l -C cookies.txt caffe-users
Please log in to your Google groups account (navigate the form fields with up and down arrows, submit form with Enter) and then exit the browser (using the 'q' key). Press Enter to continue.

Alert!: This client does not contain support for HTTPS URLs.

lynx: Can't access startfile https://www.google.com/a/UniversalLogin?continue=https://groups.google.com/forum/&service=groups2&hd=default
gggd.py: ValueError("invalid literal for int() with base 10: 'client'",)
for help use --help


Solution

  • To download all of the messages from this Google Group you'll need an interface other than RSS. The Google Groups RSS feed returns at most 50 of the most recent items; judging by your output, asking for more than that falls back to 15 items. There is no pagination or date filtering, so you can't use the RSS feed to retrieve every message in the group.
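Since the feed stops honouring num above 50, one way to avoid the surprising drop to 15 is to clamp the value before building the URL. The following is a minimal sketch under that assumption: the names MAX_FEED_ITEMS, build_feed_url and fetch_titles are made up for illustration, and the limit of 50 comes from the behaviour described here rather than from any official documentation.

```python
from urllib.parse import urlencode

# URL pattern copied from the question; {group} is the group name.
FEED_BASE = "https://groups.google.com/forum/feed/{group}/topics/rss_v2_0.xml"

# Assumption: 50 is the largest value of num the feed honours (observed, not documented).
MAX_FEED_ITEMS = 50


def build_feed_url(group, num=MAX_FEED_ITEMS):
    """Return the topics feed URL with num clamped to the range 1..MAX_FEED_ITEMS."""
    clamped = max(1, min(int(num), MAX_FEED_ITEMS))
    return FEED_BASE.format(group=group) + "?" + urlencode({"num": clamped})


def fetch_titles(group, num=MAX_FEED_ITEMS):
    """Download the feed and return the topic titles (needs network access)."""
    import feedparser  # third-party package, imported lazily so the URL helper works without it

    feed = feedparser.parse(build_feed_url(group, num))
    return [entry["title"] for entry in feed["entries"]]
```

With this, fetch_titles("caffe-users", 8000) requests num=50 instead of 8000, so it should return the 50 most recent topic titles. It still cannot reach older topics, because the feed offers no paging.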

    Solution

    Get Google Groups Data (gggd) is a Python 2 project that crawls a specified Google Group and downloads all of its messages. After installing lynx on my Mac I was able to scrape the caffe-users forum referenced in your code.

    Good luck.