Tags: python, regex, web-scraping, scrapy, forums

Extraction of specific fields from a thread in a forum


I am working on a data-mining project in which I need to analyse how a discussion progresses within a forum thread. I want to extract information such as the time of each post, statistics about the post's author (number of posts, joining date, etc.), and the text of the post.

However, with standard scraping tools (such as Scrapy in Python) I have to write regular expressions to detect these fields in the page's HTML source. Because the tags vary with the type of forum, writing regular expressions for every forum is becoming a major problem. Is there a standard bank of such regular expressions available, so that they can be selected based on the type of forum?

Or is there any other technique for extracting these fields from a forum's page?


Solution

  • I wrote configuration files for some of the major forum packages. Hopefully you can work out from them how to parse each type of forum.

    For vBulletin:

    enclosed_section=tag:table,attributes:id;threadslist
    thread=tag:a,attributes:id;REthread_title_
    list_next_page=type:next_page,attributes:anchor_text;>
    post=tag:div,attributes:id;REpost_message_
    thread_next_page=type:next_page,attributes:anchor_text;>
    

    In this configuration:

    • enclosed_section is the element (a table here) that contains the links to all the threads.
    • thread is where you'll find the link to each thread.
    • list_next_page is the link to the next page of the thread list.
    • post is the div that holds the post text.
    • thread_next_page is the link to the next page of the thread.
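
A minimal sketch of how config lines like these could be read into Python dictionaries. The function name `parse_rule` and the output keys (`attr_name`, `attr_value`, `attr_is_regex`) are hypothetical, and treating the `RE` prefix as "match this value as a regular expression" is an assumption inferred from the config, not something the answer states.

```python
def parse_rule(line):
    """Parse one 'name=key:value,key:value' config line into (name, fields)."""
    name, _, spec = line.strip().partition("=")
    fields = {}
    # Note: a reg_exp value that itself contains a comma would need extra handling.
    for part in spec.split(","):
        key, _, value = part.partition(":")
        fields[key] = value
    # 'attributes' has the form 'attr_name;attr_value'; split it into two keys.
    attr_name, _, attr_value = fields.pop("attributes", "").partition(";")
    # Assumption: an 'RE' prefix marks the value as a regular expression.
    is_regex = attr_value.startswith("RE")
    if is_regex:
        attr_value = attr_value[2:]
    fields.update(attr_name=attr_name, attr_value=attr_value, attr_is_regex=is_regex)
    return name, fields


VBULLETIN_CONFIG = """\
enclosed_section=tag:table,attributes:id;threadslist
thread=tag:a,attributes:id;REthread_title_
list_next_page=type:next_page,attributes:anchor_text;>
post=tag:div,attributes:id;REpost_message_
thread_next_page=type:next_page,attributes:anchor_text;>
"""

# Build a {rule_name: fields} mapping from the whole config.
rules = dict(parse_rule(line) for line in VBULLETIN_CONFIG.splitlines() if line.strip())
print(rules["post"])
```

Keeping one such config per forum package means the scraping code stays the same and only the rule set changes.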

    For Invision:

    enclosed_section=tag:table,attributes:id;forum_table
    thread=tag:a,attributes:class;topic_title
    list_next_page=tag:a,attributes:rel;next,inside_tag_attribute:href
    post=tag:div,attributes:class;post entry-content |
    thread_next_page=tag:a,attributes:rel;next,inside_tag_attribute:href
    post_count_section=tag:td,attributes:class;stats
    post_count=tag:li,attributes:,reg_exp:(\d+) Repl
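
Rules of the "tag plus attribute" kind shown above can be applied without hand-writing a regular expression over the whole page. The sketch below uses only Python's standard `html.parser` and applies the vBulletin `post` rule (a `div` whose `id` starts with `post_message_`). The class name `RuleExtractor` and the sample HTML are made up for illustration, and it assumes the pattern is matched at the start of the attribute value.

```python
import re
from html.parser import HTMLParser


class RuleExtractor(HTMLParser):
    """Collect the text of every element matching a tag/attribute rule."""

    def __init__(self, tag, attr_name, attr_pattern):
        super().__init__()
        self.tag = tag
        self.attr_name = attr_name
        self.attr_pattern = re.compile(attr_pattern)
        self.depth = 0      # > 0 while inside a matching element
        self.results = []   # one text string per matching element

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        if self.depth:
            self.depth += 1  # nested element with the same tag name
            return
        value = dict(attrs).get(self.attr_name)
        if value is not None and self.attr_pattern.match(value):
            self.depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


# Made-up sample page, not taken from a real forum.
SAMPLE_HTML = """
<div id="post_message_101">First post<br> with a <b>bold</b> word</div>
<div id="sidebar">ignored</div>
<div id="post_message_102">Second <div>nested</div> post</div>
"""

extractor = RuleExtractor("div", "id", r"post_message_")
extractor.feed(SAMPLE_HTML)
extractor.close()
posts = [text.strip() for text in extractor.results]
print(posts)
```

In a Scrapy project the same rule would more naturally be turned into a CSS or XPath selector; the standard-library version is used here only to keep the example self-contained.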