Tags: python, html, web-scraping, beautifulsoup, forums

Scraping a Forum with Beautiful Soup - How to exclude Quoted Replies?


It's my first time using Beautiful Soup, or doing web scraping for that matter. I'm happy with how far I've gotten, but I have hit an obstacle.

I am trying to scrape all the posts on a particular thread. However, I want to exclude the text from quoted replies.

An example:

I would like to scrape the text from these postings without scraping the text within the area indicated in the red box.

In the HTML, the part I want to exclude sits inside the element I need to select for the message, which is why I am having difficulty. Here is the relevant HTML:

<div id="post_message_39096267"><!-- google_ad_section_start --><div style="margin:20px; margin-top:5px; ">
<div class="smallfont" style="margin-bottom:2px">Quote:</div>
<table cellpadding="6" cellspacing="0" border="0" width="100%">
<tbody><tr>
    <td class="alt2" style="border:1px inset">

            <div>
                Originally Posted by <strong>SAAN</strong>
                <a href="http://www.city-data.com/forum/economics/2056372-minimum-wage-vs-liveable-wage-post33645660.html#post33645660" rel="nofollow"><img class="inlineimg li fs-viewpost" src="http://pics3.city-data.com/trn.gif" border="0" alt="View Post" title="View Post"></a>
            </div>
            <div style="font-style:italic">I agree with trying to buy a 
cheap car outright, the problem is everyone I know that has done that $2-
5000 car, always ended up with these huge repair bills that are equivalent 
to car payments.  Most cars after 100K will need all sort of regulatr 
maintance that is easily a $200 repair to go along with anything that may 
break which is common with cars as they age.<br>
<br>
I have a 2yr old im making payments on and 14yr old car that is paid off, 
but needs $2000 in maintenance.  When car shopping this summer, I saw many 
cars i could buy outright, but after adding u everything needed to make sure 
it needs nothing, your back into the price range of a car payment.</div>

    </td>
</tr>
</tbody></table>
</div>Depends on how long the car loan would be stretched. Just because you 
can get an 8 year loan and reduce payments to a level like the repairs on 
your old car doesn't make it a good idea, especially for new cars that <a 
href="/knowledge/Depreciation.html" title="View 'depreciate' definition from 
Wikipedia" class="knldlink" rel="nofollow">depreciate</a> quickly. You'd 
just be putting yourself into negative equity territory.<!-- 
google_ad_section_end --></div>

I have included my code below. Hopefully this will help you understand what I am talking about.

from bs4 import BeautifulSoup
import urllib2


num_pages = 101
page_range = range(1, num_pages + 1)
clean_posts = []

for page in page_range:
    print("Reading page: ", page, "...")
    if page == 1:
        page_url = urllib2.urlopen('http://www.city-data.com/forum/economics/2056372-minimum-wage-vs-liveable-wage.html')
    else:
        page_url = urllib2.urlopen('http://www.city-data.com/forum/economics/2056372-minimum-wage-vs-liveable-wage' + '-' + str(page) + '.html')

    # Parse inside the loop so every page is processed, not just the last one.
    soup = BeautifulSoup(page_url, "html.parser")

    # Each post body lives in a div whose id starts with "post_message_".
    postData = soup.find_all("div", id=lambda value: value and value.startswith("post_message_"))

    posts = []
    for post in postData:
        posts.append(post.get_text().encode("utf-8").strip().replace("\t", ""))

    posts_stripped = [x.replace("\n", "") for x in posts]

    clean_posts.append(posts_stripped)

Finally, I would hugely appreciate it if you could give me code examples of what would work and explain things to me as if I were literally 9 years old!

Cheers Diarmaid


Solution

  • Check whether your post_message_ div has another div inside it (the quote div). If it does, extract it, which removes it from the post. Then append the remaining text of the original post_message_ div to your list. Replace your for post in postData loop with this one:

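    posts = []
    for post in postData:
        # The quoted reply, if present, is the first <div> nested inside the post.
        quote_div = post.find("div")
        if quote_div is not None:
            # extract() removes the quote from the tree, leaving only the reply text.
            quote_div.extract()
        posts.append(post.get_text(strip=True))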
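
One thing to watch: post.find("div") returns the first div nested anywhere inside the post, so a post that has no quote but does contain some other div would lose that div instead. Below is a minimal sketch of a slightly stricter check, assuming every quote block carries the "Quote:" label in a div with class smallfont, as in the HTML from the question. SAMPLE_HTML, is_quote_block and posts_without_quotes are hypothetical names made up for this illustration, and the sample markup is a trimmed-down imitation of the forum's HTML, not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for two forum posts (assumption, not the real page).
SAMPLE_HTML = """
<div id="post_message_1"><div style="margin:20px; margin-top:5px; ">
<div class="smallfont" style="margin-bottom:2px">Quote:</div>
<table><tr><td class="alt2">
<div>Originally Posted by <strong>SAAN</strong></div>
<div style="font-style:italic">I agree with trying to buy a cheap car outright.</div>
</td></tr></table>
</div>Depends on how long the car loan would be stretched.</div>
<div id="post_message_2">No quote here, just a <div>nested block</div> reply.</div>
"""


def is_quote_block(tag):
    """Return True if this div contains the 'Quote:' label div."""
    label = tag.find("div", class_="smallfont")
    return label is not None and label.get_text(strip=True) == "Quote:"


def posts_without_quotes(html):
    """Return the text of each post with any quoted reply removed."""
    soup = BeautifulSoup(html, "html.parser")
    post_divs = soup.find_all(
        "div", id=lambda value: value and value.startswith("post_message_")
    )
    posts = []
    for post in post_divs:
        # recursive=False looks only at divs that are direct children of the post.
        for child in post.find_all("div", recursive=False):
            if is_quote_block(child):
                child.extract()  # remove the quote from the tree
        # Join the remaining pieces of text with single spaces.
        posts.append(post.get_text(" ", strip=True))
    return posts


if __name__ == "__main__":
    for text in posts_without_quotes(SAMPLE_HTML):
        print(text)
```

On the sample above, the first post keeps only the reply ("Depends on how long the car loan would be stretched.") and the second post keeps all of its text, including the nested div. If the forum marks quotes differently on other pages, the check inside is_quote_block would need adjusting.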