python algorithm python-3.x beautifulsoup similarity

Detecting similar posts or ads on Craigslist

I'd like to scrape Craigslist for apartments in a certain region and store key data such as rent and location in a database (probably SQLite, though I haven't decided). I'm new to Python, but I found it very easy to do the scraping with requests and BeautifulSoup, e.g.

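#!/usr/bin/env python3

from bs4 import BeautifulSoup
import requests

# Fetch the San Diego apartment listings page.
r = requests.get("http://sandiego.craigslist.org/apa/")
data = r.text

# Name the parser explicitly; otherwise bs4 picks one itself and warns.
soup = BeautifulSoup(data, "html.parser")

# Print the target of every link on the page.
for link in soup.find_all('a'):
    print(link.get('href'))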

which outputs

https://post.craigslist.org/c/sdo?lang=en
https://accounts.craigslist.org
#
//www.craigslist.org/about/sites
/
/hhh/
/apa/
/csd/apa/
/nsd/apa/
/esd/apa/
/ssd/apa/
#list
#pic
#grid
#map
/apa/index100.html
/search/apa/?sort=priceasc
/search/apa/?sort=pricedsc
/csd/apa/4481946343.html
/csd/apa/4481946343.html
/csd/apa/4481860479.html
/csd/apa/4481860479.html
/ssd/apa/4481935551.html
/ssd/apa/4481935551.html
/csd/apa/4437743340.html
/csd/apa/4437743340.html
...

Posters often repost their ads. Usually they do so "legally" (e.g. at least a week has passed), but sometimes they repost with slight modifications to the ad. I'd like my scraper to keep some sort of cache of the ads it has seen so that it can flag such reposts. What should I cache in order to detect similar posts?

I realize there's more than one way to do this, but I'd like to know if the Python community has its own "Python way" of solving the problem. For example, perhaps there's already a module for doing this sort of thing (with HTML pages, no less).

Absent a better suggestion, my plan is to take each ad and store (1) the full text and (2) an MD5 hash of each image, associating each piece of data with the post id of the ad (e.g. 4481946343), and then to devise some heuristic for judging similarity, e.g. "At least one image hash matches, or there is a 95% match among words with 5 or more letters." But this is exactly where I don't feel comfortable creating my own solution; I think there must be a better, perhaps even canonical, way.
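A minimal sketch of that heuristic, using only the standard library. The function names, the dict layout for an ad, and the reading of "95% match" as the share of the smaller ad's long words found in the other ad are all assumptions made for illustration.

```python
import hashlib
import re


def image_digest(image_bytes):
    """MD5 hex digest of raw image bytes; only byte-identical images match."""
    return hashlib.md5(image_bytes).hexdigest()


def long_words(text, min_len=5):
    """Lowercased set of alphabetic words with at least min_len letters."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if len(w) >= min_len}


def word_overlap(text_a, text_b):
    """Share of the smaller ad's long words that also occur in the other ad.

    This is one possible definition of a "match"; the plan does not fix one.
    """
    a, b = long_words(text_a), long_words(text_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def looks_like_repost(ad_a, ad_b, threshold=0.95):
    """Each ad is a hypothetical dict: {'text': str, 'image_hashes': set}."""
    if ad_a["image_hashes"] & ad_b["image_hashes"]:
        return True  # at least one image hash matches
    return word_overlap(ad_a["text"], ad_b["text"]) >= threshold
```

Note that the image check only catches files that are identical byte for byte; a resized or re-saved photo would produce a different MD5 digest.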

By the way, I've read about third-party APIs like 3TAPS, but I also read that CL has filed lawsuits against such services (and won); besides, my project is simple enough that I prefer transparency and coding without such dependencies.

Solution

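I think using a hash is not a good idea. Hash functions like MD5 are designed to cover the codomain as uniformly as possible, so the hashed values carry no notion of similarity: even if the inputs differ by just one bit, the two digests look completely unrelated. There are exceptions, however, such as locality-sensitive hashes (e.g. SimHash or MinHash), which are built so that similar inputs tend to get similar hashes.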
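To see why MD5 digests say nothing about similarity, here is a small sketch that hashes two ad texts differing in a single character and counts how many of the 128 digest bits differ. The ad texts and the helper name `differing_bits` are made up for illustration.

```python
import hashlib

# Two hypothetical ad texts that differ in one character (the price).
ad_a = "Sunny 2BR apartment near the beach, $1800/month"
ad_b = "Sunny 2BR apartment near the beach, $1850/month"

digest_a = hashlib.md5(ad_a.encode("utf-8")).hexdigest()
digest_b = hashlib.md5(ad_b.encode("utf-8")).hexdigest()


def differing_bits(hex_a, hex_b):
    """Number of bit positions in which two hex digests differ."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


print(digest_a)
print(digest_b)
print(differing_bits(digest_a, digest_b), "of 128 bits differ")
```

For a well-mixing hash you would expect roughly half of the bits to flip, so the distance between digests tells you nothing about how close the two ads were.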

You might want to look at some simple text classification methods. The simplest thing to do would be to build a bag-of-words model, which gives you a numeric vector representation of every ad text. You can then calculate similarity using k-nearest neighbors, cosine similarity, or whatever else the machine learning literature has to offer.