python regex rss craigslist

Using Python regex to extract addresses from a Craigslist RSS feed


I'm pulling my hair out trying to parse a Craigslist RSS feed to extract location information.

I used feedparser to parse the feed into entries and entry descriptions. Unfortunately the address information is contained in irregular tags within the description section.

The addresses are contained in a section that looks like this:

<!-- CLTAG xstreet0=11832 se 318pl  -->
<!-- CLTAG xstreet1= -->
<!-- CLTAG city=auburn -->
<!-- CLTAG region=wa -->
11832 se 318pl 

Feedparser doesn't like those CLTAGS. My attempt to capture the first line with regex looked like this:

addressStart = r'!-- CLTAG xstreet0='
addressEnd = r'-->'

prog = re.compile(addressStart(.*?)addressEnd)
result = prog.match(string)

...But that didn't work. What am I doing wrong? Here is a link to the RSS feed I'm working with: http://seattle.craigslist.org/see/apa/index.rss

Any help is greatly appreciated!


Solution

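  • That's invalid syntax. re.compile takes a single pattern string, so addressStart(.*?)addressEnd cannot work: the capture group has to be a quoted string of its own, joined to the other two pieces with +. Also note that match only tries to match at the very start of the string, while the tag sits in the middle of the description (and begins with '<', which the pattern omits), so use search instead. The captured group keeps the trailing spaces before '-->', so strip it if needed. Try:

    import re

    addressStart = r'!-- CLTAG xstreet0='
    addressEnd = r'-->'
    
    prog = re.compile(addressStart + r'(.*?)' + addressEnd)
    # search() scans the whole string; match() only tries position 0
    result = prog.search(string)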
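
    If you want the city and region as well as the street, it may be easier to collect every CLTAG comment in one pass. The following is only a sketch: it assumes each tag has the shape shown in the question (<!-- CLTAG key=value -->, one per line), and the names CLTAG_RE, extract_cltags and description are made up for illustration, not part of feedparser.

```python
import re

# Assumed tag shape, taken from the sample in the question:
#   <!-- CLTAG key=value -->
CLTAG_RE = re.compile(r'<!--\s*CLTAG\s+(\w+)=(.*?)\s*-->')


def extract_cltags(description):
    """Return a dict mapping each CLTAG key to its whitespace-stripped value."""
    return {key: value.strip() for key, value in CLTAG_RE.findall(description)}


# Sample text copied from the question.
description = """<!-- CLTAG xstreet0=11832 se 318pl  -->
<!-- CLTAG xstreet1= -->
<!-- CLTAG city=auburn -->
<!-- CLTAG region=wa -->
11832 se 318pl
"""

tags = extract_cltags(description)
print(tags)
```

    Empty tags such as xstreet1 come back as empty strings, so you can skip them when assembling the final address.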