web-scraping, xml-parsing, web-crawler

Looking for an open-source web crawler that can crawl API requests and parse XML into CSV


I'm looking into web crawlers that can crawl through an API and parse the XML it returns into an XML or CSV file.

I've been playing around with requests to some API feeds, but it would be great if I didn't have to make them manually. I'd like something that fetches the data automatically so I can edit it later.

For example, using the API for a site called Eventful, I can request an "XML feed" of data:

http://api.eventful.com/rest/events/search?app_key=LksBnC8MgTjD4Wc5&location=pittsburgh&date=Future

If you open the link, you can see that a large amount of XML data is sent back.
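To illustrate what "doing it automatically" could look like, here is a minimal Python sketch that builds the same kind of request URL for several locations using only the standard library. The query parameters (app_key, location, date) come from the link above; the function names, the YOUR_APP_KEY placeholder, and the list of cities are assumptions made for illustration, not part of the Eventful API.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the example link in the question.
BASE_URL = "http://api.eventful.com/rest/events/search"


def build_search_url(app_key, location, date="Future"):
    """Return a search URL with the same query parameters as the example link."""
    query = urlencode({"app_key": app_key, "location": location, "date": date})
    return f"{BASE_URL}?{query}"


def fetch_xml(url, timeout=30):
    """Download the raw XML bytes for one request (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


# Hypothetical list of cities; replace the placeholder key with a real one.
locations = ["pittsburgh", "cleveland"]
urls = [build_search_url("YOUR_APP_KEY", city) for city in locations]
```

Calling fetch_xml(url) on each entry in urls would then return the XML for that city, assuming the service is reachable and the key is valid.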

I thought that since the XML data is already broken down into elements, it wouldn't be too difficult to have the crawler handle the sorting (e.g., the contents of each city element would go into a city column in the CSV document).
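As a rough sketch of that element-to-column mapping, the Python standard library can do it without a crawler. The element names below (event, title, city) and the sample document are assumptions for illustration only; the real feed's element names would need to be checked against the actual response.

```python
import csv
import io
import xml.etree.ElementTree as ET


def xml_to_csv(xml_text, record_tag, fields):
    """Write one CSV row per <record_tag> element, one column per child element."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(fields)  # header row
    for record in root.iter(record_tag):
        # A missing child element becomes an empty cell.
        writer.writerow([(record.findtext(name) or "").strip() for name in fields])
    return out.getvalue()


# Hypothetical sample; not the real structure of the Eventful response.
SAMPLE = """<search>
  <events>
    <event id="E1"><title>Jazz Night</title><city>Pittsburgh</city></event>
    <event id="E2"><title>Art Walk</title><city>Pittsburgh</city></event>
  </events>
</search>"""

csv_text = xml_to_csv(SAMPLE, "event", ["title", "city"])
print(csv_text)
```

The resulting text can be written to a .csv file and opened in Excel. ET.fromstring also accepts the raw bytes of a downloaded response, so the same function should work on a fetched feed once the field names are adjusted.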

I'm wondering if anyone has used an existing open-source web crawler to crawl APIs and put the parsed data into an Excel-like format.

I looked into Nutch, but I couldn't find anything in the documentation about sorting an XML response into an Excel-like document based on the elements returned by the API feed.

Has anyone done anything like this before, and can you recommend a program? Specifics would be really helpful.


Solution

  • I found a paid solution called Mozenda.

    I'll update this if I find something open source.