Tags: python, xml, web, download, screen-scraping

Python web scraping - Download a file and store all data in xml


I am looking to use Python to scrape some data from my university's intranet and download all the research papers. I have looked at Python scraping before but haven't really done any myself. I'm sure I read about a Python scraping framework somewhere; should I use that?

So in essence this is what I need to scrape:

  1. Authors
  2. Description
  3. Field
  4. Then download the file and rename it with the paper name.

I will then put all this in either XML or a database, most probably XML, and then develop an interface etc. at a later date.
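As a rough sketch of the XML option, the scraped fields could be written out with Python's standard `xml.etree.ElementTree` module. The element names (`papers`, `paper`, `authors`, and so on), the helper names `safe_filename` and `papers_to_xml`, and the `.pdf` extension are all assumptions made for illustration; nothing here is specific to the intranet in question.

```python
import re
import xml.etree.ElementTree as ET


def safe_filename(title, extension=".pdf"):
    """Turn a paper title into a filesystem-friendly file name.

    The ".pdf" default is an assumption about the file type.
    """
    # Drop anything that is not a word character, whitespace, or a hyphen.
    cleaned = re.sub(r"[^\w\s-]", "", title).strip()
    # Collapse runs of whitespace into single underscores.
    cleaned = re.sub(r"\s+", "_", cleaned)
    return (cleaned or "untitled") + extension


def papers_to_xml(papers):
    """Serialise a list of paper dicts into one XML string.

    Each dict is assumed to have the keys "title", "authors" (a list),
    "description", and "field".
    """
    root = ET.Element("papers")
    for paper in papers:
        node = ET.SubElement(root, "paper")
        ET.SubElement(node, "title").text = paper["title"]
        authors = ET.SubElement(node, "authors")
        for name in paper["authors"]:
            ET.SubElement(authors, "author").text = name
        ET.SubElement(node, "description").text = paper["description"]
        ET.SubElement(node, "field").text = paper["field"]
        # Record the name the downloaded file would be saved under.
        ET.SubElement(node, "file").text = safe_filename(paper["title"])
    # ElementTree escapes characters such as "<" and "&" automatically.
    return ET.tostring(root, encoding="unicode")
```

Keeping the file name inside each `paper` element means a later interface can link the metadata to the downloaded file without re-deriving the name.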

Is this doable? Any ideas on where I should start?

Thanks in advance, LukeJenx

EDIT: The framework is Scrapy

EDIT: Turns out that I nearly killed the server today, so a lecturer is getting me the copies from the Network team... Thanks!


Solution

  • Scrapy is a great framework, and has really good documentation as well. You should start there.

    If you don't know XPath, I'd recommend learning it if you plan to use Scrapy (it's extremely easy!). XPath expressions help you "locate" the specific elements inside the HTML that you want to extract.

    Scrapy already has a built-in command-line option to export to XML, CSV, etc., e.g. scrapy crawl <spidername> -o <filename> -t xml

    Mechanize is another great option for writing scrapers easily.
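To give a feel for how XPath "locates" elements, here is a minimal sketch using only the standard library's `xml.etree.ElementTree`. It supports just a small subset of XPath and needs well-formed markup, so it stands in for Scrapy's own selectors rather than showing them. The page markup, the class names, and the `extract_paper` helper are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for one intranet page.
PAGE = """
<html>
  <body>
    <div class="paper">
      <h2 class="title">Example Paper</h2>
      <span class="author">A. Author</span>
      <span class="author">B. Writer</span>
      <p class="field">Physics</p>
      <a class="download" href="/files/example.pdf">Download</a>
    </div>
  </body>
</html>
"""


def extract_paper(markup):
    """Pull the fields of interest out of one page using XPath-style paths."""
    root = ET.fromstring(markup)
    # ".//div[@class='paper']" means: any <div> below the root whose
    # class attribute is exactly "paper".
    paper = root.find(".//div[@class='paper']")
    return {
        "title": paper.find("./h2[@class='title']").text,
        "authors": [s.text for s in paper.findall("./span[@class='author']")],
        "field": paper.find("./p[@class='field']").text,
        "url": paper.find("./a[@class='download']").get("href"),
    }


if __name__ == "__main__":
    print(extract_paper(PAGE))
```

Real pages are rarely well-formed XML, which is one reason to prefer a scraping framework's tolerant parser over this stdlib approach for anything beyond a quick experiment.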