beautifulsoup scrapy elementtree minidom celementtree

Which library is best to learn for web scraping and XML parsing?


I am confused by the many libraries that do the same job. I want to learn one library that handles both XML and HTML parsing. Is ElementTree suitable for HTML parsing? I have heard about lxml, xml.etree.ElementTree, Beautiful Soup, minidom, and Scrapy. Can anybody help?


Solution

  • Scrapy is a framework for scraping web pages (crawling sites and extracting data from them), hence the name.

    Beautiful Soup is a library for parsing and pulling data out of XML and HTML files.

    xml.etree.ElementTree is the XML processing module in the xml package of Python's standard library. It represents an XML file as a tree of element objects and is neat to use for parsing and manipulating data in XML format.

    lxml is, as its authors claim, compatible with yet superior to ElementTree from the Python xml package, but it essentially does the same job. I have never used it for parsing HTML files, though.

    In my experience, I used Scrapy for fetching data from various user panels that did not have any kind of API for pulling the data. For parsing HTML files, however, I mostly used Beautiful Soup, as it is really neat and easy to use. For XML parsing I mostly used the Python xml package. I never had any complicated XML parsing to perform, so it covered everything I needed.

    The right tool really depends on your requirements. If you need one library to parse both XML and HTML files, I would go with Beautiful Soup, as it is really easy to use and there is plenty of documentation online.
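
To make the comparison above a bit more concrete, here is a minimal sketch of both approaches: xml.etree.ElementTree for a small XML document and Beautiful Soup for a slightly malformed HTML snippet. The sample documents and the helper names `book_titles` and `link_targets` are made up for illustration. The Beautiful Soup part assumes the third-party `beautifulsoup4` package is installed and uses Python's built-in `html.parser` backend.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, invented for this example.
XML_DOC = """<catalog>
  <book id="b1"><title>Dive Into Python</title><price>29.99</price></book>
  <book id="b2"><title>Fluent Python</title><price>39.50</price></book>
</catalog>"""

HTML_DOC = """<html><body>
<a href="/first">First</a>
<p>Unclosed paragraph
<a href="/second">Second</a>
</body></html>"""


def book_titles(xml_text):
    """Return (id, title) pairs for every <book> element, using ElementTree."""
    root = ET.fromstring(xml_text)  # requires well-formed XML
    return [(book.get("id"), book.findtext("title")) for book in root.iter("book")]


def link_targets(html_text):
    """Return the href of every <a> tag, using Beautiful Soup."""
    # Imported here so the XML part still runs without the third-party package.
    from bs4 import BeautifulSoup

    # "html.parser" is the parser backend from the standard library.
    soup = BeautifulSoup(html_text, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]


if __name__ == "__main__":
    print(book_titles(XML_DOC))
    try:
        print(link_targets(HTML_DOC))
    except ImportError:
        print("Beautiful Soup is not installed; skipping the HTML example")
```

Note that ElementTree expects well-formed XML, so real-world HTML like the snippet above (with its unclosed `<p>`) would likely make `ET.fromstring` raise a parse error, while Beautiful Soup is designed to cope with such markup.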