.net, python, ironpython

IronPython, Beautiful Soup, Win32 app


Does Beautiful Soup work with IronPython? If so, with which version of IronPython? How easy is it to distribute a Windows desktop app on .NET 2.0 using IronPython (mostly C# calling some Python code for parsing HTML)?


Solution

  • I was asking myself this same question, and after struggling to follow advice here and elsewhere to get IronPython and BeautifulSoup to play nicely with my existing code, I decided to look for a native .NET alternative. BeautifulSoup is a wonderful bit of code, and at first it didn't look like anything comparable was available for .NET, but then I found the HTML Agility Pack. If anything, I think I've gained some maintainability over BeautifulSoup. It takes clean or crufty HTML and produces an elegant XML DOM that can be queried via XPath. With a couple of lines of code you can even get back a raw XDocument and then craft your queries in LINQ to XML. Honestly, if web scraping is your goal, this is about the cleanest solution you are likely to find.

    Edit

    Here is a simple (read: not robust at all) example that parses out the US House of Representatives holiday schedule:

    using System;
    using System.Collections.Generic;
    using HtmlAgilityPack;
    
    namespace GovParsingTest
    {
        class Program
        {
            static void Main(string[] args)
            {
                HtmlWeb hw = new HtmlWeb();
                string url = @"http://www.house.gov/house/House_Calendar.shtml";
                HtmlDocument doc = hw.Load(url);
    
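                // SelectSingleNode and SelectNodes return null when nothing matches,
                // so check before dereferencing.
                HtmlNode docNode = doc.DocumentNode;
                HtmlNode div = docNode.SelectSingleNode("//div[@id='primary']");
                HtmlNodeCollection tableRows = (div == null) ? null : div.SelectNodes(".//tr");
                if (tableRows == null)
                {
                    Console.WriteLine("No calendar rows found.");
                    return;
                }
    
                foreach (HtmlNode row in tableRows)
                {
                    // Skip header rows and any row without a date/event pair.
                    HtmlNodeCollection cells = row.SelectNodes(".//td");
                    if (cells == null || cells.Count < 2)
                    {
                        continue;
                    }
    
                    HtmlNode dateNode = cells[0];
                    HtmlNode eventNode = cells[1];
    
                    // Descend to the innermost first child so only the leading
                    // text of the event cell is printed.
                    while (eventNode.HasChildNodes)
                    {
                        eventNode = eventNode.FirstChild;
                    }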
    
                    Console.WriteLine(dateNode.InnerText);
                    Console.WriteLine(eventNode.InnerText);
                    Console.WriteLine();
                }
    
                //Console.WriteLine(div.InnerHtml);
                Console.ReadKey();
            }
        }
    }
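
The XDocument route mentioned in the answer can be sketched roughly as follows. This is a minimal sketch and not necessarily how the answer's author did it. It assumes .NET 3.5 or later (LINQ to XML is not available on .NET 2.0) and that the XML the HTML Agility Pack writes is well-formed enough for XDocument.Parse, which may not hold for every real page. ToXDocument is a hypothetical helper name, and the markup is a made-up sample.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Xml.Linq;
using HtmlAgilityPack;

class XDocumentSketch
{
    // Hypothetical helper (not part of the HTML Agility Pack): write the parsed
    // HTML out as XML text, then re-parse that text with LINQ to XML.
    static XDocument ToXDocument(HtmlDocument doc)
    {
        doc.OptionOutputAsXml = true;
        using (StringWriter writer = new StringWriter())
        {
            doc.Save(writer);
            return XDocument.Parse(writer.ToString());
        }
    }

    static void Main()
    {
        // Made-up sample markup; a real page would come from HtmlWeb.Load(url).
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml("<html><body><table><tr><td>January 1</td>"
                   + "<td>New Year Holiday</td></tr></table></body></html>");

        XDocument xdoc = ToXDocument(doc);

        // Query with LINQ to XML instead of XPath: one output line per table row.
        foreach (XElement row in xdoc.Descendants("tr"))
        {
            string[] cells = row.Elements("td")
                                .Select(td => td.Value.Trim())
                                .ToArray();
            if (cells.Length >= 2)
            {
                Console.WriteLine(cells[0] + ": " + cells[1]);
            }
        }
    }
}
```

Going through a string keeps the sketch short. For large pages, or pages with unusual entities, the XML output may need cleanup before XDocument.Parse accepts it, so treat this as a starting point only.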