Single out tags in an xml document?


I have what I believe to be a fairly simple issue.

I've retrieved a file from gdata, this file: https://gdata.youtube.com/feeds/api/videos/Ej4_G-E1cAM/comments

I'm attempting to single out the text between the

"<author>HERE</author>"

tags, so that I'm left with output containing only usernames. Is Python even the best way to go about this, or should I use another language? I've been googling since 8:00am (4hrs) and I've yet to find anything for such a seemingly easy task.

Best regards, - Mitch Powell


Solution

  • You have an Atom feed there, so I'd use feedparser to handle that:

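    import feedparser
    
    # feedparser accepts a URL directly and fetches the feed itself
    result = feedparser.parse('https://gdata.youtube.com/feeds/api/videos/Ej4_G-E1cAM/comments')
    for entry in result.entries:
        # each entry exposes the author's name as an attribute
        print(entry.author)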
    

    This prints:

    FreebieFM
    micromicros
    FreebieFM
    Sarah Grimstone
    FreebieFM
    # etc.
    

    Feedparser is an external library, but it is easy to install (for example with pip install feedparser). If you have to use only the standard library, you could use the ElementTree API, but to parse the Atom feed you need to include HTML entities in the parser, and you'll have to deal with namespaces (not ElementTree's strong point):

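    from urllib.request import urlopen  # on Python 2: from urllib2 import urlopen
    from xml.etree import ElementTree
    
    # fetch the feed and parse the response as XML
    response = urlopen('https://gdata.youtube.com/feeds/api/videos/Ej4_G-E1cAM/comments')
    tree = ElementTree.parse(response)
    
    # map a short prefix of our choosing to the Atom namespace URI
    nsmap = {'a': 'http://www.w3.org/2005/Atom'}
    for author in tree.findall('.//a:author/a:name', namespaces=nsmap):
        print(author.text)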
    

    The nsmap dictionary lets ElementTree translate the a: prefix to the correct namespace for those elements.
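
To try the namespace handling without fetching anything, here is a minimal sketch that runs against a small inline document. The SAMPLE_FEED string is a hand-written stand-in and not the real feed, and the helper names author_names and author_names_clark are made up for illustration. The second helper spells the namespace out in each path step using ElementTree's {uri}tag notation, an alternative to passing a prefix map.

```python
from xml.etree import ElementTree

# Hypothetical, hand-written sample standing in for the downloaded Atom feed.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <author><name>FreebieFM</name></author>
    <content>First comment</content>
  </entry>
  <entry>
    <author><name>micromicros</name></author>
    <content>Second comment</content>
  </entry>
</feed>
"""

ATOM_NS = 'http://www.w3.org/2005/Atom'


def author_names(xml_text):
    """Return author names, resolving the 'a:' prefix through a prefix map."""
    root = ElementTree.fromstring(xml_text)
    nsmap = {'a': ATOM_NS}
    return [name.text
            for name in root.findall('.//a:author/a:name', namespaces=nsmap)]


def author_names_clark(xml_text):
    """Return author names, writing the namespace URI into the path itself."""
    root = ElementTree.fromstring(xml_text)
    path = './/{%s}author/{%s}name' % (ATOM_NS, ATOM_NS)
    return [name.text for name in root.findall(path)]


if __name__ == '__main__':
    for name in author_names(SAMPLE_FEED):
        print(name)
```

Both helpers return the same list for the sample. The prefix-map form is usually easier to read when several elements share one namespace.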