Single out tags in an xml document?

I have what I believe to be a fairly simple issue.

I've retrieved a file from gdata, this file: https://gdata.youtube.com/feeds/api/videos/Ej4_G-E1cAM/comments

I'm attempting to single out the text between the

"<author>HERE</author>"

tags so I'll be left with output containing only usernames. Is Python even the best way to go about this, or should I use another language? I've been googling since 8:00am (4 hours) and I've yet to find anything for such a seemingly easy task.

Best regards, - Mitch Powell

Solution

You have an Atom feed there, so I'd use feedparser to handle it:

import feedparser

result = feedparser.parse('https://gdata.youtube.com/feeds/api/videos/Ej4_G-E1cAM/comments')
for entry in result.entries:
    print(entry.author)

This prints:

FreebieFM
micromicros
FreebieFM
Sarah Grimstone
FreebieFM
# etc.

Feedparser is an external library, but it is easily installed. If you are limited to the standard library, you could use the ElementTree API instead, but to parse the Atom feed you need to include HTML entities in the parser, and you'll have to deal with namespaces (not ElementTree's strong point):

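try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2
from xml.etree import ElementTree

# Fetch the feed and parse the response into an element tree
response = urlopen('https://gdata.youtube.com/feeds/api/videos/Ej4_G-E1cAM/comments')
tree = ElementTree.parse(response)

# Map the a: prefix to the Atom namespace used by the feed's elements
nsmap = {'a': 'http://www.w3.org/2005/Atom'}
for author in tree.findall('.//a:author/a:name', namespaces=nsmap):
    print(author.text)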

The nsmap dictionary lets ElementTree translate the a: prefix to the correct namespace for those elements.
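
To see what that translation amounts to without fetching anything, here is a minimal sketch that runs against a small inline document. The SAMPLE_FEED string is made up for illustration (it only mimics the shape of an Atom comments feed), and the variable names are my own. It compares the prefixed path with the fully expanded {namespace}tag form that ElementTree uses internally, and shows that a path without any namespace finds nothing:

```python
from xml.etree import ElementTree

# Hypothetical sample data, shaped like a tiny Atom comments feed.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <author><name>FreebieFM</name></author>
    <content>First comment</content>
  </entry>
  <entry>
    <author><name>micromicros</name></author>
    <content>Second comment</content>
  </entry>
</feed>
"""

ATOM_NS = 'http://www.w3.org/2005/Atom'
nsmap = {'a': ATOM_NS}

root = ElementTree.fromstring(SAMPLE_FEED)

# Prefixed path: the a: prefix is looked up in nsmap.
prefixed = [el.text for el in root.findall('.//a:author/a:name', namespaces=nsmap)]

# Expanded path: the same query with the namespace written out in full.
expanded = [el.text for el in root.findall('.//{%s}author/{%s}name' % (ATOM_NS, ATOM_NS))]

# No namespace at all: the elements live in the Atom namespace, so nothing matches.
unqualified = [el.text for el in root.findall('.//author/name')]

for name in prefixed:
    print(name)
```

Both the prefixed and the expanded query return the same two names here. The prefix itself (a) is arbitrary; only the namespace URI it maps to has to match the one declared in the feed.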