I'm pretty new to programming and this is the first time I've used XML, but for class I'm doing a gender classification project with a dataset of blogs. I have a folder of XML files. I need to build a list of the file names in that folder, then loop over the list, open each XML file, extract what I need (e.g. the text and the class), and store that in another variable, such as a list or dictionary.
I tried something, but it isn't right and I'm kind of stuck. Can someone help me? This is what I have so far:
    path ='\\Users\\name\\directory\\folder'
    dir = os.listdir( path )
    def select_files_in_folder(dir, ext):
        for filename in os.listdir(path):
            fullname= os.path.join(path, filename)
            tree = ET.parse(fullname)
            for elem in doc.findall('gender'):
                print(elem.get('gender'), elem.text)
If you want to build a list of all the XML files in a given directory, you can do the following:
    import os

    def get_xml_files(path):
        xml_list = []
        for filename in os.listdir(path):
            # keep only names ending in ".xml"
            if filename.endswith(".xml"):
                # store the full path so the file can be opened later
                xml_list.append(os.path.join(path, filename))
        return xml_list
Keep in mind that this does not recurse into subfolders, and it assumes that the XML file names end with .xml.
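If the XML files are spread over subfolders, a recursive variant could look like the sketch below. It is only an illustration: the function name get_xml_files_recursive is made up for this example, and matching the extension case-insensitively is an assumption about how the files are named.

```python
import os

def get_xml_files_recursive(path):
    """Return the paths of all .xml files under path, including subfolders."""
    xml_list = []
    # os.walk yields one (directory, subdirectories, files) tuple per folder
    for dirpath, _dirnames, filenames in os.walk(path):
        for filename in filenames:
            # lower() so that names such as "POST.XML" are matched as well
            if filename.lower().endswith(".xml"):
                xml_list.append(os.path.join(dirpath, filename))
    # sort so the order does not depend on the file system
    return sorted(xml_list)
```

Calling get_xml_files_recursive(path) then gives the same kind of list as get_xml_files(path), just with the files from nested folders included.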
EDIT:
Parsing XML is highly dependent on the library you're using. From your code I guess you're using xml.etree.ElementTree (keep in mind this library is not secure against maliciously constructed data).
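    import xml.etree.ElementTree as ET

    def get_xml_data(filenames):
        data = []
        for filename in filenames:
            # parse() returns an ElementTree; findall() searches from its root
            tree = ET.parse(filename)
            # extend, so the results from earlier files are kept
            data.extend(elem.text for elem in tree.findall("whatever you want to get"))
        return data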
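To get both the text and the class for each blog, one option is to build one dictionary per file. The sketch below is a guess at the layout, since the question doesn't show a sample file: it assumes the class sits in a gender element directly under the root and the text in post elements. Both tag names, and the helper names load_blog and load_blogs, are hypothetical and need to be adapted to the real files.

```python
import xml.etree.ElementTree as ET

def load_blog(filename, label_tag="gender", text_tag="post"):
    """Parse one file into a dict holding its label and its texts.

    label_tag and text_tag are assumed tag names, not taken from the dataset.
    """
    root = ET.parse(filename).getroot()
    # findtext returns None when the root has no such child element
    label = root.findtext(label_tag)
    # iter() visits every matching element at any depth;
    # elem.text is None for empty elements, hence the "or"
    posts = [(elem.text or "").strip() for elem in root.iter(text_tag)]
    return {"file": filename, "label": label, "posts": posts}

def load_blogs(filenames):
    """Parse every file in filenames, skipping files that are not valid XML."""
    records = []
    for filename in filenames:
        try:
            records.append(load_blog(filename))
        except ET.ParseError as err:
            print(f"Skipping {filename}: {err}")
    return records
```

With the list from get_xml_files(path), load_blogs(get_xml_files(path)) would return a list of dictionaries that can be fed into the classification step.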