python xml nlp text-classification

Reading XML files from a folder into a list


I'm pretty new to programming and this is the first time I've used XML, but for class I'm doing a gender classification project with a dataset of blogs. I have a folder of XML files. First I need to build a list of the file names in it. Then I want to loop over that list, open each XML file, extract what I need (e.g. the text and the class), and store it in another variable, such as a list or a dictionary.

I tried something, but it isn't right and I'm kind of stuck. Can someone help me? This is what I have so far:

path ='\\Users\\name\\directory\\folder'
dir = os.listdir( path )
def select_files_in_folder(dir, ext):
    for filename in os.listdir(path):
        fullname= os.path.join(path, filename)
        tree = ET.parse(fullname)
    for elem in doc.findall('gender'):
        print(elem.get('gender'), elem.text)

Solution

  • If you want to build a list of all the XML files in a given directory, you can do the following:

    import os

    def get_xml_files(path):
        xml_list = []
        for filename in os.listdir(path):
            # keep only names ending in .xml, stored as full paths
            if filename.endswith(".xml"):
                xml_list.append(os.path.join(path, filename))
        return xml_list
    

    Keep in mind that this does not recurse into subfolders, and it assumes that the names of the XML files end with .xml.

    EDIT :

    Parsing XML is highly dependent on the library you use. From your code I guess you're using xml.etree.ElementTree (keep in mind that this library is not secure against maliciously constructed data).

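    import xml.etree.ElementTree as ET

    def get_xml_data(xml_files):
        data = []
        for filename in xml_files:
            root = ET.parse(filename).getroot()
            # extend rather than assign, so results from earlier files are kept
            data.extend(elem.text for elem in root.findall("whatever you want to get"))
        return data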
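
Putting the two steps together for the classification task, a minimal sketch could look like the one below. The real layout of the blog files is not shown in the question, so the element names `post` (text) and `gender` (class) and the helper `load_samples` are assumptions for illustration only; swap in whatever tags your files actually use. Files that are not well-formed XML are skipped rather than stopping the loop.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def get_xml_files(path):
    """Return the full paths of the .xml files directly inside `path`, sorted."""
    return sorted(
        os.path.join(path, name)
        for name in os.listdir(path)
        if name.lower().endswith(".xml")
    )


def load_samples(path, text_tag="post", label_tag="gender"):
    """Return a list of (text, label) pairs, one per parsable file.

    `text_tag` and `label_tag` are hypothetical element names; adjust them
    to the real structure of your files.
    """
    samples = []
    for filename in get_xml_files(path):
        try:
            root = ET.parse(filename).getroot()
        except ET.ParseError:
            continue  # skip files that are not well-formed XML
        # First matching label element anywhere below the root, or None.
        label = root.findtext(f".//{label_tag}")
        # Join the text of every matching text element in document order.
        text = " ".join((elem.text or "").strip() for elem in root.iter(text_tag))
        if label is not None:
            samples.append((text, label.strip()))
    return samples


if __name__ == "__main__":
    # Small self-contained demo using a temporary folder with one made-up file.
    with tempfile.TemporaryDirectory() as folder:
        with open(os.path.join(folder, "blog1.xml"), "w", encoding="utf-8") as f:
            f.write("<Blog><gender>female</gender><post>Hello world</post></Blog>")
        print(load_samples(folder))  # [('Hello world', 'female')]
```

Returning a list of (text, label) pairs keeps each text next to its class, which is usually the shape a classifier's training code expects; a dictionary keyed by file name would work just as well if you need to trace samples back to their files.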