python xml python-3.x parsing data-processing

Iterate through a huge XML file and get the values?


I want to iterate through the Users file of the Stack Overflow data dump. The problem is that it is huge and it is XML, which is a new topic for me. I have read several pieces of documentation and Stack Overflow posts, but for some reason it doesn't work.

XML Format:

<users>
  <row Id="-1" Reputation="1" 
  CreationDate="2008-07-31T00:00:00.000" 
  DisplayName="Community" 
  LastAccessDate="2008-08-26T00:16:53.810" 
  WebsiteUrl="http://meta.stackexchange.com/" 
  Location="on the server farm" AboutMe="&lt;p&gt;Hi, I'm not really a person." Views="649" UpVotes="245983" DownVotes="924377" AccountId="-1" 
  />
</users>

The Code:

from xml.etree.ElementTree import iterparse

for evt, elem in iterparse('data/Users.xml', events=('start','end')):
    print(evt, elem)

What I get:

The for loop prints a long series of element objects with hexadecimal memory addresses, and at the end I get a memory exception. Maybe that is normal, because when I tried it a second time it iterated through the XML very quickly, in 0.13 seconds.

start <Element 'row' at 0x04CC16F0>
end <Element 'row' at 0x04CC16F0>
start <Element 'row' at 0x04CC1810>

I hope you can help with this question. How do I get the values out of this output? I want to save them into SQL.

The whole dump is 199 GB (Badge, Comment, PostLinks, PostHistory, Users, Posts, Tags and Votes). The Users.xml file this question is about is 2.49 GB, but I want to put all of the data from SO into the database.

Yours faithfully

HanahDevelope


Solution

  • It looks like you just need to handle the end event of each row element and do something with its attributes:

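    from xml.etree.ElementTree import iterparse
    
    # Only 'end' events are requested: by then each element is complete.
    for evt, elem in iterparse('data/Users.xml', events=('end',)):
        if elem.tag == 'row':
            # Each user's fields are stored as XML attributes of <row>.
            user_fields = elem.attrib
            print(user_fields)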
    

    This will output:

    {'DisplayName': 'Community', 'Views': '649', 'DownVotes': '924377', 'LastAccessDate': '2008-08-26T00:16:53.810', 'Id': '-1', 'WebsiteUrl': 'http://meta.stackexchange.com/', 'Reputation': '1', 'Location': 'on the server farm', 'UpVotes': '245983', 'CreationDate': '2008-07-31T00:00:00.000', 'AboutMe': "<p>Hi, I'm not really a person.", 'AccountId': '-1'}
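
The loop above still leaves every parsed row attached to the root element, so memory can keep growing on a multi-gigabyte file, which may be related to the memory exception mentioned in the question. Below is a minimal sketch of one common way to keep memory bounded and save the rows to SQL. It assumes SQLite through the standard-library sqlite3 module; the function names iter_rows and load_users, the users table, and its three columns are hypothetical choices for illustration and not part of the original answer.

```python
import sqlite3
from xml.etree.ElementTree import iterparse


def iter_rows(source):
    """Yield the attributes of each <row> as a dict, discarding parsed rows."""
    context = iterparse(source, events=('start', 'end'))
    _, root = next(context)  # the first event is the 'start' of the root element
    for event, elem in context:
        if event == 'end' and elem.tag == 'row':
            yield dict(elem.attrib)  # copy the attributes before clearing
            root.clear()             # detach rows that were already processed


def load_users(source, db_path, batch_size=10_000):
    """Insert a few columns into a hypothetical `users` table, in batches."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        'CREATE TABLE IF NOT EXISTS users '
        '(id INTEGER PRIMARY KEY, display_name TEXT, reputation INTEGER)'
    )
    insert_sql = 'INSERT OR REPLACE INTO users VALUES (?, ?, ?)'
    batch = []
    for row in iter_rows(source):
        # Attribute values arrive as strings; convert the numeric ones.
        batch.append((
            int(row['Id']),
            row.get('DisplayName'),
            int(row.get('Reputation', 0)),
        ))
        if len(batch) >= batch_size:
            conn.executemany(insert_sql, batch)
            conn.commit()
            batch.clear()
    if batch:  # write whatever is left after the last full batch
        conn.executemany(insert_sql, batch)
        conn.commit()
    return conn
```

Something like `load_users('data/Users.xml', 'stackoverflow.db')` would then stream the file once. The same iter_rows generator should be reusable for the other dump files, assuming they follow the same layout of row elements with attributes; only the table definition and the column mapping would need to change.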