python xml python-3.x parsing data-processing

Iterate through a huge XML file and get the values?

I want to iterate through the Users file from the Stack Overflow data dump. The problem is that it is huge and it is XML, which is a new topic for me. I have read several pieces of documentation and Stack Overflow posts, but for some reason it doesn't work.

XML Format:

<users>
  <row Id="-1" Reputation="1" 
  CreationDate="2008-07-31T00:00:00.000" 
  DisplayName="Community" 
  LastAccessDate="2008-08-26T00:16:53.810" 
  WebsiteUrl="http://meta.stackexchange.com/" 
  Location="on the server farm" AboutMe="&lt;p&gt;Hi, I'm not really a person." Views="649" UpVotes="245983" DownVotes="924377" AccountId="-1" 
  />
</users>

The Code:

from xml.etree.ElementTree import iterparse

for evt, elem in iterparse('data/Users.xml', events=('start','end')):
    print(evt, elem)

What I get:

The for loop prints a bunch of hex addresses, and at the end I get a memory error. Maybe that is normal, because when I tried a second time it iterated through the XML very fast (0.13 seconds).

start <Element 'row' at 0x04CC16F0>
end <Element 'row' at 0x04CC16F0>
start <Element 'row' at 0x04CC1810>

I hope you can help with this question: how do I get the values out of this output? I want to save them to a SQL database.

The whole dump is 199 GB (Badges, Comments, PostLinks, PostHistory, Users, Posts, Tags, and Votes). The Users.xml file this question is about is 2.49 GB, but I want to load all of the Stack Overflow data into the database.

Yours faithfully

HanahDevelope

Solution

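What you are printing is just the repr of each Element object, which is why you see memory addresses. The values live in the element's attributes, so it looks like you just need to handle the end event of each row element and do something with elem.attrib: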

from xml.etree.ElementTree import iterparse

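# Only 'end' events are needed: each element is fully parsed by then.
for evt, elem in iterparse('data/Users.xml', events=('end',)):
    if elem.tag == 'row':
        user_fields = dict(elem.attrib)  # copy: attribute name -> string value
        print(user_fields)
        elem.clear()  # release the element's data to limit memory growth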

For the sample row above, this will output (key order may differ):

{'DisplayName': 'Community', 'Views': '649', 'DownVotes': '924377', 'LastAccessDate': '2008-08-26T00:16:53.810', 'Id': '-1', 'WebsiteUrl': 'http://meta.stackexchange.com/', 'Reputation': '1', 'Location': 'on the server farm', 'UpVotes': '245983', 'CreationDate': '2008-07-31T00:00:00.000', 'AboutMe': "<p>Hi, I'm not really a person.", 'AccountId': '-1'}
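
Since the goal is to get the rows into SQL without running out of memory, here is a minimal sketch of one way that could look. It is not the only approach: it assumes SQLite via the standard sqlite3 module, a hypothetical `users` table holding only four of the attributes, and a `batch_size` parameter that I made up for illustration. The root element is cleared after every row so the already processed children do not pile up in memory.

```python
import sqlite3
from xml.etree.ElementTree import iterparse


def iter_rows(source):
    """Yield the attributes of each <row> as a dict, freeing memory as we go."""
    context = iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "row":
            yield dict(elem.attrib)  # copy before the element is discarded
            root.clear()  # drop processed children so the tree stays small


def load_users(source, db_path, batch_size=10000):
    """Insert rows into a hypothetical `users` table in batches."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users ("
        "Id INTEGER PRIMARY KEY, DisplayName TEXT, "
        "Reputation INTEGER, CreationDate TEXT)"
    )
    sql = "INSERT OR REPLACE INTO users VALUES (?, ?, ?, ?)"
    batch = []
    for attrs in iter_rows(source):
        # Attribute values are always strings, so convert the numeric ones.
        batch.append((
            int(attrs["Id"]),
            attrs.get("DisplayName"),
            int(attrs.get("Reputation", 0)),
            attrs.get("CreationDate"),
        ))
        if len(batch) >= batch_size:
            conn.executemany(sql, batch)
            conn.commit()
            batch.clear()
    if batch:  # flush whatever is left after the last full batch
        conn.executemany(sql, batch)
    conn.commit()
    return conn
```

Calling `load_users('data/Users.xml', 'so.db')` would then stream the file once; the database file name is an assumption too. The other dump files also appear to use row elements with attributes, so the same generator should be reusable with a different table definition, but check each file's attributes first.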