My text file has hundreds of entries like the ones below. I want my code to catch each event, which has 14 or 15 elements separated by a delimiter (|), and put them in XML. Each event should be captured in a new tag.
6354|,EGZ|2023012711283700|900|DDIC|S000|R_JR_BTCJOBS_GENERATOR||1|25737,00088,B5|SAP_WORKFLOW_WIM_ACTION/11283700&JOB_CLOSE&&&&|43AE5E5C16990580E0063BBEAE21BEA8|42010A2A25FA1EDDA7CN
BDA81EE66224C|0000000000000000000000000000000000000\000000000000000000
6355|,EGZ|2023012711283700|900|DDIC|S000|R_JR_BTCJOBS_GENERATOR||1|25737,00088,B5|SAP_WORKFLOW_WIM_ACTION/11283700&JOB_CLOSE&&&&|43AE5E5C16990580E0063BBEAE21BEA8|42010A2A25FA1EDDA7CN
BDA81EE66224C|0000000000000000000000000000000000000\000000000000000000s
Expected output is this:
<?xml version='1.0' encoding='utf-8'?>
<Processes>
<name>
<Time>6354</Time>
<Client>,EGZ</Client>
<User>2023012711283700</User>
<number>900</number>
<processid>DDIC</processid>
<program>S000</program>
<randomnumber>R_JR_BTCJOBS_GENERATOR</randomnumber>
<processidandwp></processidandwp>
<userclient>1</userclient>
<transactionid>25737,00088,B5</transactionid>
<additional1>text</additional1>
<additional2>43AE5E5C16990580E0063BBEAE21BEA8</additional2>
<additional3>42010A2A25FA1EDDA7CN</additional3>
<additional4>BDA81EE66224C</additional4>
<additional5>000000000000000000/00000000000</additional5>
</name>
<name>
<Time>6355</Time>
<Client>,EGZ</Client>
<User>2023012711283700</User>
<number>900</number>
<processid>DDIC</processid>
<program>S000</program>
<randomnumber>R_JR_BTCJOBS_GENERATOR</randomnumber>
<processidandwp></processidandwp>
<userclient>1</userclient>
<transactionid>25737,00088,B5</transactionid>
<additional2>43AE5E5C16990580E0063BBEAE21BEA8</additional2>
<additional3>42010A2A25FA1EDDA7CN</additional3>
<additional4>BDA81EE66224C</additional4>
<additional5>000000000000000000/00000000000</additional5>
</name>
</Processes>
The current output that I get is this:
<?xml version='1.0' encoding='utf-8'?>
<Processes>
<name>
<Time>6354</Time>
<Client>,EGZ</Client>
<User>2023012711283700</User>
<number>900</number>
<processid>DDIC</processid>
<program>S000</program>
<randomnumber>R_JR_BTCJOBS_GENERATOR</randomnumber>
<processidandwp></processidandwp>
<userclient>1</userclient>
<transactionid>25737,00088,B5</transactionid>
<additional1>SAP_WORKFLOW_WIM_ACTION/</additional1>
<additional2>43AE5E5C16990580E0063BBEAE21BEA8</additional2>
<additional3>42010A2A25FA1EDDA7CN</additional3>
</name>
<name>
<Time>BDA81EE66224C</Time>
<Client>0000000000000000000000000000000000000\000000000000000000</Client>
</name>
<name>
<Time>6355</Time>
<Client>,EGZ</Client>
<User>2023012711283700</User>
<number>900</number>
<processid>DDIC</processid>
<program>S000</program>
<randomnumber>R_JR_BTCJOBS_GENERATOR</randomnumber>
<processidandwp></processidandwp>
<userclient>1</userclient>
<transactionid>25737,00088,B5</transactionid>
<additional1>SAP_WORKFLOW_WIM_ACTION/11</additional1>
<additional2>43AE5E5C16990580E0063BBEAE21BEA8</additional2>
<additional3>42010A2A25FA1EDDA7CN</additional3>
</name>
<name>
<Time>BDA81EE66224C</Time>
<Client>0000000000000000000000000000000000000\000000000000000000s</Client>
</name>
</Processes>
The code I have so far is this:
import csv
import xml.etree.ElementTree as ET

row_names = [
    'Time',
    'Client',
    'User',
    'number',
    'processid',
    'program',
    'randomnumber',
    'processidandwp',
    'userclient',
    'transactionid',
    'additional1',
    'additional2',
    'additional3',
    'additional4'
]

root = ET.Element("Processes")
counter = 0
with open("data.csv", 'r') as file:
    csv_reader = csv.reader(file, delimiter="|")
    sub_root = ET.SubElement(root, 'name')
    for row in csv_reader:
        for name in row:
            if counter < len(row_names) and name:
                ele = ET.SubElement(sub_root, row_names[counter])
                ele.text = name
                counter += 1
ET.dump(root)
If you compare my current output with the expected output: when the code reads the rows from the file, as soon as it reaches the 2nd row (for the 1st event) or the 4th row (for the 2nd event), it starts a new tag instead of continuing the same event. Does that make sense?
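For anyone reproducing this: csv.reader treats every physical line as a separate row, so a record that wraps onto a second line in the file comes back as two short rows rather than one event. A minimal sketch with made-up data:

```python
import csv
import io

# A record split across two physical lines, like the events in the
# log above (the sample data here is illustrative, not from the file)
sample = "a|b|c\nd|e\n"
rows = list(csv.reader(io.StringIO(sample), delimiter="|"))
print(rows)  # one row per physical line: [['a', 'b', 'c'], ['d', 'e']]
```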
Assuming an even number of lines in data.csv, the following refactored Python code may work for you. It combines every 2 lines into a single pipe-delimited record that is split into a list; for each item in the list it builds an XML node using the corresponding node name from the node_names list.
import xml.etree.ElementTree as ET
import itertools

node_names = ['Time', 'Client', 'User', 'number', 'processid',
              'program', 'randomnumber', 'processidandwp', 'userclient',
              'transactionid', 'additional1', 'additional2', 'additional3',
              'additional4', 'additional5']

root = ET.Element('Processes')
with open('data.csv') as f:
    for l1, l2 in itertools.zip_longest(*[f]*2):
        sub_root = ET.SubElement(root, 'name')
        for idx, item in enumerate("".join([l1.strip(), l2.strip()]).split("|")):
            ele = ET.SubElement(sub_root, node_names[idx])
            ele.text = item
ET.indent(root, space=" ", level=0)
ET.dump(root)
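Note that zip_longest pads a missing partner with None by default, so if the file ever ends with an unpaired line, l2.strip() would raise AttributeError; passing fillvalue="" avoids that. A small sketch with made-up lines standing in for the file handle:

```python
import itertools

# Three physical lines: the last record has no second line, so
# fillvalue="" pads it with an empty string instead of None
lines = ["6354|,EGZ|900\n", "0000\n", "6355|,EGZ|901\n"]

it = iter(lines)  # the same iterator passed twice, like *[f]*2
records = [(l1.strip() + l2.strip()).split("|")
           for l1, l2 in itertools.zip_longest(it, it, fillvalue="")]
print(records)  # [['6354', ',EGZ', '9000000'], ['6355', ',EGZ', '901']]
```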
Output:
<Processes>
<name>
<Time>6354</Time>
<Client>,EGZ</Client>
<User>2023012711283700</User>
<number>900</number>
<processid>DDIC</processid>
<program>S000</program>
<randomnumber>R_JR_BTCJOBS_GENERATOR</randomnumber>
<processidandwp />
<userclient>1</userclient>
<transactionid>25737,00088,B5</transactionid>
<additional1>SAP_WORKFLOW_WIM_ACTION/11283700&amp;JOB_CLOSE&amp;&amp;&amp;&amp;</additional1>
<additional2>43AE5E5C16990580E0063BBEAE21BEA8</additional2>
<additional3>42010A2A25FA1EDDA7CNBDA81EE66224C</additional3>
<additional4>0000000000000000000000000000000000000\000000000000000000</additional4>
</name>
<name>
<Time>6355</Time>
<Client>,EGZ</Client>
<User>2023012711283700</User>
<number>900</number>
<processid>DDIC</processid>
<program>S000</program>
<randomnumber>R_JR_BTCJOBS_GENERATOR</randomnumber>
<processidandwp />
<userclient>1</userclient>
<transactionid>25737,00088,B5</transactionid>
<additional1>SAP_WORKFLOW_WIM_ACTION/11283700&amp;JOB_CLOSE&amp;&amp;&amp;&amp;</additional1>
<additional2>43AE5E5C16990580E0063BBEAE21BEA8</additional2>
<additional3>42010A2A25FA1EDDA7CNBDA81EE66224C</additional3>
<additional4>0000000000000000000000000000000000000\000000000000000000s</additional4>
</name>
</Processes>
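ET.dump only prints to stdout and omits the <?xml ...?> declaration shown in the expected output. To produce a file that includes the declaration, ElementTree.write with xml_declaration=True can be used instead (a sketch; out.xml is an assumed filename):

```python
import xml.etree.ElementTree as ET

# Build a tiny tree the same way as above
root = ET.Element('Processes')
name = ET.SubElement(root, 'name')
ET.SubElement(name, 'Time').text = '6354'

ET.indent(root, space=" ")  # ET.indent requires Python 3.9+
# write() emits the XML declaration that ET.dump() omits
ET.ElementTree(root).write('out.xml', encoding='utf-8', xml_declaration=True)
```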
Verification that the source data.csv has 4 lines:
wc -l data.csv
4 data.csv