
Pick up 2 rows from CSV and convert to XML


My text file has hundreds of entries like the ones below. I want my code to capture each event, which has 14 or 15 elements separated by a delimiter ( | ), and put them in XML. Each event should be captured in a new tag.

6354|,EGZ|2023012711283700|900|DDIC|S000|R_JR_BTCJOBS_GENERATOR||1|25737,00088,B5|SAP_WORKFLOW_WIM_ACTION/11283700&JOB_CLOSE&&&&|43AE5E5C16990580E0063BBEAE21BEA8|42010A2A25FA1EDDA7CN
BDA81EE66224C|0000000000000000000000000000000000000\000000000000000000
6355|,EGZ|2023012711283700|900|DDIC|S000|R_JR_BTCJOBS_GENERATOR||1|25737,00088,B5|SAP_WORKFLOW_WIM_ACTION/11283700&JOB_CLOSE&&&&|43AE5E5C16990580E0063BBEAE21BEA8|42010A2A25FA1EDDA7CN
BDA81EE66224C|0000000000000000000000000000000000000\000000000000000000s

Expected output is this:
 <?xml version='1.0' encoding='utf-8'?>
 <Processes>
  <name>
   <Time>6354</Time>
   <Client>,EGZ</Client>
   <User>2023012711283700</User>
   <number>900</number>
   <processid>DDIC</processid>
   <program>S000</program>
   <randomnumber>R_JR_BTCJOBS_GENERATOR</randomnumber>
   <processidandwp></processidandwp>
   <userclient>1</userclient>
   <transactionid>25737,00088,B5</transactionid>
   <additional1>text</additional1>
   <additional2>43AE5E5C16990580E0063BBEAE21BEA8</additional2>
   <additional3>42010A2A25FA1EDDA7CN</additional3>
   <additional4>BDA81EE66224C</additional4>
   <additional5>000000000000000000/00000000000</additional5>
  </name>
  <name>
   <Time>6355</Time>
   <Client>,EGZ</Client>
   <User>2023012711283700</User>
   <number>900</number>
   <processid>DDIC</processid>
   <program>S000</program>
   <randomnumber>R_JR_BTCJOBS_GENERATOR</randomnumber>
   <processidandwp></processidandwp>
   <userclient>1</userclient>
   <transactionid>25737,00088,B5</transactionid>
   <additional2>43AE5E5C16990580E0063BBEAE21BEA8</additional2>
   <additional3>42010A2A25FA1EDDA7CN</additional3>
   <additional4>BDA81EE66224C</additional4>
   <additional5>000000000000000000/00000000000</additional5>
  </name>
 </Processes>

The current output that I get is this:
 <?xml version='1.0' encoding='utf-8'?>
 <Processes>
  <name>
  <Time>6354</Time>
  <Client>,EGZ</Client>
  <User>2023012711283700</User>
  <number>900</number>
  <processid>DDIC</processid>
  <program>S000</program>
  <randomnumber>R_JR_BTCJOBS_GENERATOR</randomnumber>
  <processidandwp></processidandwp>
  <userclient>1</userclient>
  <transactionid>25737,00088,B5</transactionid>
  <additional1>SAP_WORKFLOW_WIM_ACTION/</additional1>
  <additional2>43AE5E5C16990580E0063BBEAE21BEA8</additional2>
  <additional3>42010A2A25FA1EDDA7CN</additional3>
 </name>
 <name>
  <Time>BDA81EE66224C</Time>
  <Client>0000000000000000000000000000000000000\000000000000000000</Client>
 </name>
 <name>
  <Time>6355</Time>
  <Client>,EGZ</Client>
  <User>2023012711283700</User>
  <number>900</number>
  <processid>DDIC</processid>
  <program>S000</program>
  <randomnumber>R_JR_BTCJOBS_GENERATOR</randomnumber>
  <processidandwp></processidandwp>
  <userclient>1</userclient>
  <transactionid>25737,00088,B5</transactionid>
  <additional1>SAP_WORKFLOW_WIM_ACTION/11</additional1>
  <additional2>43AE5E5C16990580E0063BBEAE21BEA8</additional2>
  <additional3>42010A2A25FA1EDDA7CN</additional3>
 </name>
 <name>
  <Time>BDA81EE66224C</Time>
  <Client>0000000000000000000000000000000000000\000000000000000000s</Client>
 </name>
</Processes>

The code I have so far is this:
import csv
import xml.etree.ElementTree as ET

row_names = [
 'Time',
 'Client',
 'User',
 'number',
 'processid',
 'program',
 'randomnumber',
 'processidandwp',
 'userclient',
 'transactionid',
 'additional1',
 'additional2',
 'additional3',
 'additional4'
]
root = ET.Element("Processes")
counter = 0
with open("data.csv", 'r') as file:
 csv_reader = csv.reader(file, delimiter="|")
 sub_root = ET.SubElement(root, 'name')
 for row in csv_reader:
    for name in row:
        if counter < len(row_names) and name:
            ele = ET.SubElement(sub_root, row_names[counter])
            ele.text = name
            counter += 1

ET.dump(root)

Comparing my current output with the expected output, I want to get the expected one. Right now, when the code reads the rows from the file, it creates a new tag as soon as it reaches the 2nd row (for the 1st event) or the 4th row (for the 2nd event). Does that make sense?


Solution

  • Assuming data.csv has an even number of lines, the following refactored Python code may work for you. It combines every two lines into a single pipe-delimited record and splits that record into a list. For each item in the list, it builds an XML node using the corresponding name from the node_names list.

    import itertools
    import xml.etree.ElementTree as ET
    
    # One tag name per pipe-delimited field, in order
    node_names = ['Time','Client','User','number','processid',
    'program','randomnumber','processidandwp','userclient','transactionid',
    'additional1','additional2','additional3','additional4','additional5']
    
    root = ET.Element('Processes')
    with open('data.csv') as f:
        # Passing the same file iterator twice yields consecutive line pairs;
        # fillvalue='' avoids calling .strip() on None if the line count is odd
        for l1, l2 in itertools.zip_longest(f, f, fillvalue=''):
            sub_root = ET.SubElement(root, 'name')
            # Join the two physical lines into one record, then split on '|'
            for idx, item in enumerate((l1.strip() + l2.strip()).split('|')):
                ele = ET.SubElement(sub_root, node_names[idx])
                ele.text = item
    
    ET.indent(root, space="  ", level=0)  # requires Python 3.9+
    ET.dump(root)
    

    Output:

    <Processes>
      <name>
        <Time>6354</Time>
        <Client>,EGZ</Client>
        <User>2023012711283700</User>
        <number>900</number>
        <processid>DDIC</processid>
        <program>S000</program>
        <randomnumber>R_JR_BTCJOBS_GENERATOR</randomnumber>
        <processidandwp />
        <userclient>1</userclient>
        <transactionid>25737,00088,B5</transactionid>
        <additional1>SAP_WORKFLOW_WIM_ACTION/11283700&amp;JOB_CLOSE&amp;&amp;&amp;&amp;</additional1>
        <additional2>43AE5E5C16990580E0063BBEAE21BEA8</additional2>
        <additional3>42010A2A25FA1EDDA7CNBDA81EE66224C</additional3>
        <additional4>0000000000000000000000000000000000000\000000000000000000</additional4>
      </name>
      <name>
        <Time>6355</Time>
        <Client>,EGZ</Client>
        <User>2023012711283700</User>
        <number>900</number>
        <processid>DDIC</processid>
        <program>S000</program>
        <randomnumber>R_JR_BTCJOBS_GENERATOR</randomnumber>
        <processidandwp />
        <userclient>1</userclient>
        <transactionid>25737,00088,B5</transactionid>
        <additional1>SAP_WORKFLOW_WIM_ACTION/11283700&amp;JOB_CLOSE&amp;&amp;&amp;&amp;</additional1>
        <additional2>43AE5E5C16990580E0063BBEAE21BEA8</additional2>
        <additional3>42010A2A25FA1EDDA7CNBDA81EE66224C</additional3>
        <additional4>0000000000000000000000000000000000000\000000000000000000s</additional4>
      </name>
    </Processes>
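
The output above has 14 fields per event, because the end of the first physical line and the start of the second are fused into additional3. The expected output in the question instead shows them as separate additional3 and additional4 values (15 fields). If the line break should count as a field separator, one possible variation is to join the two lines with "|" rather than an empty string. This is only a sketch under that assumption; `rows_to_xml` and `NODE_NAMES` are hypothetical names, and the sample is the first event from the question fed in through `io.StringIO` instead of data.csv.

```python
import io
import xml.etree.ElementTree as ET

NODE_NAMES = ['Time', 'Client', 'User', 'number', 'processid',
              'program', 'randomnumber', 'processidandwp', 'userclient',
              'transactionid', 'additional1', 'additional2', 'additional3',
              'additional4', 'additional5']


def rows_to_xml(lines):
    """Hypothetical helper: build <Processes> from an iterable of text lines,
    treating each pair of lines as one event and the line break as a '|'."""
    root = ET.Element('Processes')
    it = iter(lines)
    for first in it:
        second = next(it, '')  # tolerate an odd number of lines
        # Join the non-empty halves with '|' so the line break acts as a delimiter
        parts = [p for p in (first.strip(), second.strip()) if p]
        fields = '|'.join(parts).split('|')
        event = ET.SubElement(root, 'name')
        # zip() stops at the shorter sequence, so surplus fields are ignored
        for tag, value in zip(NODE_NAMES, fields):
            ET.SubElement(event, tag).text = value
    return root


# First event from the question, split over two physical lines
sample = io.StringIO(
    "6354|,EGZ|2023012711283700|900|DDIC|S000|R_JR_BTCJOBS_GENERATOR||1|"
    "25737,00088,B5|SAP_WORKFLOW_WIM_ACTION/11283700&JOB_CLOSE&&&&|"
    "43AE5E5C16990580E0063BBEAE21BEA8|42010A2A25FA1EDDA7CN\n"
    "BDA81EE66224C|0000000000000000000000000000000000000\\000000000000000000\n"
)
root = rows_to_xml(sample)
ET.indent(root, space="  ")  # requires Python 3.9+
ET.dump(root)
```

Using `zip()` over the tag names instead of indexing means a record with more than 15 fields is truncated rather than raising an IndexError; whether silent truncation is acceptable depends on the data.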
    

    Verification that source data.csv has 4 lines:

    wc -l data.csv 
           4 data.csv
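
ET.dump prints to standard output and does not emit the `<?xml ...?>` declaration that appears in the question's expected output. If the result is meant to go into a file with that declaration, `ElementTree.write` with `xml_declaration=True` is one way to do it. The sketch below uses a small placeholder tree in place of the one built from data.csv, and the output file name `processes.xml` in the temp directory is an assumption made for illustration.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Placeholder tree standing in for the one built from data.csv
root = ET.Element('Processes')
event = ET.SubElement(root, 'name')
ET.SubElement(event, 'Time').text = '6354'

ET.indent(root, space="  ")  # requires Python 3.9+

# Hypothetical output location; any writable path works
out_path = os.path.join(tempfile.gettempdir(), 'processes.xml')
ET.ElementTree(root).write(out_path, encoding='utf-8', xml_declaration=True)

# Read the file back to inspect what was written
with open(out_path, encoding='utf-8') as f:
    xml_text = f.read()
print(xml_text)
```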