python regex awk sed data-cleaning

Data Cleanup with Regex


I have a very large number of very large files.

Each file contains lines like this:

uuid1 (tab) data1 (vtab) data2 (vtab) ... (vtab) dataN
uuid2 (tab) data1' (vtab) data2' (vtab) data3' (vtab) ... (vtab) dataN'
....

where N will be different for every line. The result needs to look like:

uuid1 (tab) data1
uuid1 (tab) data2
....
uuid1 (tab) dataN
uuid2 (tab) data1'
uuid2 (tab) data2'
uuid2 (tab) data3'
...  
uuid2 (tab) dataN'
....

I have a regex that does the job, replacing:

^([abcdef0123456789]{8}-[abcdef0123456789]{4}-[abcdef0123456789]{4}-[abcdef0123456789]{4}-[abcdef0123456789]{12})\t(.+?)\x0B

with:

\1\t\2\n\1\t

but it's slow, and it needs repeated applications, because each pass splits off only the first data item on a line.
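To illustrate the repeated applications, here is a minimal sketch in Python. It assumes the same pattern with the hex class abbreviated to `[0-9a-f]`; the function name and the sample UUID are made up for the example. It counts how many passes are needed before no substitution happens.

```python
import re

# The pattern from above, with the hex character class abbreviated.
PATTERN = re.compile(
    r'^([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})\t(.+?)\x0b',
    re.MULTILINE,
)

def split_by_repeated_substitution(text):
    """Apply the substitution until nothing changes; return (text, passes)."""
    passes = 0
    while True:
        # Each pass only peels the first item off every matching line.
        text, count = PATTERN.subn(r'\1\t\2\n\1\t', text)
        if count == 0:
            return text, passes
        passes += 1

# Hypothetical sample: one UUID followed by four vtab-separated items.
sample = '123e4567-e89b-12d3-a456-426614174000\ta\x0bb\x0bc\x0bd\n'
result, passes = split_by_repeated_substitution(sample)
print(passes)
```

A line with four items takes three passes here, and every pass rescans the whole input, which is presumably where the time goes on large files.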

Is there a quicker programmatic way to perform this across all the files?

Tools available in the toolbox: Unix tools (sed, awk, etc.), Python, possibly Perl.

Not looking for a religious war, just a pragmatic approach.

Additional Info

Here's the complete script I used, based on Kristof's script, for handling the outer loop:

#!/usr/bin/python

import os
import uuid

def processFile( in_filename ):

  out_filename = os.path.splitext(in_filename)[0] + '.result.txt'

  with open(in_filename) as f_in:
    with open(out_filename, 'w') as f_out:
      for line in f_in:
        try:
          # Retrieve the line and split into UUID and data
          line_uuid, data = line.split('\t', 1)
          # Validate UUID
          uuid.UUID(line_uuid)
        except ValueError:
          # Ignore this line
          continue
        # Write each individual piece of data to a separate line
        for data_part in data.rstrip().split('\x0b'):
          f_out.write(line_uuid + '\t' + data_part  + '\n')

for i in os.listdir(os.getcwd()):
  # Skip output of earlier runs, which also ends in ".txt"
  if i.endswith(".txt") and not i.endswith(".result.txt"):
    print(i)
    processFile( i )

Solution

  • This is an implementation in Python (tested in 3.5). I haven't tried it on a large data set, so I'll leave that for you to try out. It reads each file once, line by line, so no repeated passes are needed.

    import uuid
    
    in_filename = 'test.txt'
    out_filename = 'parsed.txt'
    
    with open(in_filename) as f_in:
        with open(out_filename, 'w') as f_out:
            for line in f_in:
                try:
                    # Retrieve the line and split into UUID and data
                    line_uuid, data = line.split('\t', maxsplit=1)
                    # Validate UUID
                    uuid.UUID(line_uuid)
                except ValueError:
                    # Ignore this line
                    continue
                # Write each individual piece of data to a separate line
                for data_part in data.rstrip().split('\x0b'):
                    f_out.write(line_uuid + '\t' + data_part  + '\n')
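As a quick way to check the behaviour without touching real files, the same logic can be wrapped in a generator and run on in-memory data. This is only a sketch: the name `explode_lines` and the sample lines are made up for illustration. Note that `uuid.UUID` is more lenient than the regex in the question (it also accepts uppercase hex and UUIDs without hyphens, for example).

```python
import io
import uuid

def explode_lines(lines):
    """Yield one 'uuid<TAB>item' line per vertical-tab-separated item."""
    for line in lines:
        try:
            # Split only on the first tab; a line without a tab raises ValueError
            line_uuid, data = line.split('\t', 1)
            # Raises ValueError if the first field is not a valid UUID
            uuid.UUID(line_uuid)
        except ValueError:
            continue
        for data_part in data.rstrip().split('\x0b'):
            yield line_uuid + '\t' + data_part + '\n'

# Hypothetical input: one valid line and two lines that should be skipped.
sample = io.StringIO(
    '123e4567-e89b-12d3-a456-426614174000\ta\x0bb\x0bc\n'
    'not-a-uuid\tx\x0by\n'
    'line without a tab\n'
)
output = ''.join(explode_lines(sample))
print(output, end='')
```

Because the generator accepts any iterable of lines, an open file object can be passed in directly and the result written out with `f_out.writelines(explode_lines(f_in))`.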