I have a very large number of very large files.
Each file contains lines like this:
uuid1 (tab) data1 (vtab) data2 ... dataN
uuid2 (tab) data1' (vtab) data2' (vtab) data3' (vtab) ... dataN'
....
where N will be different for every line. The result needs to look like:
uuid1 (tab) data1
uuid1 (tab) data2
....
uuid1 (tab) dataN
uuid2 (tab) data1'
uuid2 (tab) data2'
uuid2 (tab) data3'
...
uuid2 (tab) dataN'
....
I have a regex that does the job, replacing:
^([abcdef0123456789]{8}-[abcdef0123456789]{4}-[abcdef0123456789]{4}-[abcdef0123456789]{4}-[abcdef0123456789]{12})\t(.+?)\x0B
with:
\1\t\2\n\1\t
but it's slow, and needs repeated applications, obviously.
Is there a quicker programmatic way to perform this across all the files?
Tools available in the toolbox: unix tools (sed, awk etc), python, possibly perl.
Not looking for a religious war, just a pragmatic approach.
Additional Info
Here's the complete script I used, based on Kristof's script, for handling the outer loop:
#!/usr/bin/python
import os
import uuid

def processFile(in_filename):
    out_filename = os.path.splitext(in_filename)[0] + '.result.txt'
    with open(in_filename) as f_in:
        with open(out_filename, 'w') as f_out:
            for line in f_in:
                try:
                    # Retrieve the line and split into UUID and data
                    # (split on the first tab only, so tabs inside the data are kept)
                    line_uuid, data = line.split('\t', 1)
                    # Validate UUID
                    uuid.UUID(line_uuid)
                except ValueError:
                    # Ignore this line
                    continue
                # Write each individual piece of data to a separate line
                for data_part in data.rstrip().split('\x0b'):
                    f_out.write(line_uuid + '\t' + data_part + '\n')

# Process every .txt file in the current directory
for i in os.listdir(os.getcwd()):
    if i.endswith(".txt"):
        print(i)
        processFile(i)
This is an implementation in Python (tested in 3.5). I haven't tried it on a large data set; I'll leave that for you to try out.
import uuid

in_filename = 'test.txt'
out_filename = 'parsed.txt'

with open(in_filename) as f_in:
    with open(out_filename, 'w') as f_out:
        for line in f_in:
            try:
                # Retrieve the line and split into UUID and data
                line_uuid, data = line.split('\t', maxsplit=1)
                # Validate UUID
                uuid.UUID(line_uuid)
            except ValueError:
                # Ignore this line
                continue
            # Write each individual piece of data to a separate line
            for data_part in data.rstrip().split('\x0b'):
                f_out.write(line_uuid + '\t' + data_part + '\n')
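
If you want to sanity-check the splitting logic before running it over the big files, the per-line step can be pulled out into a small generator and tried on a couple of in-memory lines. This is only a sketch: the name explode_line and the sample UUID are made up for illustration, and it assumes the same rules as the script above (split on the first tab, skip lines without a valid UUID, split the data on vertical tabs).

```python
import uuid


def explode_line(line):
    """Yield one 'uuid<TAB>data' output line per vertical-tab-separated field.

    Hypothetical helper for illustration; it mirrors the loop body above.
    """
    try:
        # Split on the first tab only: UUID on the left, data on the right
        line_uuid, data = line.split('\t', 1)
        # Raises ValueError if the first column is not a valid UUID
        uuid.UUID(line_uuid)
    except ValueError:
        # Malformed line (no tab, or bad UUID): produce nothing
        return
    for data_part in data.rstrip().split('\x0b'):
        yield line_uuid + '\t' + data_part + '\n'


# Made-up sample input: one valid line with three fields, one invalid line
sample = [
    '123e4567-e89b-12d3-a456-426614174000\ta\x0bb\x0bc\n',
    'not-a-uuid\tx\x0by\n',
]
result = [out for line in sample for out in explode_line(line)]
for out in result:
    print(repr(out))
```

The valid line should come out as three lines sharing the same UUID, and the invalid one should be dropped. One thing to watch when testing: don't build the sample with str.splitlines(), because that also treats the vertical tab as a line break, whereas iterating over an open file does not.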