python, perl, awk, text-processing, flat-file

Processing text from a non-flat file (to extract information as if it *were* a flat file)


I have a longitudinal data set generated by a computer simulation that can be represented by the following two tables (the 'var' columns are variables):

time subject var1 var2 var3
t1   subjectA  ...
t2   subjectB  ...

and

subject   name
subjectA  nameA
subjectB  nameB

However, the simulation writes its data file in a format similar to the following:

time t1 
  description
subjectA nameA
  var1 var2 var3
subjectB nameB
  var1 var2 var3
time t2
  description
subjectA nameA
  var1 var2 var3
subjectB nameB
  var1 var2 var3
...(and so on)

I have been using a Python script to convert this output into a flat text file so that I can import it into R, Python, or SQL, or awk/grep it to extract information. An example of the kind of information I want from a single query (in SQL notation, after the data is converted to a table) is shown below:

SELECT var1, var2, var3 FROM datatable WHERE subject='subjectB'

I wonder if there is a more efficient solution: each data file can be ~100MB (and I have hundreds of them), and creating the flat text file is time-consuming and takes up additional hard drive space with redundant information. Ideally, I would query the original data set directly to extract the information I want, without creating the extra flat text file. Is there a simpler awk/perl solution for such tasks? I'm quite proficient at text processing in Python, but my awk skills are rudimentary and I have no working knowledge of Perl; I wonder if these or other domain-specific tools can provide a better solution.

Thanks!

Postscript: Wow, thanks to all! I am sorry that I cannot accept everyone's answer.

@FM: Thanks. My Python script resembles your code without the filtering step, but your organization is clean.

@PP: I thought I was already proficient in grep, but apparently not! This is very helpful, but I think grepping becomes difficult when mixing the 'time' into the output (which I failed to include as a possible extraction scenario in my example; that's my bad).

@ghostdog74: This is just fantastic, but modifying the line to get 'subjectA' was not straightforward (though I'll be reading up more on awk in the meantime, and hopefully I'll grok it later).

@weismat: Well stated.

@S.Lott: This is extremely elegant and flexible. I was not asking for a Python(ic) solution, but this fits cleanly into the parse, filter, and output framework suggested by PP, and it is flexible enough to accommodate a number of different queries to extract different types of information from this hierarchical file.

Again, I am grateful to everyone - thanks so much.


Solution

  • This is what Python generators are all about.

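    subjectNameSet = {'subjectA', 'subjectB'}  # subjects to extract (example values from the question)

    def read_as_flat( someFile ):
        line_iter = iter(someFile)
        time_header = None
        for line in line_iter:
            words = line.split()
            if not words:
                continue  # skip blank lines
            if words[0] == 'time':
                # the "time" line is followed by its description line
                description = next(line_iter).strip()
                time_header = [ words[1], description ]
            elif words[0] in subjectNameSet:
                # the line after a subject line holds var1 var2 var3
                data = next(line_iter).split()
                yield time_header + data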
    

    You can use this like any standard Python iterator; each iteration produces one flat row:

    for time, description, var1, var2, var3 in read_as_flat( someFile ):
        print( time, var1, var2, var3 )  # or any other per-row processing
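
To connect this back to the SQL-style query in the question, here is a minimal sketch of the same generator idea with the subject kept in each row, so the filter can be applied while streaming and no intermediate flat file is written. The name `iter_records`, the dict layout, and the sample values are hypothetical; the sketch also assumes that subject lines are unindented and contain exactly two fields, as in the sample layout above.

```python
import io

def iter_records(lines):
    """Yield one dict per subject block: time, description, subject, name, vars."""
    line_iter = iter(lines)
    time = description = None
    for line in line_iter:
        words = line.split()
        if not words:
            continue  # skip blank lines
        if words[0] == 'time':
            # "time tN" is followed by an indented description line
            time = words[1]
            description = next(line_iter).strip()
        elif time is not None and not line[0].isspace() and len(words) == 2:
            # Assumption: a subject line is unindented and has exactly two fields
            subject, name = words
            values = next(line_iter).split()  # the indented var1 var2 var3 line
            yield {'time': time, 'description': description,
                   'subject': subject, 'name': name, 'vars': values}

# Made-up sample in the layout described in the question
sample = io.StringIO(
    "time t1\n"
    "  first step\n"
    "subjectA nameA\n"
    "  1 2 3\n"
    "subjectB nameB\n"
    "  4 5 6\n"
    "time t2\n"
    "  second step\n"
    "subjectA nameA\n"
    "  7 8 9\n"
    "subjectB nameB\n"
    "  10 11 12\n"
)

# Roughly: SELECT time, var1, var2, var3 FROM datatable WHERE subject='subjectB'
rows = [(r['time'], *r['vars']) for r in iter_records(sample)
        if r['subject'] == 'subjectB']
for row in rows:
    print(*row)
```

Because the generator reads one line at a time, the same loop should work on an open file object for a large data file; only the rows that pass the filter are kept in memory.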