Tags: file, join, tcl, large-data

Processing large files using Tcl


I have some information in two large files.
One of them (file1.txt, ~4 million lines) contains all object names (which are unique) and their types.
The other (file2.txt, ~2 million lines) contains some object names (which can be duplicated) and values assigned to them.
So, in file1.txt I have something like this:

objName1 objType1
objName2 objType2
objName3 objType3
...

And in file2.txt I have:

objName3 val3_1
objName3 val3_2
objName4 val4
...

For all the objects in file2.txt, I need to output the object names, their types, and the values assigned to them in a single file, like below:

objType3 val3_1 "objName3"
objType3 val3_2 "objName3"
objType4 val4 "objName4"
...

Previously, object names in file2.txt were supposed to be unique, so I implemented a solution where I read all the data from both files, save it into Tcl arrays, and then iterate over the larger array, checking whether an object with the same name exists in the smaller array; if so, I write the needed information to a separate file. But this runs far too long (> 10 hours, and it hasn't completed yet).
How can I improve my solution, or is there another way to do this?

EDIT:
Actually, I don't have file1.txt; I obtain that data with a procedure and write it into a Tcl array. I run a procedure to get the object types and save them into a Tcl array; then I read file2.txt and save its data into another Tcl array; then I iterate over the items of the first array, and if an object name matches some object in the second (object-values) array, I write the info to the output file and erase that element from the second array. Here is a piece of the code that I'm running:

set outFileName "output.txt"
if {[catch {open $outFileName "w"} fid]} {
    puts "ERROR: Failed to open file '$outFileName' for writing: $fid"
    exit 1
}


# get object types
set TIME_start [clock clicks -milliseconds]
array set objTypeMap [list]
# here is some proc that fills up objTypeMap
set TIME_taken [expr {[clock clicks -milliseconds] - $TIME_start}]
puts "Info: Object types are found. Elapsed time $TIME_taken"

# read file2.txt
set TIME_start [clock clicks -milliseconds]
set file2 [lindex $argv 5]
if {[catch {set fp [open $file2 r]} errMsg]} {
    puts "ERROR: Failed to open file '$file2' for reading: $errMsg"
    exit 1
}

set objValData [read $fp]
close $fp
# tcl list containing lines of file2.txt
set objValData [split $objValData "\n"]
# remove last empty line
set objValData [lreplace $objValData end end]
array set objValMap [list]
foreach item $objValData {
    # split each line at the first space into an object name and a value
    set objName [string range $item 0 [expr {[string first " " $item] - 1}]]
    set objValue [string range $item [expr {[string first " " $item] + 1}] end]
    set objValMap($objName) $objValue
}
# clear objValData
unset objValData

set TIME_taken [expr {[clock clicks -milliseconds] - $TIME_start}]
puts "Info: Object value data is read and processed. Elapsed time $TIME_taken"

# write to file
set TIME_start [clock clicks -milliseconds]
foreach { objName objType } [array get objTypeMap] {
    if { [array size objValMap] == 0 } {
        break
    }
    if { [info exists objValMap($objName)] } {
        set objValue $objValMap($objName)
        puts $fid "$objType $objValue \"$objName\""
        unset objValMap($objName)
    }
}

if { [array size objValMap] != 0 } {
    foreach { objName objVal } [array get objValMap] {
        puts "WARNING: Can not find object $objName type, skipped..."
    }
}
close $fid

set TIME_taken [expr {[clock clicks -milliseconds] - $TIME_start}]
puts "Info: Output is created. Elapsed time $TIME_taken"

It seems that for the last step (writing to the file) there are ~8 * 10^12 iterations to do, which is not realistic to complete in a reasonable time: I tried running a for loop that just prints the iteration index, and ~850 * 10^6 iterations took ~30 minutes, so at that rate the full 8 * 10^12 iterations would take roughly 4,700 hours.
So, there should be another solution.
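
As an aside, for quick loop-speed measurements like this, Tcl's built-in time command reports the average cost per iteration directly; a minimal sketch (the loop body and repetition count here are arbitrary):

set i 0
# Run the body 1000000 times and print "N microseconds per iteration".
puts [time {incr i} 1000000]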

EDIT: It seems the reason was some unfortunate hashing behaviour for the file2.txt map: when I shuffled the lines in file2.txt, I got results in about 3 minutes.
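
For anyone wanting to reproduce this, shuffling the lines can be done in pure Tcl; a minimal Fisher-Yates sketch, where the input and output file names are just placeholders:

# Read all lines of the input file into a list.
set f [open "file2.txt" r]
set lines [split [string trimright [read $f] "\n"] "\n"]
close $f

# Fisher-Yates shuffle of the list in place.
for {set i [expr {[llength $lines] - 1}]} {$i > 0} {incr i -1} {
    set j [expr {int(rand() * ($i + 1))}]
    set tmp [lindex $lines $i]
    lset lines $i [lindex $lines $j]
    lset lines $j $tmp
}

# Write the shuffled lines back out.
set f [open "file2_shuffled.txt" w]
puts $f [join $lines "\n"]
close $f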


Solution

  • So… file1.txt is describing a mapping and file2.txt is the list of things to process and annotate? The right thing is to load the mapping into an array or dictionary where the key is the part that you will look things up by, and to then go through the other file line-by-line. That keeps the amount of data in memory down, but it's worth holding the whole mapping that way anyway.

    # We're doing many iterations, so worth doing proper bytecode compilation 
    apply {{filename1 filename2 filenameOut} {
        # Load the mapping; uses memory proportional to the file size
        set f [open $filename1]
        while {[gets $f line] >= 0} {
            regexp {^(\S+)\s+(.*)} $line -> name type
            set types($name) $type
        }
        close $f
    
        # Now do the streaming transform; uses a small fixed amount of memory
        set fin [open $filename2]
        set fout [open $filenameOut "w"]
        while {[gets $fin line] >= 0} {
        # Assume that the mapping is probably total; if a line fails to match,
        # we print it as it was before. You might prefer a different strategy here.
            catch {
                regexp {^(\S+)\s+(.*)} $line -> name info
                set line [format "%s %s \"%s\"" $types($name) $info $name]
            }
            puts $fout $line
        }
        close $fin
        close $fout
    
        # All memory will be collected at this point
    }} "file1.txt" "file2.txt" "fileProcessed.txt"
    

    Now, if the mapping is very large, so large that it doesn't fit in memory, then you might be better off building file indices and the like, but frankly at that point you're better off getting familiar with SQLite or some other database.
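
    For scale like that, SQLite's own Tcl binding (package require sqlite3) makes the join trivial; a rough sketch, where the database, table, and file names are all just placeholders:

    package require sqlite3
    sqlite3 db "join.db"
    db eval {
        CREATE TABLE types (name TEXT PRIMARY KEY, type TEXT);
        CREATE TABLE vals (name TEXT, value TEXT);
    }
    # Load both files inside one transaction; this is much faster than
    # committing each INSERT individually.
    db transaction {
        set f [open "file1.txt"]
        while {[gets $f line] >= 0} {
            regexp {^(\S+)\s+(.*)} $line -> name type
            db eval {INSERT OR REPLACE INTO types VALUES($name, $type)}
        }
        close $f
        set f [open "file2.txt"]
        while {[gets $f line] >= 0} {
            regexp {^(\S+)\s+(.*)} $line -> name value
            db eval {INSERT INTO vals VALUES($name, $value)}
        }
        close $f
    }
    # Let the database do the join and stream the results out.
    set fout [open "fileProcessed.txt" "w"]
    db eval {
        SELECT t.type AS type, v.value AS value, v.name AS name
        FROM vals AS v JOIN types AS t ON t.name = v.name
    } {
        puts $fout [format "%s %s \"%s\"" $type $value $name]
    }
    close $fout
    db close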