I have a 2.6 gigabyte text file containing a dump of a database table, and I'm trying to pull it into a logical structure so the fields can all be uniqued. The code I'm using to do this is here:
class Targetfile
  include Enumerable

  attr_accessor :inputfile, :headers, :input_array

  def initialize(file)
    @input_array = false
    @inputfile = File.open(file, 'r')
    @x = @inputfile.each.count
  end

  def get_headers
    @y = 1
    @inputfile.rewind
    @input_array = Array.new
    @headers = @inputfile.first.chomp.split(/\t/)
    @inputfile.each do |line|
      print "\n#{@y} / #{@x}"
      @y += 1
      self.assign_row(line)
    end
  end

  def assign_row(line)
    row_array = line.chomp.encode!('UTF-8', 'UTF-8', :invalid => :replace).split(/\t/)
    @input_array << Hash[ @headers.zip(row_array) ]
  end

  def send_build
    @input_array || self.get_headers
  end

  def each
    self.send_build.each { |row| yield row }
  end
end
The class is initialized successfully and I am left with a Targetfile object.
The problem is that when I then call the get_headers method, which converts the file into an array of hashes, it begins slowing down immediately.
This isn't noticeable to my eyes until around item number 80,000, but then it becomes apparent that every 3,000-4,000 lines of the file, some sort of pause is occurring. Each time it occurs, that pause takes slightly longer, until by the millionth line it's taking longer than 30 seconds.
For practical purposes, I can just chop up the file to avoid this problem, then combine the resulting lists and unique -that- to get my final outputs.
From a curiosity standpoint, however, I'm unsatisfied.
Can anyone tell me why this pause is occurring, why it gets longer, and if there's any way to avoid it elegantly? Really I just want to know what it is and why it happens, because now that I've noticed it, I see it in a lot of other Ruby scripts I run, both on this computer and on others.
This is the infamous garbage collector -- Ruby's memory management mechanism.
Note: Ruby, at least MRI, isn't a high-performance language.
The garbage collector runs whenever memory starts to run out. It pauses the execution of the program to deallocate any objects that can no longer be accessed, and since it only runs when memory is getting low, you see it periodically rather than constantly. The pauses get longer because each collection has to examine every live object: you keep appending hashes to @input_array, so the set of live objects keeps growing, and each run takes more time while reclaiming less memory.
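You can watch the collector at work by sampling `GC.stat` around an allocation-heavy loop. A minimal sketch (the exact keys in the hash vary by MRI version, but `:count`, the total number of GC runs, is present on all modern ones):

```ruby
# Sample the garbage collector's run counter around an allocation-heavy loop.
# GC.stat returns a hash of collector counters; :count is the total number
# of GC runs since the process started.
before = GC.stat[:count]

rows = []
100_000.times { |i| rows << { "id" => i.to_s } }

after = GC.stat[:count]
puts "GC ran #{after - before} times while building #{rows.size} rows"
```

The more live data you accumulate between samples, the more runs you'll see, and the longer each one takes.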
There's nothing you can do to avoid this, except write more memory-efficient code or rewrite in a language that has better/manual memory management.
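In this case, "more memory-efficient" mostly means not materializing all 2.6 GB of rows as hashes at once. Since the end goal is uniquing the fields, you can stream the file line by line and keep only a `Set` per column. A sketch, assuming a tab-separated file with a header row (the `unique_columns` name is made up for illustration; `String#scrub!` needs Ruby >= 2.1):

```ruby
require 'set'

# Stream a tab-separated dump and collect the unique values per column,
# holding only one line of the file in memory at a time.
def unique_columns(path)
  File.open(path, 'r') do |f|
    headers = f.readline.chomp.split("\t")
    uniques = headers.map { Set.new }
    f.each_line do |line|
      line.scrub!  # replace invalid UTF-8 bytes instead of raising
      line.chomp.split("\t").each_with_index do |field, i|
        uniques[i] << field if i < headers.size
      end
    end
    headers.zip(uniques).to_h  # { "column_name" => Set of values }
  end
end
```

Because the working set stays roughly constant (one line plus the sets of distinct values), the collector's mark phase stays cheap and the pauses stop growing.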
Also, your OS may be paging. Do you have enough physical memory for this kind of task?