Tags: ruby, performance, file-io, class-method

Why does my Ruby script slow down over time?


I have a 2.6 gigabyte text file containing a dump of a database table, and I'm trying to pull it into a logical structure so the fields can all be uniqued. The code I'm using to do this is here:

class Targetfile
  include Enumerable

  attr_accessor :inputfile, :headers, :input_array

  def initialize(file)
    @input_array = false
    @inputfile = File.open(file, 'r')
    @x = @inputfile.each.count
  end

  def get_headers
    @y = 1
    @inputfile.rewind
    @input_array = Array.new
    @headers = @inputfile.first.chomp.split(/\t/)
    @inputfile.each do |line|
      print "\n#{@y} / #{@x}"
      @y+=1
      self.assign_row(line)
    end
  end

  def assign_row(line)
    row_array = line.chomp.encode!('UTF-8', 'UTF-8', :invalid => :replace).split(/\t/)
    @input_array << Hash[ @headers.zip(row_array) ]
  end

  def send_build
    @input_array || self.get_headers
  end

  def each
    self.send_build.each {|row| yield row}
  end

end

The class is initialized successfully and I am left with a Targetfile class object.

The problem is that when I then call the get_headers method, which converts the file into an array of hashes, it begins slowing down immediately.

This isn't noticeable to my eyes until around item number 80,000, but then it becomes apparent that every 3-4,000 lines of the file, some sort of pause is occurring. That pause, each time it occurs, takes slightly longer, until by the millionth line, it's taking longer than 30 seconds.

For practical purposes, I can just chop up the file to avoid this problem, then combine the resulting lists and unique -that- to get my final outputs.

From a curiosity standpoint, however, I'm unsatisfied.

Can anyone tell me why this pause is occurring, why it gets longer, and if there's any way to avoid it elegantly? Really I just want to know what it is and why it happens, because now that I've noticed it, I see it in a lot of other Ruby scripts I run, both on this computer and on others.


Solution

  • What you're seeing is almost certainly the infamous garbage collector (GC), Ruby's automatic memory management mechanism.

    Note: It's worth mentioning that Ruby, or at least MRI, isn't a high-performance language.

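    The garbage collector doesn't run on a fixed schedule; it runs whenever the interpreter starts to run out of free memory for new objects. That's why you're seeing it periodically, every few thousand lines. During a collection, the program is paused while the collector finds and deallocates any objects that can no longer be accessed. Because your script keeps every row in @input_array, almost nothing it allocates ever becomes garbage, and a full collection has to traverse every live object. The set of live objects keeps growing, so each pause takes a little longer than the last.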
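
    As a rough way to see this for yourself, the sketch below retains a growing number of small hashes, much as @input_array does, and times one forced full collection at each size. It is not from the original answer: the helper name time_full_gc and the row counts are made up for illustration, and the timings will vary with the Ruby version and the machine.

    ```ruby
    require 'benchmark'

    # Hypothetical helper (illustration only): keep `row_count` small hashes
    # reachable, then time a single forced full garbage collection.
    def time_full_gc(row_count)
      retained = Array.new(row_count) { |i| { 'id' => i.to_s, 'name' => "row #{i}" } }
      seconds = Benchmark.realtime { GC.start } # `retained` is still live here
      { rows: retained.size, seconds: seconds, gc_runs: GC.count }
    end

    if __FILE__ == $PROGRAM_NAME
      # Arbitrary sizes; the more objects are retained, the more the collector must walk.
      [10_000, 50_000, 250_000].each do |n|
        result = time_full_gc(n)
        puts format('%7d rows retained: full GC took %.4f s (total GC runs: %d)',
                    result[:rows], result[:seconds], result[:gc_runs])
      end
    end
    ```

    On most setups the reported time should grow with the number of retained rows, which is the same trend as the lengthening pauses in the question.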

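    You can't avoid the collector entirely. What you can do is write more memory-efficient code (for example, process the file line by line instead of holding every row in memory at once), or rewrite the script in a language that has faster or manual memory management.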
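
    As one possible take on "more memory-efficient code", here is a minimal sketch that streams the file and keeps only the distinct values of each column, rather than an array with one hash per row. It assumes the goal is just to unique each field, that the file is tab-separated with a header line (as in the question), and Ruby 2.1 or later for String#scrub. The method name unique_values_by_column is hypothetical.

    ```ruby
    require 'set'

    # Hypothetical helper (not from the question): collect the distinct values
    # of each column of a tab-separated file without retaining the rows.
    def unique_values_by_column(path)
      headers = nil
      uniques = nil

      # File.foreach reads one line at a time, so each row becomes garbage
      # as soon as its values have been added to the sets.
      File.foreach(path) do |line|
        # scrub replaces invalid byte sequences; -1 keeps trailing empty fields
        fields = line.scrub.chomp.split("\t", -1)

        if headers.nil?
          headers = fields
          uniques = headers.map { |header| [header, Set.new] }.to_h
          next
        end

        headers.zip(fields) do |header, value|
          uniques[header] << value unless value.nil? # short rows leave gaps
        end
      end

      uniques || {}
    end
    ```

    Calling unique_values_by_column('dump.tsv') (the file name is a placeholder) returns a hash that maps each header to a Set of its values. Memory use then grows with the number of distinct values rather than the number of rows, so the collector has far fewer live objects to walk.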

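    Also, your OS may be paging: an array of hashes built from a 2.6 GB file needs considerably more than 2.6 GB of RAM, because every row and every field becomes a separate Ruby object. Do you have enough physical memory for this kind of task?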