Parse large JSON hash with ruby-yajl?


I have a large file (>50 MB) that contains a single JSON hash, something like:

{ 
  "obj1": {
    "key1": "val1",
    "key2": "val2"
  },
  "obj2": {
    "key1": "val1",
    "key2": "val2"
  }
  ...
}

Rather than parsing the entire file and then taking, say, the first ten elements, I'd like to parse the items in the hash one at a time. I don't actually care about the keys (e.g. obj1).

If I convert the above to a sequence of separate top-level objects, like this:

  {
    "key1": "val1",
    "key2": "val2"
  }
  {
    "key1": "val1",
    "key2": "val2"
  }

I can easily achieve what I want using Yajl streaming:

require 'yajl'

count = 10
File.open(path_to_file) do |io|
  # Yajl yields each complete top-level object as soon as it is parsed
  Yajl::Parser.parse(io) do |obj|
    puts "Parsed: #{obj}"
    count -= 1
    break if count == 0
  end
end

Is there a way to do this without having to alter the file? Some sort of callback in Yajl maybe?


Solution

  • I ended up solving this using JSON::Stream (the json-stream gem), which has callbacks for start_document, start_object, etc.

    I gave my 'parser' a to_enum method which emits each 'Resource' object as soon as it is parsed. Note that ResourceCollectionNode only really matters if you parse the JSON stream to the end, and ResourceNode is a subclass of ObjectNode for naming purposes only, though I might just get rid of it:

    require 'json/stream'

    class ResourceParser
      METHODS = %w[start_document end_document start_object end_object start_array end_array key value]
    
      attr_reader :result
    
      def initialize(io, chunk_size = 1024)
        @io = io
        @chunk_size = chunk_size
        @parser = JSON::Stream::Parser.new
    
        # register callback methods
        METHODS.each do |name|
          @parser.send(name, &method(name))
        end 
      end
    
      def to_enum
        Enumerator.new do |yielder|
          @yielder = yielder
          begin
            until @io.eof?
              # puts "READING CHUNK"
              chunk = @io.read(@chunk_size)
              @parser << chunk
            end
          ensure
            @yielder = nil
          end
        end
      end
    
      def start_document
        @stack = []
        @result = nil
      end
    
      def end_document
        # @result = @stack.pop.obj
      end
    
      def start_object
        if @stack.size == 0
          @stack.push(ResourceCollectionNode.new)
        elsif @stack.size == 1
          @stack.push(ResourceNode.new)
        else
          @stack.push(ObjectNode.new)
        end
      end
    
      def end_object
        if @stack.size == 2
          node = @stack.pop
          #puts "Stack depth: #{@stack.size}. Node: #{node.class}"
          @stack[-1] << node.obj
    
          # puts "Parsed complete resource: #{node.obj}"
          @yielder << node.obj
    
        elsif @stack.size == 1
          # puts "Parsed all resources"
          @result = @stack.pop.obj
        else
          node = @stack.pop
          # puts "Stack depth: #{@stack.size}. Node: #{node.class}"
          @stack[-1] << node.obj
        end
      end
    
      def end_array
        node = @stack.pop
        @stack[-1] << node.obj
      end
    
      def start_array
        @stack.push(ArrayNode.new)
      end
    
      def key(key)
        # puts "Stack depth: #{@stack.size} KEY: #{key}"
        @stack[-1] << key
      end
    
      def value(value)
        node = @stack[-1]
        node << value
      end
    
      class ObjectNode
        attr_reader :obj
    
        def initialize
          @obj, @key = {}, nil
        end
    
        def <<(node)
          if @key
            @obj[@key] = node
            @key = nil
          else
            @key = node
          end
          self
        end
      end
    
      class ResourceNode < ObjectNode
      end
    
      # Node that contains all the resources - a Hash keyed by url
      class ResourceCollectionNode < ObjectNode
        def <<(node)
          if @key
            @obj[@key] = node
            # puts "Completed Resource: #{@key} => #{node}"
            @key = nil
          else
            @key = node
          end
          self
        end
      end
    
      class ArrayNode
        attr_reader :obj
    
        def initialize
          @obj = []
        end
    
        def <<(node)
          @obj << node
          self
        end
      end
    
    end
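
The node classes rely on one small trick: whatever is pushed onto an ObjectNode with << is treated alternately as a key and then as the value for that key. A minimal sketch of that idea, using only plain Ruby and a hand-written event sequence in place of the real JSON::Stream callbacks (the sequence and the sample data are assumptions for illustration):

```ruby
# Stripped-down copy of the ObjectNode idea: items pushed with << alternate
# between "key" and "value for the pending key".
class ObjectNode
  attr_reader :obj

  def initialize
    @obj = {}
    @key = nil
  end

  def <<(item)
    if @key
      @obj[@key] = item   # second push: store the value under the pending key
      @key = nil
    else
      @key = item         # first push: remember the key
    end
    self
  end
end

# Events a streaming parser would emit for
# {"url": "url_1", "http_req": {"status": 200}}, replayed by hand.
stack = [ObjectNode.new]      # start_object (outer)
stack[-1] << "url"            # key
stack[-1] << "url_1"          # value
stack[-1] << "http_req"       # key
stack.push(ObjectNode.new)    # start_object (nested)
stack[-1] << "status"         # key
stack[-1] << 200              # value
inner = stack.pop             # end_object (nested)
stack[-1] << inner.obj        # nested hash becomes the value of "http_req"

result = stack.pop.obj        # end_object (outer)
p result
```

The stack depth is what lets end_object tell a finished resource (depth 2) apart from a nested hash inside one (depth 3 or more).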
    

    An example of the parser in use:

    def json
      <<-EOJ
      {
        "1": {
          "url": "url_1",
          "title": "title_1",
          "http_req": {
            "status": 200,
            "time": 10
          }
        },
        "2": {
          "url": "url_2",
          "title": "title_2",
          "http_req": {
            "status": 404,
            "time": -1
          }
        },
        "3": {
          "url": "url_1",
          "title": "title_1",
          "http_req": {
            "status": 200,
            "time": 10
          }
        },
        "4": {
          "url": "url_2",
          "title": "title_2",
          "http_req": {
            "status": 404,
            "time": -1
          }
        },
        "5": {
          "url": "url_1",
          "title": "title_1",
          "http_req": {
            "status": 200,
            "time": 10
          }
        },
        "6": {
          "url": "url_2",
          "title": "title_2",
          "http_req": {
            "status": 404,
            "time": -1
          }
        }
      }
      EOJ
    end
    
    
    require 'stringio'
    require 'pp'

    io = StringIO.new(json)
    resource_parser = ResourceParser.new(io, 100)
    
    count = 0
    resource_parser.to_enum.each do |resource|
      count += 1
      puts "READ: #{count}"
      pp resource
      break # stop after the first resource; the remaining chunks are never read
    end
    
    io.close
    

    Output:

    READ: 1
    {"url"=>"url_1", "title"=>"title_1", "http_req"=>{"status"=>200, "time"=>10}}
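
The reason this avoids reading the whole file is that the block inside Enumerator.new only runs as far as the consumer pulls. A minimal sketch of that behaviour using only the standard library, where the hypothetical each_chunk helper stands in for the parser's to_enum and yields raw chunks instead of parsed resources:

```ruby
require 'stringio'

# Hypothetical stand-in for the parser's to_enum: same read loop, but it
# yields each raw chunk rather than feeding it to a JSON parser.
def each_chunk(io, chunk_size)
  Enumerator.new do |yielder|
    until io.eof?
      yielder << io.read(chunk_size)
    end
  end
end

io = StringIO.new("x" * 1000)

# first(3) stops the enumerator as soon as it has three elements.
chunks = each_chunk(io, 100).first(3)

puts chunks.size  # 3
puts io.pos       # 300: only the first three chunks were read
```

Assuming the parser above behaves the same way, resource_parser.to_enum.first(10) should return the first ten resources and leave the rest of the input unread.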