
Load a large JSON file (100GB+) in Rust


I want to load a large JSON file (100GB+). The objects in this file do not share a fixed structure and are almost never the same. I found a crate called nop-json (https://crates.io/crates/nop-json/2.0.5), but I was unable to get it to work the way I want. This is my current solution, but it feels a bit like cheating.

    let file = File::open("./file.json")?;
    let reader = BufReader::new(file);
    for line in reader.lines() {
        //code
    }

I am reading the file like a text file and iterating over it line by line. The problem is that with this solution I am reading it as a string, and it loads the entire file into memory.

I am new to Rust, so I am looking for some help with this problem. I have a successful implementation in Python that works great, but it is too slow.

edit:

Thank you for the replies so far. Here is some more information:

My *.json file has one array containing millions of objects. Example:

[
    {
        "foo": "bar",
        "bar": "foor"
    },
    {
        "foo": "bar",
        "bar": "foor"
    },
    {
        "foo": "bar",
        "bar": "foor"
    },
    {
        "foo": "bar",
        "bar": "foor"
    }
]

etc..

The problem with reading the file as a text file is that not every object is exactly one line long; the number of lines per object varies.

One possible solution might be to read a chunk of the file and then check where the JSON object ended via something like the pattern }, {. But this seems inefficient.
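
The boundary-detection idea above can be sketched without matching the literal pattern }, {, which could also occur inside a string value. The sketch below instead tracks brace depth and string state byte by byte. The function name `next_object` is hypothetical, and it assumes every array element is an object; it returns the raw bytes of each object so they can be handed to any JSON parser.

```rust
use std::io::{self, Read};

/// Hypothetical helper: returns the raw bytes of the next top-level `{...}`
/// object, or `None` once the closing `]` or the end of input is reached.
/// Assumes every array element is an object.
fn next_object(reader: &mut impl Read) -> io::Result<Option<Vec<u8>>> {
    let mut object = Vec::new();
    let mut depth = 0usize; // current nesting level of `{ ... }`
    let mut in_string = false; // inside a JSON string literal?
    let mut escaped = false; // was the previous byte a backslash?

    for byte in reader.by_ref().bytes() {
        let byte = byte?;
        if depth == 0 {
            match byte {
                b'{' => {}                   // start of the next object
                b']' => return Ok(None),     // end of the outer array
                _ => continue,               // skip `[`, `,` and whitespace
            }
        }
        object.push(byte);
        if in_string {
            // Braces inside strings must not change the depth.
            if escaped {
                escaped = false;
            } else if byte == b'\\' {
                escaped = true;
            } else if byte == b'"' {
                in_string = false;
            }
            continue;
        }
        match byte {
            b'"' => in_string = true,
            b'{' => depth += 1,
            b'}' => {
                depth -= 1;
                if depth == 0 {
                    return Ok(Some(object));
                }
            }
            _ => {}
        }
    }
    Ok(None)
}
```

Because this reads one byte at a time, the reader should be buffered (for example a `BufReader` around the `File`). Only one object is held in memory at any moment.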


Solution

  • First off, if you need to accept arbitrary JSON, your problem is really hard.

    So I assume the following:

    • Your file always starts with a [.
    • Then, an arbitrary number of valid JSON values follow, separated by ,.
    • After the last JSON value there is a closing ].
    • Every single JSON value is small enough to be parsed and held in memory in its entirety.

    In other words, we have a sequence of separate, streamable JSON values wrapped in an array syntax that we parse ourselves.

    With that, we can use serde_json and a little bit of glue code to parse the file value by value:

    use std::error::Error;
    use std::io::Read;
    
    use serde_json::{Deserializer, Value};
    
    const JSON_FILE: &[u8] = br#"[
        {
            "foo": "bar",
            "bar": "foor"
        },
        {
            "foo": "bar",
            "bar": "foor"
        },
        {
            "foo": "bar",
            "bar": "foor"
        },
        {
            "foo": "bar",
            "bar": "foor"
        }
    ]"#;
    
    fn open_file() -> impl Read {
        JSON_FILE
    }
    
    fn take_json_value(input_stream: &mut dyn Read) -> Result<Value, Box<dyn Error>> {
        Ok(Deserializer::from_reader(input_stream)
            .into_iter()
            .next()
            .ok_or("Expected a JSON value!")??)
    }
    
    fn main() {
        // Is of type `impl Read`, and can only be read once.
        // (to reproduce the situation of reading a file)
        let mut input_stream = open_file();
    
        // Skip initial `[`
        let mut skipped = 0u8;
        input_stream
            .read_exact(std::slice::from_mut(&mut skipped))
            .unwrap();
        assert_eq!(skipped, b'[');
    
        loop {
            let value = take_json_value(&mut input_stream).unwrap();
    
            // Debug formatting, which shows the structure of the parsed value
            println!("- {:?}", value);
    
            // Skip `,` after the value
            input_stream
                .read_exact(std::slice::from_mut(&mut skipped))
                .unwrap();
            if skipped != b',' {
                break;
            }
        }
    
        // Verify that the ending `]` exists
        let mut leftover_data = vec![b'[', skipped];
        input_stream.read_to_end(&mut leftover_data).unwrap();
        serde_json::from_slice::<[u8; 0]>(&leftover_data).unwrap();
    }

    Output:

    - Object {"bar": String("foor"), "foo": String("bar")}
    - Object {"bar": String("foor"), "foo": String("bar")}
    - Object {"bar": String("foor"), "foo": String("bar")}
    - Object {"bar": String("foor"), "foo": String("bar")}
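
Two practical gaps remain when moving from the in-memory constant to a real file. The sketch below addresses both under stated assumptions. The helper names `open_buffered` and `next_non_whitespace` are hypothetical and not part of serde_json. The first wraps the file in a `BufReader`, since the `serde_json::from_reader` documentation notes that serde_json does not buffer its input. The second tolerates whitespace between a value and the following `,` or `]`, which the single-byte `read_exact` in the code above does not.

```rust
use std::fs::File;
use std::io::{self, BufReader, Read};
use std::path::Path;

/// Hypothetical replacement for `open_file()`: opens a real file behind a
/// buffer so that byte-wise reads do not each hit the operating system.
fn open_buffered(path: &Path) -> io::Result<BufReader<File>> {
    Ok(BufReader::new(File::open(path)?))
}

/// Hypothetical helper: skips JSON whitespace and returns the first byte
/// that is not whitespace (expected to be `,` or `]` after a value).
fn next_non_whitespace(reader: &mut dyn Read) -> io::Result<u8> {
    let mut byte = 0u8;
    loop {
        reader.read_exact(std::slice::from_mut(&mut byte))?;
        // JSON whitespace is space, tab, line feed and carriage return.
        if !matches!(byte, b' ' | b'\t' | b'\n' | b'\r') {
            return Ok(byte);
        }
    }
}
```

In `main`, the `read_exact` call after each value could then become `skipped = next_non_whitespace(&mut input_stream).unwrap();`. The rest of the loop and the final check for the closing `]` can stay as they are.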