I am looking to load a large JSON file, over 100 GB. The objects in this file aren't static and are almost never the same. I found a crate called nop-json (https://crates.io/crates/nop-json/2.0.5), but I was unable to get it to work the way I want. This is my current solution, but it feels a bit like cheating:
let file = File::open("./file.json")?;
let reader = BufReader::new(file);
for line in reader.lines() {
    // code
}
I am reading the file like a text file and iterating over it line by line. The problem with this solution is that I am reading the data as strings, and that it loads the entire file into memory.
I am new to Rust, so I am looking for some help with this problem. I have a successful implementation in Python that works great, but it's too slow.
Edit:
Thank you for the replies so far. Here is some more information:
My *.json file has one array containing millions of objects. Example:
[
    {
        "foo": "bar",
        "bar": "foor"
    },
    {
        "foo": "bar",
        "bar": "foor"
    },
    {
        "foo": "bar",
        "bar": "foor"
    },
    {
        "foo": "bar",
        "bar": "foor"
    }
]
etc..
The problem with reading the file as a text file this way is that not every object is exactly one line. The number of lines per object varies.
One possible solution might be to read a chunk of the file and then check where the JSON object ended via something like the pattern `}, {`. But this seems inefficient.
First off, if you accept normal full JSON, your problem is really hard.
So I assume the following:
The file starts with `[`.
Objects are separated by `,`.
The file ends with `]`.
Meaning, we now have a bunch of separate, streamable JSON objects that are wrapped in our own array representation.
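If a given file puts whitespace between an object and the `,` that follows it, reading exactly one byte for the separator would no longer be enough. A minimal, std-only sketch of a more tolerant separator read is shown here; the name `next_non_whitespace` is hypothetical and not part of any library.

```rust
use std::io::{self, Read};

/// Hypothetical helper: returns the next byte that is not JSON whitespace,
/// or `None` once the input is exhausted.
fn next_non_whitespace(reader: &mut dyn Read) -> io::Result<Option<u8>> {
    let mut byte = 0u8;
    loop {
        // `read` returning 0 for a non-empty buffer signals end of input.
        if reader.read(std::slice::from_mut(&mut byte))? == 0 {
            return Ok(None);
        }
        // JSON whitespace is space, tab, line feed and carriage return.
        if !matches!(byte, b' ' | b'\t' | b'\n' | b'\r') {
            return Ok(Some(byte));
        }
    }
}
```

Called after each parsed value, it would be expected to yield `Some(b',')` when more objects follow and `Some(b']')` at the end of the array; anything else could be treated as a malformed file.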
With that, we can use serde_json and a little bit of glue to parse the file value by value:
use std::error::Error;
use std::io::Read;

use serde_json::{Deserializer, Value};

// In-memory stand-in for the real file.
const JSON_FILE: &[u8] = br#"[
    {
        "foo": "bar",
        "bar": "foor"
    },
    {
        "foo": "bar",
        "bar": "foor"
    },
    {
        "foo": "bar",
        "bar": "foor"
    },
    {
        "foo": "bar",
        "bar": "foor"
    }
]"#;

fn open_file() -> impl Read {
    JSON_FILE
}

// Parses exactly one JSON value from the stream and leaves the rest unread.
fn take_json_value(input_stream: &mut dyn Read) -> Result<Value, Box<dyn Error>> {
    Ok(Deserializer::from_reader(input_stream)
        .into_iter()
        .next()
        .ok_or("Expected a JSON value!")??)
}

fn main() {
    // Is of type `impl Read`, and can only be read once.
    // (to reproduce the situation of reading a file)
    let mut input_stream = open_file();

    // Skip initial `[`
    let mut skipped = 0u8;
    input_stream
        .read_exact(std::slice::from_mut(&mut skipped))
        .unwrap();
    assert_eq!(skipped, b'[');

    loop {
        let value = take_json_value(&mut input_stream).unwrap();
        // Debug formatting, to match the output shown below
        println!("- {:?}", value);

        // Skip `,` after the value
        input_stream
            .read_exact(std::slice::from_mut(&mut skipped))
            .unwrap();
        if skipped != b',' {
            break;
        }
    }

    // Verify that the ending `]` exists:
    // the leftover data must parse as an empty array
    let mut leftover_data = vec![b'[', skipped];
    input_stream.read_to_end(&mut leftover_data).unwrap();
    serde_json::from_slice::<[u8; 0]>(&leftover_data).unwrap();
}
- Object {"bar": String("foor"), "foo": String("bar")}
- Object {"bar": String("foor"), "foo": String("bar")}
- Object {"bar": String("foor"), "foo": String("bar")}
- Object {"bar": String("foor"), "foo": String("bar")}
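To run this against the real file instead of the in-memory stand-in, `open_file` could return a buffered file handle. The sketch below is an assumption about how that might look, not part of the program above: it takes a path and returns an `io::Result`, so the call site in `main` would need something like `open_file(Path::new("./file.json")).unwrap()`. The buffering matters because serde_json does not buffer a reader on its own, and its documentation recommends wrapping a `File` in a `BufReader`.

```rust
use std::fs::File;
use std::io::{self, BufReader, Read};
use std::path::Path;

/// Sketch of a file-backed replacement for `open_file`.
/// The path parameter and the `io::Result` return type are assumptions.
fn open_file(path: &Path) -> io::Result<impl Read> {
    let file = File::open(path)?;
    // Without buffering, byte-sized reads would each go to the OS;
    // `BufReader` serves them from an in-memory buffer instead.
    Ok(BufReader::new(file))
}
```

Only one object is held in memory at a time this way, so memory use depends on the largest single object rather than on the size of the file.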