I am currently using serde-jsonlines
to serialize a large number of identical structs to file. I am then using rayon
to read data out of this file in parallel using par_bridge
:
let mut reader = JsonLinesReader::new(input_file);
let results: Vec<ResultStruct> = reader
    .read_all::<MyStruct>()
    .par_bridge()
    .into_par_iter()
    .map(|my_struct| {
        // process my_struct and return the result
        result
    })
    .collect();
This works because JsonLinesReader
's read_all
method returns an iterator over the deserialized lines of the input file. I'd like to use bincode
to encode my structs instead, since this results in a smaller file on disk. I have the following playground that works as expected:
use bincode;
use serde::{Deserialize, Serialize};
use std::fs::File;
use std::io::{BufWriter, Write};

#[derive(Debug, Deserialize, Serialize)]
struct MyStruct {
    name: String,
    value: Vec<u64>,
}

pub fn playground() {
    let s1 = MyStruct {
        name: "Hello".to_string(),
        value: vec![1, 2, 3],
    };
    let s2 = MyStruct {
        name: "World!".to_string(),
        value: vec![3, 4, 5, 6],
    };

    let out_file = File::create("test.bin").expect("Unable to create file");
    let mut writer = BufWriter::new(out_file);
    let s1_encoded: Vec<u8> = bincode::serialize(&s1).unwrap();
    writer.write_all(&s1_encoded).expect("Unable to write data");
    let s2_encoded: Vec<u8> = bincode::serialize(&s2).unwrap();
    writer.write_all(&s2_encoded).expect("Unable to write data");
    drop(writer);

    let mut in_file = File::open("test.bin").expect("Unable to open file");
    let s1_decoded: MyStruct =
        bincode::deserialize_from(&mut in_file).expect("Unable to read data");
    let s2_decoded: MyStruct =
        bincode::deserialize_from(&mut in_file).expect("Unable to read data");
    println!("s1_decoded: {:?}", s1_decoded);
    println!("s2_decoded: {:?}", s2_decoded);
}
Is it possible to read the structs out in parallel in a manner similar to what I am currently doing with serde-jsonlines
? I imagine this might not be possible since each struct is not terminated by a newline and thus there is no sensible way to chunk up the input stream to allow processing by multiple threads.
Note that the serde-jsonlines
code uses a single thread to parse the JSON, and only goes multithreaded for the map
processing. The same thing can be done with bincode
:
// Requires `use std::iter;` (or call `std::iter::from_fn` directly).
let results: Vec<ResultStruct> =
    iter::from_fn(move || bincode::deserialize_from(&mut in_file).ok())
        .par_bridge()
        .map(|my_struct| {
            // process my_struct and return the result
            result
        })
        .collect();
(I also removed the redundant call to into_par_iter
, since par_bridge
already creates a parallel iterator.)