rust · serde · bincode

Read bincode serialized structs in parallel from file


I am currently using serde-jsonlines to serialize a large number of identical structs to a file. I then use rayon to read the data back out of this file in parallel using par_bridge:

let mut reader = JsonLinesReader::new(input_file);
let results: Vec<ResultStruct> = reader
    .read_all::<MyStruct>()
    .par_bridge()
    .into_par_iter()
    .map(|my_struct| {
        // do processing of my struct and return result
        result
    })
    .collect();

This works since JsonLinesReader returns an iterator over the lines of the input file. I'd like to use bincode to encode my structs instead, since it results in a smaller file on disk. I have the following playground that works as expected:

use bincode;
use serde::{Deserialize, Serialize};
use std::fs::File;
use std::io::{BufWriter, Write};

#[derive(Debug, Deserialize, Serialize)]
struct MyStruct {
    name: String,
    value: Vec<u64>,
}

pub fn playground() {
    let s1 = MyStruct {
        name: "Hello".to_string(),
        value: vec![1, 2, 3],
    };
    let s2 = MyStruct {
        name: "World!".to_string(),
        value: vec![3, 4, 5, 6],
    };

    let out_file = File::create("test.bin").expect("Unable to create file");
    let mut writer = BufWriter::new(out_file);

    let s1_encoded: Vec<u8> = bincode::serialize(&s1).unwrap();
    writer.write_all(&s1_encoded).expect("Unable to write data");

    let s2_encoded: Vec<u8> = bincode::serialize(&s2).unwrap();
    writer.write_all(&s2_encoded).expect("Unable to write data");

    drop(writer);

    let mut in_file = File::open("test.bin").expect("Unable to open file");
    let s1_decoded: MyStruct =
        bincode::deserialize_from(&mut in_file).expect("Unable to read data");
    let s2_decoded: MyStruct =
        bincode::deserialize_from(&mut in_file).expect("Unable to read data");

    println!("s1_decoded: {:?}", s1_decoded);
    println!("s2_decoded: {:?}", s2_decoded);
}

Is it possible to read the structs out in parallel in a manner similar to what I am currently doing with serde-jsonlines? I imagine this might not be possible, since each struct is not terminated by a newline, so there is no sensible way to chunk up the input stream for processing by multiple threads.


Solution

  • Note that the serde-jsonlines code uses a single thread to parse the JSON, and only goes multithreaded for the map processing. The same thing can be done with bincode:

    // Requires `use std::iter;` and `use rayon::prelude::*;`. The turbofish on
    // deserialize_from tells the compiler which type to decode.
    let results: Vec<ResultStruct> = iter::from_fn(
            move || bincode::deserialize_from::<_, MyStruct>(&mut in_file).ok())
        .par_bridge()
        .map(|my_struct| {
            // do processing of my struct and return result
            result
        })
        .collect();


    (I also removed the redundant call to into_par_iter since par_bridge already creates a parallel iterator).
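
    For completeness, here is a minimal self-contained sketch that combines the playground above with this iter::from_fn / par_bridge approach (assuming bincode 1.x and rayon; the summing step is only a placeholder for whatever the real per-struct processing is):

    use rayon::prelude::*;
    use serde::{Deserialize, Serialize};
    use std::fs::File;
    use std::io::{BufReader, BufWriter, Write};
    use std::iter;

    #[derive(Debug, Deserialize, Serialize)]
    struct MyStruct {
        name: String,
        value: Vec<u64>,
    }

    fn main() {
        // Write a few structs back to back, exactly as in the playground.
        let mut writer =
            BufWriter::new(File::create("test.bin").expect("Unable to create file"));
        for s in [
            MyStruct { name: "Hello".to_string(), value: vec![1, 2, 3] },
            MyStruct { name: "World!".to_string(), value: vec![3, 4, 5, 6] },
        ] {
            let encoded: Vec<u8> = bincode::serialize(&s).unwrap();
            writer.write_all(&encoded).expect("Unable to write data");
        }
        drop(writer);

        // Deserialize on a single thread; .ok() stops the iterator at end of
        // file (or on a decode error). par_bridge hands each struct to rayon.
        let mut reader = BufReader::new(File::open("test.bin").expect("Unable to open file"));
        let results: Vec<u64> = iter::from_fn(move || {
            bincode::deserialize_from::<_, MyStruct>(&mut reader).ok()
        })
        .par_bridge()
        .map(|my_struct| my_struct.value.iter().sum::<u64>())
        .collect();

        // Note: par_bridge does not preserve order, so `results` may not match
        // the order in which the structs were written.
        println!("results: {:?}", results);
    }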