
How To Assign Slices While Iterating Over a Vec in Rust without copying?


I am trying to efficiently parse CSV files line by line without unnecessary memory allocation.

Since we can't index into strings in Rust, my idea was to create a struct for each line that has an owned Vec<char> of the line characters and several &[char] slices representing the locations in that Vec of the fields that will require further processing.

I am supporting English only, so there's no need for Unicode graphemes.

I grab each line from the BufReader, collect it into my Vec<char> and then iterate over the characters to note the correct offsets for each field slice:

let mut r_line: String;
let mut char_count: usize;
let mut comma_count: usize;
let mut payload_start: usize;
for stored in &ms7_files {
    let reader = BufReader::new(File::open(&stored.as_path()).unwrap());
    for line in reader.lines() {
        r_line = line.unwrap().to_string();
        let r_chars: Vec<char> = r_line.chars().collect();
        char_count = 0;
        comma_count = 0;
        payload_start = 0;
        for chara in r_chars {
            char_count += 1;
            if chara == ',' {
                comma_count += 1;
                if comma_count == 1 {
                    let r_itemid = &r_chars[0..char_count - 1];
                    payload_start = char_count + 1;
                } else if comma_count == 2 {
                    let r_date = &r_chars[payload_start..char_count - 1];
                    let r_payload = & r_chars[payload_start..r_line.len() - 1];
                }
            }
        }
        // Code omitted here to initialize a struct described in my
        // text above and add it to a Vec for later processing
    }
}

All goes swimmingly until the code inside the if tests on comma_count, where I attempt to create char slices into the Vec. When I attempt to compile, I get the dreaded:

proc_sales.rs:87:23: 87:30 error: use of moved value: `r_chars` [E0382]
proc_sales.rs:87                        let r_itemid = &r_chars[0..char_count - 1];
                                                        ^~~~~~
proc_sales.rs:87:23: 87:30 help: run `rustc --explain E0382` to see a detailed explanation
proc_sales.rs:82:17: 82:24 note: `r_chars` moved here because it has type `collections::vec::Vec<char>`, which is non-copyable
proc_sales.rs:82            for chara in r_chars {
                                     ^~~~~~~

for each of the attempts to create a slice. I can basically understand why the compiler is complaining. What I'm trying to figure out is a better strategy to collect and process this data without resorting to a lot of copying and cloning. Heck, if I could leave the original String (for each file line) owned by the BufReader and just hold on to slices into that, I would!

Feel free to comment on fixing up the above code as well as suggestions for alternative approaches to this problem that limit unnecessary copying.


Solution

  • BufReader::lines returns an iterator that produces Result<String> items, so every line it yields is a freshly allocated String (note that in line.unwrap().to_string(), the to_string() call is redundant: unwrap already gives you a String).

    If you want to minimize allocations, you can use BufReader::read_line.

    To split the fields of a CSV line you can use str::split.

    Here is a simplified version of your code:

    use std::io::{BufRead, BufReader};
    use std::fs::File;
    
    fn main() {
        let mut line = String::new();
        let ms7_files = ["file1.csv", "file2.csv"];
        for stored in &ms7_files {
            let mut reader = BufReader::new(File::open(stored).unwrap());
            while reader.read_line(&mut line).unwrap() > 0 {
                // scope the iterator so its borrow of line ends before line.clear()
                {
                    // does not allocate
                    let mut it = line.split(',');
                    // item_id, date and payload are string slices, that is &str
                    let item_id = it.next().expect("no item_id field");
                    let date = it.next().expect("no date field");
                    let payload = it.next().expect("no payload field");
                    // process fields
                }
                // sets len of line to 0, but does not deallocate, so the
                // buffer is reused for the next read_line call
                line.clear();
            }
        }
    }
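
Two details of the loop above are worth a closer look. read_line keeps the line terminator in the buffer, so the last field would end in "\n", and split(',') cuts at every comma, while the question's payload appears to be everything after the second comma. A minimal sketch of one way to handle both, assuming that reading of the payload and assuming no quoted fields (split_fields is a hypothetical helper name, not part of any library):

```rust
/// Hypothetical helper: borrows three fields out of one CSV line
/// without allocating. Returns None if fewer than two commas are present.
fn split_fields(line: &str) -> Option<(&str, &str, &str)> {
    // read_line keeps the terminator, so strip "\n" or "\r\n" first.
    let line = line.trim_end_matches(|c: char| c == '\n' || c == '\r');
    // splitn(3, ',') stops after the second comma, so the third piece
    // is the whole rest of the line, commas included.
    let mut it = line.splitn(3, ',');
    let item_id = it.next()?;
    let date = it.next()?;
    let payload = it.next()?;
    Some((item_id, date, payload))
}

fn main() {
    // Made-up sample line, as read_line would leave it in the buffer.
    let line = String::from("A100,2015-11-02,widget,blue,3\n");
    if let Some((item_id, date, payload)) = split_fields(&line) {
        println!("{} | {} | {}", item_id, date, payload);
    }
}
```

The returned slices borrow from line, so they have to be processed (or copied) before line.clear() is called.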
    

    You may also want to take a look at various crates to work with CSV files.
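
As for the E0382 error itself: `for chara in r_chars` consumes the Vec, so it can no longer be borrowed inside the loop; iterating over `&r_chars` instead avoids the move. The Vec<char> is not needed at all, though, because a &str can be sliced by byte offsets as long as they fall on character boundaries. A rough sketch of the manual-offset approach on the string itself, under the same assumption that the payload is everything after the second comma (slice_by_offsets is a hypothetical name):

```rust
/// Hypothetical helper: returns (item_id, date, payload) as slices of `line`.
/// Returns None if the line has fewer than two commas.
fn slice_by_offsets(line: &str) -> Option<(&str, &str, &str)> {
    // match_indices yields the byte offset of every comma; nothing is copied.
    let mut commas = line.match_indices(',').map(|(i, _)| i);
    let first = commas.next()?;
    let second = commas.next()?;
    // A comma is a single byte in UTF-8, so offset + 1 is always a
    // valid char boundary, even if the fields contain non-ASCII text.
    Some((&line[..first], &line[first + 1..second], &line[second + 1..]))
}

fn main() {
    // Made-up sample line with the terminator already removed.
    let line = "A100,2015-11-02,widget,blue,3";
    println!("{:?}", slice_by_offsets(line));
}
```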