I am trying to efficiently parse CSV files line by line without unnecessary memory allocation. Since we can't index into strings in Rust, my idea was to create a struct for each line that holds an owned Vec<char> of the line's characters plus several &[char] slices marking the locations in that Vec of the fields that will require further processing.
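Roughly, the struct I have in mind looks like this (field names are just illustrative; wiring up the lifetime is the part I haven't worked out):

struct LineFields<'a> {
    chars: Vec<char>,     // owned characters of one line
    item_id: &'a [char],  // slices into that character buffer
    date: &'a [char],
    payload: &'a [char],
}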
I am supporting English only, so there's no need to worry about Unicode graphemes. I grab each line from the BufReader, collect it into my Vec<char>, and then iterate over the characters to note the correct offsets for each field slice:
let mut r_line: String;
let mut char_count: usize;
let mut comma_count: usize;
let mut payload_start: usize;

for stored in &ms7_files {
    let reader = BufReader::new(File::open(&stored.as_path()).unwrap());
    for line in reader.lines() {
        r_line = line.unwrap().to_string();
        let r_chars: Vec<char> = r_line.chars().collect();
        char_count = 0;
        comma_count = 0;
        payload_start = 0;
        for chara in r_chars {
            char_count += 1;
            if chara == ',' {
                comma_count += 1;
                if comma_count == 1 {
                    let r_itemid = &r_chars[0..char_count - 1];
                    payload_start = char_count + 1;
                } else if comma_count == 2 {
                    let r_date = &r_chars[payload_start..char_count - 1];
                    let r_payload = &r_chars[payload_start..r_line.len() - 1];
                }
            }
        }
        // Code omitted here to initialize a struct described in my
        // text above and add it to a Vec for later processing
    }
}
All goes swimmingly until the code inside the if tests on comma_count, where I attempt to create char slices into the Vec. When I attempt to compile, I get the dreaded:
proc_sales.rs:87:23: 87:30 error: use of moved value: `r_chars` [E0382]
proc_sales.rs:87 let r_itemid = &r_chars[0..char_count - 1];
^~~~~~
proc_sales.rs:87:23: 87:30 help: run `rustc --explain E0382` to see a detailed explanation
proc_sales.rs:82:17: 82:24 note: `r_chars` moved here because it has type `collections::vec::Vec<char>`, which is non-copyable
proc_sales.rs:82 for chara in r_chars {
^~~~~~~
for each of the attempts to create a slice. I can basically understand why the compiler is complaining; what I'm trying to figure out is a better strategy to collect and process this data without resorting to a lot of copying and cloning. Heck, if I could leave the original String (for each file line) owned by the BufReader and just hold on to slices into that, I would! Feel free to comment on fixing up the above code, as well as to suggest alternative approaches to this problem that limit unnecessary copying.
BufReader::lines returns an iterator that produces Result<String> items, and a new String is allocated for every line it yields (note also that in line.unwrap().to_string(), the to_string() call is redundant, since unwrap() already gives you a String).
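In your loop that just means:

for line in reader.lines() {
    // `line` is io::Result<String>; unwrap() already yields an owned String,
    // so no extra .to_string() is needed
    let r_line: String = line.unwrap();
    // ...
}

Every iteration still allocates a fresh String for r_line, though.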
If you want to minimize allocations, you can use BufReader::read_line, which reads into a String buffer that you can reuse from line to line. To split the fields of a CSV line you can use str::split.
Here is a simplified version of your code:
use std::io::{BufRead, BufReader};
use std::fs::File;

fn main() {
    let mut line = String::new();
    let ms7_files = ["file1.csv", "file2.csv"];
    for stored in &ms7_files {
        let mut reader = BufReader::new(File::open(stored).unwrap());
        while reader.read_line(&mut line).unwrap() > 0 {
            // inner scope so the borrows of `line` end before line.clear()
            {
                // split does not allocate; it yields &str slices into `line`
                // (trim_end drops the trailing newline that read_line keeps)
                let mut it = line.trim_end().split(',');
                // item_id, date and payload are string slices, that is &str
                let item_id = it.next().expect("no item_id field");
                let date = it.next().expect("no date field");
                let payload = it.next().expect("no payload field");
                // process fields
            }
            // sets the len of `line` to 0, but does not deallocate its buffer
            line.clear();
        }
    }
}
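If you do want to keep your original Vec<char> approach, the immediate E0382 error goes away once you iterate over a reference to the vector (for chara in &r_chars, or r_chars.iter()) instead of over the vector by value, so it is only borrowed and stays available for slicing. A rough sketch of that inner loop, using enumerate() instead of a manual counter:

let r_line = line.unwrap();
let r_chars: Vec<char> = r_line.chars().collect();
let mut comma_count = 0;
let mut payload_start = 0;
// iterating over r_chars.iter() only borrows the Vec,
// so indexing into it below still compiles
for (i, &chara) in r_chars.iter().enumerate() {
    if chara == ',' {
        comma_count += 1;
        if comma_count == 1 {
            let r_itemid = &r_chars[0..i];  // first field
            payload_start = i + 1;
        } else if comma_count == 2 {
            let r_date = &r_chars[payload_start..i];   // second field
            let r_payload = &r_chars[payload_start..]; // everything after the first comma
            // build your struct from these slices here
        }
    }
}

Note, though, that slices borrowed from r_chars cannot be stored in the same struct that owns that Vec (that would make it self-referential); storing (start, end) index pairs instead of &[char] slices is the usual way around that.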
You may also want to take a look at the existing crates for working with CSV files, such as the csv crate.
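For example, a minimal sketch using the csv crate (assuming version 1.x of that crate and a reasonably recent compiler; the field layout mirrors your item_id,date,payload format):

use std::error::Error;
use csv::ReaderBuilder;

fn main() -> Result<(), Box<dyn Error>> {
    let ms7_files = ["file1.csv", "file2.csv"];
    for stored in &ms7_files {
        // has_headers(false) because the files apparently have no header row
        let mut rdr = ReaderBuilder::new()
            .has_headers(false)
            .from_path(stored)?;
        // reusing one StringRecord avoids allocating a new record per line
        let mut record = csv::StringRecord::new();
        while rdr.read_record(&mut record)? {
            let item_id = record.get(0).expect("no item_id field");
            let date = record.get(1).expect("no date field");
            let payload = record.get(2).expect("no payload field");
            // process the fields here; printing is just a placeholder
            println!("{} {} {}", item_id, date, payload);
        }
    }
    Ok(())
}

This also handles quoting and escaping for you, which a plain split(',') does not.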