rust, data-science, rust-ndarray

Splitting a Vec of strings into Vec<Vec<String>>


I am attempting to relearn data science in Rust.

I have a Vec<String> that contains a field delimiter "|" and an end-of-row marker "!end".

What I'd like to end up with is a Vec<Vec<String>> that can be put into a 2D ndarray array.

I have this Python code:

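import pandas as pd

file = open('somefile.dat')
lst = []
for line in file:
    # Each line becomes one row of '|'-separated fields
    lst.append(line.split('|'))

df = pd.DataFrame(lst)
# column_names is defined elsewhere
SAMV2FinalDataFrame = pd.DataFrame(lst, columns=column_names)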

And I've tried to recreate it here in Rust:



fn lines_from_file(filename: impl AsRef<Path>) -> Vec<String> {
    let file = File::open(filename).expect("no such file");
    let buf = BufReader::new(file);
    buf.lines()
        .map(|l| l.expect("Could not parse line"))
        .collect()
}

fn main() {
    let lines = lines_from_file(".dat");
    let mut new_arr = vec![];
    // Here I get an immutable borrow error on `lines`
    for line in lines{
        new_arr.push([*line.split("!end")]);
    }

    // Here I get "expected closure, found str"
let x = lines.split("!end");



let array = Array::from(lines)

What I have: ['1','1','1','!end','2','2','2','!end']
What I need: [['1','1','1'],['2','2','2']]

Edit: Also, why does the turbofish disappear when I type it on Stack Overflow?


Solution

  • I think part of the issue you ran into was due to how you worked with arrays. For example, Vec::push only adds a single element, so to append several rows at once you would want Vec::extend instead. I also ran into a few empty strings, because splitting by "!end" leaves trailing '|' characters on the ends of the substrings. The errors were quite strange; I am not completely sure where the closure came from.

    let lines = vec!["1|1|1|!end|2|2|2|!end".to_string()];
    let mut new_arr = Vec::new();
    
    // Iterate over &lines so we don't consume lines and it can be used again later
    for line in &lines {
        new_arr.extend(line.split("!end")
            // Remove trailing empty string
            .filter(|x| !x.is_empty())
            // Convert each &str into a Vec<String>
            .map(|x| {
                x.split('|')
                    // Remove empty strings produced at the ends of the split (e.g. "|2|2|2|")
                    .filter(|x| !x.is_empty())
                    // Convert &str into owned String
                    .map(|x| x.to_string())
                    // Turn iterator into Vec<String>
                    .collect::<Vec<_>>()
        }));
    }
    
    println!("{:?}", new_arr);
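
As for the "expected closure" error in the question, it most likely comes from calling split on the Vec<String> itself rather than on a string: the slice method split takes a predicate closure, whereas str::split takes a pattern such as "!end". If the data is already one token per element, as in the question's example list, the slice version can do the grouping directly. A minimal sketch, assuming that layout (group_rows is a hypothetical helper name, not something from the question):

```rust
/// Hypothetical helper: groups a flat list of tokens into rows,
/// treating each "!end" token as a row terminator.
fn group_rows(tokens: &[String]) -> Vec<Vec<String>> {
    tokens
        // Slice `split` takes a closure that returns true for separator elements
        .split(|t| t.as_str() == "!end")
        // A trailing "!end" yields one final empty slice; drop empty rows
        .filter(|row| !row.is_empty())
        // Copy each borrowed sub-slice into an owned Vec<String>
        .map(|row| row.to_vec())
        .collect()
}
```

Called with the tokens "1", "1", "1", "!end", "2", "2", "2", "!end", this should produce two rows of three strings each. Note that it borrows the input, so the original Vec can still be used afterwards.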
    

    I also came up with another version that should handle your use case better. The earlier approach dropped all empty strings, whereas this one preserves them and still handles "!end" correctly.

    use std::io::{self, BufRead, BufReader, Read, Cursor};
    
    fn split_data<R: Read>(buffer: &mut R) -> io::Result<Vec<Vec<String>>> {
        let mut sections = Vec::new();
        let mut current_section = Vec::new();
        
        for line in BufReader::new(buffer).lines() {
            for item in line?.split('|') {
                if item != "!end" {
                    current_section.push(item.to_string());
                } else {
                    sections.push(current_section);
                    current_section = Vec::new();
                }
            }
        }
            
        Ok(sections)
    }
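
One edge case in split_data worth keeping in mind: a row is only pushed when "!end" is seen, so any fields after the last "!end" are silently dropped. If that matters for your data, a variant along these lines flushes the leftover row at the end. This is only a sketch under that assumption; split_data_keep_tail is a hypothetical name, and it takes a BufRead directly instead of wrapping a Read.

```rust
use std::io::{self, BufRead};

/// Hypothetical variant of split_data that keeps a final row
/// even when it is not terminated by "!end".
fn split_data_keep_tail<R: BufRead>(reader: R) -> io::Result<Vec<Vec<String>>> {
    let mut sections = Vec::new();
    let mut current = Vec::new();

    for line in reader.lines() {
        for item in line?.split('|') {
            if item == "!end" {
                // Move the finished row out and leave an empty Vec behind
                sections.push(std::mem::take(&mut current));
            } else {
                current.push(item.to_string());
            }
        }
    }

    // Flush whatever is left after the last "!end"
    if !current.is_empty() {
        sections.push(current);
    }

    Ok(sections)
}
```

Whether an unterminated last row should be kept or treated as an error depends on the file format, so this may not be the behavior you want.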
    

    I made split_data generic over Read for easier testing, but it works with a file as well.

    let sample_input = b"1|1|1|!end|2|2|2|!end";
    println!("{:?}", split_data(&mut Cursor::new(sample_input)));
    // Output: Ok([["1", "1", "1"], ["2", "2", "2"]])
    
    // You can also use a file instead (needs `use std::fs::File;`)
    let mut file = File::open("somefile.dat").expect("no such file");
    let solution: Vec<Vec<String>> = split_data(&mut file).unwrap();
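
Since the end goal is a 2D array, the rows also need to be rectangular before they can be handed to ndarray. The sketch below uses only the standard library: it checks that every row has the same length and flattens the rows into a single row-major Vec plus a (rows, cols) shape. flatten_rows is a hypothetical helper, not part of any library; I would expect its output to be usable with ndarray's Array2::from_shape_vec, but I have not wired that in here.

```rust
/// Hypothetical helper: flattens rectangular rows into one row-major Vec
/// and returns it together with its (rows, cols) shape.
fn flatten_rows(rows: Vec<Vec<String>>) -> Result<(Vec<String>, (usize, usize)), String> {
    let n_rows = rows.len();
    // Use the first row's length as the expected column count (0 if there are no rows)
    let n_cols = rows.first().map_or(0, |r| r.len());
    let mut flat = Vec::with_capacity(n_rows * n_cols);

    for (i, row) in rows.into_iter().enumerate() {
        if row.len() != n_cols {
            return Err(format!(
                "row {} has {} fields, expected {}",
                i,
                row.len(),
                n_cols
            ));
        }
        // Append this row's fields in order, preserving row-major layout
        flat.extend(row);
    }

    Ok((flat, (n_rows, n_cols)))
}
```

Returning an error for ragged rows is a design choice; padding short rows with empty strings would be another option if the data allows it.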
    
