csv, rust, rust-polars

Rust, polars CSV: Is there a way of reading CSV from an `impl BufRead` (or any iterator over bytes)?


I am parsing a bit of a funky, proprietary plain-text file format.

The format contains CSV mixed with non-CSV content. I am only interested in the CSV part, which is located somewhere in the file, surrounded by non-CSV data.

I was wondering if it's possible to give CsvReader/LazyCsvReader something like an std::io::BufReader or even a Vec<u8> that contains the CSV contents, instead of having to provide an AsRef<Path> (which has to point to a file, if I'm not mistaken).

I want to initialize a CSV reader in one of the following ways:

  • Give it a BufReader that wraps the lines I want to read
  • Give it a Vec<u8> which contains all the bytes I want to read.

Can this be done, or do I have to write a temporary file, containing only the CSV?

I tried giving CsvReader a BufReader<File>, where I had already advanced the .lines() iterator to where my data starts. But it seems that CsvReader moves the cursor to the start of the stream before reading.


Solution

  • You can pass your Vec<u8> (or anything else that implements AsRef<[u8]>) wrapped in a Cursor to CsvReader::new:

    use polars::prelude::*;
    use std::io::Cursor;
    fn main() {
        let bytes = b"a,b,c\nd,e,f\ng,h,i".to_vec();
        let reader = CsvReader::new(Cursor::new(bytes));
        dbg!(reader.finish().unwrap());
    }
    

    If your CSV data is separated from the surrounding proprietary data by newlines, you can also use with_skip_rows and with_n_rows to skip the leading foreign data and read only the rows that are actual CSV:

        let reader = CsvReader::from_path("tmp.data")
            .unwrap()
            .with_skip_rows(1)
            .with_n_rows(Some(2));
    

    tmp.data:

    some proprietary none csv data
    a,b,c
    d,e,f
    g,h,i
    even more proprietary none csv data
    

    Both variants produce the same DataFrame:

    ┌─────┬─────┬─────┐
    │ a   ┆ b   ┆ c   │
    │ --- ┆ --- ┆ --- │
    │ str ┆ str ┆ str │
    ╞═════╪═════╪═════╡
    │ d   ┆ e   ┆ f   │
    │ g   ┆ h   ┆ i   │
    └─────┴─────┴─────┘
    

    Other than that, I don't think you can use a BufReader without implementing MmapBytesReader for a wrapper of it.
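
If you already hold a BufReader (or any other `impl BufRead`), one possible workaround is to copy just the CSV lines into a Vec<u8> first and then wrap that in a Cursor as in the first snippet. The following is only a sketch using the standard library: the function name `extract_csv_bytes` and its `skip` / `n_lines` parameters are hypothetical, and it assumes you know how many leading lines to skip and how many lines (header included) belong to the CSV part. It also assumes the data is valid UTF-8, since `lines()` yields `String`s.

```rust
use std::io::{self, BufRead};

/// Copies `n_lines` lines from `reader` into an owned buffer, after skipping
/// the first `skip` lines. Hypothetical helper, not part of polars.
/// Note: `n_lines` counts the CSV header line as well.
fn extract_csv_bytes<R: BufRead>(reader: R, skip: usize, n_lines: usize) -> io::Result<Vec<u8>> {
    let mut out = Vec::new();
    for line in reader.lines().skip(skip).take(n_lines) {
        // `lines()` strips the line terminator, so add a plain '\n' back.
        out.extend_from_slice(line?.as_bytes());
        out.push(b'\n');
    }
    Ok(out)
}
```

The returned bytes could then be handed to `CsvReader::new(Cursor::new(bytes))`. This copies the CSV section into memory once, which is usually fine for moderately sized files and avoids writing a temporary file.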