I am parsing a bit of a funky, proprietary plain-text file format.
The format contains CSV mixed with non-CSV. I am only interested in the CSV part, which is located somewhere in the file, surrounded by non-csv.
I was wondering if it's possible to give CsvReader/LazyCsvReader something like a std::io::BufReader or even a Vec<u8> that contains the CSV contents, instead of having to provide an AsRef<Path> (which has to point to a file, if I'm not mistaken).
I want to initialize a CSV reader from one of the following:
- a BufReader that wraps the lines I want to read
- a Vec<u8> which contains all the bytes I want to read

Can this be done, or do I have to write a temporary file containing only the CSV?
I tried giving CsvReader a BufReader<File> where I had already advanced the .lines() iterator to where my data starts. But it seems that CsvReader moves the cursor to the start of the stream before reading.
You can pass your Vec<u8> (or anything else that implements AsRef<[u8]>) wrapped in a Cursor to CsvReader::new:
use polars::prelude::*;
use std::io::Cursor;

fn main() {
    let bytes = b"a,b,c\nd,e,f\ng,h,i".to_vec();
    let reader = CsvReader::new(Cursor::new(bytes));
    dbg!(reader.finish().unwrap());
}
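To get from the mixed file to such a Vec<u8>, one option is to collect only the CSV lines with the standard library first. The following is a minimal sketch, not part of polars: extract_csv_bytes and the is_csv predicate are hypothetical names, and the "line contains a comma" test is only an assumption about how the CSV block can be recognised in your format.

```rust
use std::io::{self, BufRead};

/// Collects the first contiguous run of lines accepted by `is_csv`
/// into a newline-terminated byte buffer.
/// (Hypothetical helper; `is_csv` is whatever rule identifies CSV lines.)
fn extract_csv_bytes<R: BufRead>(
    reader: R,
    is_csv: impl Fn(&str) -> bool,
) -> io::Result<Vec<u8>> {
    let mut out = Vec::new();
    let mut in_block = false;
    for line in reader.lines() {
        let line = line?;
        if is_csv(&line) {
            in_block = true;
            out.extend_from_slice(line.as_bytes());
            out.push(b'\n');
        } else if in_block {
            // First non-CSV line after the block: stop reading.
            break;
        }
    }
    Ok(out)
}

fn main() -> io::Result<()> {
    let raw = "some proprietary none csv data\na,b,c\nd,e,f\ng,h,i\neven more proprietary none csv data\n";
    // Assumption for this sketch: CSV lines are the ones containing a comma.
    let bytes = extract_csv_bytes(raw.as_bytes(), |l| l.contains(','))?;
    assert_eq!(bytes, b"a,b,c\nd,e,f\ng,h,i\n".to_vec());
    Ok(())
}
```

The resulting bytes can then be wrapped in a Cursor and handed to CsvReader::new as shown above. For a real file, pass a BufReader<File> instead of raw.as_bytes().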
If your CSV data is separated by newlines from the proprietary data around it, you can also just use with_skip_rows and with_n_rows to skip the leading foreign data and read only the rows that are actual CSV:
let reader = CsvReader::from_path("tmp.data")
    .unwrap()
    .with_skip_rows(1)
    .with_n_rows(Some(2));
tmp.data:
some proprietary none csv data
a,b,c
d,e,f
g,h,i
even more proprietary none csv data
Both variants produce the same DataFrame:
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═════╪═════╪═════╡
│ d   ┆ e   ┆ f   │
│ g   ┆ h   ┆ i   │
└─────┴─────┴─────┘
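If the number of leading junk lines and CSV rows is not known in advance, the values for with_skip_rows and with_n_rows could be computed by scanning the file once. This is a rough sketch using only the standard library; csv_block_bounds and the comma-based is_csv predicate are hypothetical and would need to match how your format marks the CSV section.

```rust
use std::io::{self, BufRead};

/// Returns (rows_to_skip, data_rows) for the first contiguous CSV block,
/// or None if no line is accepted by `is_csv`.
/// (Hypothetical helper; the first CSV line is assumed to be the header.)
fn csv_block_bounds<R: BufRead>(
    reader: R,
    is_csv: impl Fn(&str) -> bool,
) -> io::Result<Option<(usize, usize)>> {
    let mut skip = 0;
    let mut csv_lines = 0;
    for line in reader.lines() {
        let line = line?;
        if is_csv(&line) {
            csv_lines += 1;
        } else if csv_lines == 0 {
            // Still in the leading non-CSV part.
            skip += 1;
        } else {
            // Trailing non-CSV data reached.
            break;
        }
    }
    // One CSV line is the header; the rest are data rows.
    Ok(if csv_lines == 0 {
        None
    } else {
        Some((skip, csv_lines - 1))
    })
}

fn main() -> io::Result<()> {
    let raw = "some proprietary none csv data\na,b,c\nd,e,f\ng,h,i\neven more proprietary none csv data\n";
    let bounds = csv_block_bounds(raw.as_bytes(), |l| l.contains(','))?;
    // Matches the hard-coded with_skip_rows(1) / with_n_rows(Some(2)) above.
    assert_eq!(bounds, Some((1, 2)));
    Ok(())
}
```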
Other than that, I don't think you can use a BufReader without implementing MmapBytesReader for a wrapper of it.