Complete Rust beginner here, coming from Python. I would like to use rust-polars to read a compressed GTF (*.gtf.gz) file:
let schema = Arc::new(Schema::new(vec![
    Field::new("contigName", DataType::Categorical),
    Field::new("source", DataType::Utf8),
    Field::new("feature", DataType::Categorical),
    Field::new("start", DataType::Int64),
    Field::new("end", DataType::Int64),
    Field::new("score", DataType::Float32),
    Field::new("strand", DataType::Categorical),
    Field::new("frame", DataType::Categorical),
    Field::new("attribute", DataType::Utf8),
]));
let mut df = CsvReader::from_path(r)?
    .with_delimiter(b'\t')
    .with_schema(&schema)
    .with_comment_char(Some(b'#'))
    .with_n_threads(Some(1)) // comment out for multithreading
    .with_encoding(CsvEncoding::LossyUtf8)
    .has_header(false)
    .finish()?;
let test = df.head(Some(10));
println!("{}", test);
However, I end up with a number of issues. For example, to handle the gzip compression I tried passing

io::BufReader::new(GzDecoder::new(f))

instead of the file, but that fails.

Hi, there are a few questions here at once. I will try to answer the ones I can.
How to tell Polars that the file is compressed?
You don't have to. You only have to compile Polars with the decompress or decompress-fast feature flag. (The first one is Rust-native; the latter needs a C compiler.)
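For example, in Cargo.toml (the version is a placeholder; pin whichever release you are actually using):

```toml
[dependencies]
# "decompress" is the pure-Rust variant; use "decompress-fast"
# instead if you have a C compiler available.
polars = { version = "*", features = ["decompress"] }
```

With the feature enabled you can keep passing the *.gtf.gz path directly; no GzDecoder wrapping is needed.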
How to parse Categorical columns
You set the schema to DataType::Categorical, or you first parse as Utf8 and then cast later:
df.may_apply("some_utf8_column", |s| s.cast(&DataType::Categorical));
How to handle possibly missing or additional columns?
I'm not sure what you mean by "handle" here.
How to read a file which has '#' as header and '##' as comment?
Polars currently only allows a single comment character. You can set this comment character, and every line that starts with it will be ignored. Note that lines starting with '##' also start with '#', so with the comment char set to b'#' both the header line and the '##' comment lines will be skipped; that is why your snippet combines it with has_header(false) and an explicit schema.
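If you need to keep the single-'#' header line while dropping only the '##' comment lines, one workaround is to pre-filter the byte stream yourself before handing it to the CSV reader. A minimal sketch using only the standard library (the function name strip_double_hash_comments is made up for illustration):

```rust
use std::io::{BufRead, BufReader, Cursor, Read};

/// Drop lines starting with "##" but keep everything else,
/// including a header line that starts with a single '#'.
fn strip_double_hash_comments<R: Read>(input: R) -> Cursor<Vec<u8>> {
    let mut out = Vec::new();
    for line in BufReader::new(input).lines() {
        let line = line.expect("valid UTF-8 line");
        if !line.starts_with("##") {
            out.extend_from_slice(line.as_bytes());
            out.push(b'\n');
        }
    }
    // Cursor implements Read, so the result can be passed on to
    // any reader that accepts a generic `Read` source.
    Cursor::new(out)
}

fn main() {
    let gtf = "##gff-version 2\n#contigName\tsource\nchr1\thavana\n";
    let filtered = strip_double_hash_comments(Cursor::new(gtf.as_bytes()));
    let text = String::from_utf8(filtered.into_inner()).unwrap();
    // prints the '#' header line and the data line; the "##" line is gone
    print!("{}", text);
}
```

The same filtered cursor could then be fed to a reader that accepts an in-memory source instead of a file path.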