
How to read compressed TSV files (*.gtf.gz) with rust-polars?


Complete Rust beginner here, coming from Python. I would like to use rust-polars to read a compressed GTF (*.gtf.gz) file:

    let schema = Arc::new(Schema::new(vec![
        Field::new("contigName", DataType::Categorical),
        Field::new("source", DataType::Utf8),
        Field::new("feature", DataType::Categorical),
        Field::new("start", DataType::Int64),
        Field::new("end", DataType::Int64),
        Field::new("score", DataType::Float32),
        Field::new("strand", DataType::Categorical),
        Field::new("frame", DataType::Categorical),
        Field::new("attribute", DataType::Utf8),
    ]));

    let mut df = CsvReader::from_path(r).unwrap()
        .with_delimiter(b'\t')
        .with_schema(&schema)
        .with_comment_char(Some(b'#'))
        .with_n_threads(Some(1)) // comment for multithreading
        .with_encoding(CsvEncoding::LossyUtf8)
        .has_header(false)
        .finish()?;

    let test = df.head(Some(10));
    println!("{}", test);

However, I end up with a number of issues:

  • How to tell Polars that the file is compressed?
    I tried passing io::BufReader::new(GzDecoder::new(f)) instead of the file, but that fails.
  • How to parse Categorical columns?
  • How to handle possibly missing or additional columns?
  • How to read a file which has '#' as header and '##' as comment?

Solution

  • Hi, there are a few questions here at once. I will try to answer the ones I can.

    How to tell Polars that the file is compressed?

    You don't have to. You only have to compile polars with the decompress or decompress-fast feature flag. (The first one is Rust native; the latter needs a C compiler.)
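Since nothing in the reader call marks the input as compressed, it can help when debugging a failed read to confirm that the file really contains gzip data. Below is a minimal std-only sketch, assuming the standard gzip magic bytes 0x1f 0x8b from RFC 1952. The helper name `looks_like_gzip` is hypothetical and not part of Polars.

```rust
use std::io::{ErrorKind, Read};

/// Hypothetical helper (not part of Polars): returns true when the stream
/// starts with the two gzip magic bytes 0x1f 0x8b defined in RFC 1952.
fn looks_like_gzip<R: Read>(mut reader: R) -> std::io::Result<bool> {
    let mut magic = [0u8; 2];
    match reader.read_exact(&mut magic) {
        Ok(()) => Ok(magic == [0x1f, 0x8b]),
        // Fewer than two bytes available: this cannot be a gzip stream.
        Err(e) if e.kind() == ErrorKind::UnexpectedEof => Ok(false),
        Err(e) => Err(e),
    }
}
```

It could be called as `looks_like_gzip(std::fs::File::open(path)?)?` before handing the same path to the reader.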

    How to parse Categorical columns?

    You set the schema to DataType::Categorical, or you first parse as Utf8 and then cast later.

    df.may_apply("some_utf8_column", |s| s.cast(&DataType::Categorical));
    

    How to handle possibly missing or additional columns?

    I don't know what you mean by "handle".

    How to read a file which has '#' as header and '##' as comment?

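    Polars currently only allows a single comment char. You can set this comment char and every line that starts with this character will be ignored. With b'#' as the comment char, that covers both the '#' header line and the '##' comment lines, so the column names have to come from the schema.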
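If the '#' header line should be kept rather than skipped, one possible workaround is to preprocess the text before parsing. Below is a minimal std-only sketch, assuming the input is already decompressed and valid UTF-8. The helper name `strip_gtf_comments` is hypothetical and not part of Polars.

```rust
use std::io::{BufRead, BufReader, Read};

/// Hypothetical helper (not part of Polars): drops every line starting with
/// "##" and strips the leading '#' from a header line. The remaining text is
/// returned with '\n' line endings.
fn strip_gtf_comments<R: Read>(input: R) -> std::io::Result<String> {
    let mut out = String::new();
    for line in BufReader::new(input).lines() {
        let line = line?; // fails on invalid UTF-8 (assumption: input is UTF-8)
        if line.starts_with("##") {
            continue; // comment line: skip entirely
        }
        // Header line: keep the column names, drop the '#' marker.
        // Data lines have no '#' prefix and pass through unchanged.
        let kept = line.strip_prefix('#').unwrap_or(line.as_str());
        out.push_str(kept);
        out.push('\n');
    }
    Ok(out)
}
```

The cleaned text could then be given to a reader that accepts in-memory input, for example through `std::io::Cursor`, with header parsing enabled and no comment char set. This reads the whole file into memory, so it suits moderately sized files.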