
Convert CSV to Apache Arrow in Rust


I need to convert a CSV file to Apache Arrow.

Here is the structure of my CSV file (it has many more rows than this excerpt):

Date,Value,High,Low,Entry
1209920400,1413.50,1413.50,1412.75,1413.00
1209920580,1413.25,1414.00,1413.25,1413.75
1209921240,1413.75,1414.00,1413.25,1413.50
1209921300,1413.25,1413.25,1413.00,1413.00
1209921600,1413.25,1413.25,1412.75,1412.75
1209921780,1413.00,1413.00,1413.00,1413.00
1209921900,1413.00,1413.00,1412.75,1412.75
1209921960,1412.50,1412.50,1412.50,1412.50
1209922800,1412.75,1412.75,1412.75,1412.75
1209923100,1412.75,1413.50,1412.75,1413.25
1209923400,1412.75,1412.75,1412.50,1412.50
1209926940,1413.75,1414.00,1413.50,1413.50
1209930420,1413.75,1414.25,1413.75,1414.00

So far I produced this piece of code to infer the schema and create the arrow file:

use arrow::{
    error::ArrowError,
    csv::ReaderBuilder,
    ipc::writer::FileWriter
};
use std::sync::Arc;
use std::{fs::File};

fn main() -> Result<(), ArrowError> {

    let input = "my_data.csv";
    let output = "my_data.arrow";
    let delimiter: u8 = b',';
    let max_read_records: Option<usize> = Some(100);
    let has_header = true;

    let schema = arrow_csv::reader::infer_schema_from_files(&[input.to_string()], delimiter, max_read_records, has_header).unwrap();

    println!("{:?}", schema);

    let file = File::open(input).unwrap();
    let csv_reader = ReaderBuilder::new(Arc::new(schema)).build(file).unwrap();

    let mut writer = FileWriter::try_new(File::create(output)?, csv_reader.schema().as_ref())?;

    for batch in csv_reader {
        match batch {
            Ok(batch) => writer.write(&batch)?,
            Err(error) => return Err(error),
        }
    }

    let _ = writer.finish();

    Ok(())
}

The code compiles, and then produces 2 outputs.

1- Prints the schema to console:

Schema {
  fields:[
    Field { name: "Date", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} },
    Field { name: "Value", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} },
    Field { name: "High", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} },
    Field { name: "Low", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} },
    Field { name: "Entry", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }
  ],
  metadata: {} 
}

2- Prints an error to console:

Error: ParseError("Error while parsing value Date for column 0 at line 0")

The inferred schema looks correct to me, so I don't understand the error. Why can it infer a correct schema but then fail to parse a value right away?

Whatever I try, I can't get rid of the error, and I don't really understand what is going wrong. I tried reducing my CSV file to fewer rows and/or a simpler schema, and the issue remains the same.


Solution

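  • This happens because ReaderBuilder assumes by default that the file contains only data rows (i.e. no header row). infer_schema_from_files was told about the header through has_header = true, so it skipped that row when inferring the column types. The reader was not told, so it tries to parse the text "Date" as an Int64 and fails on the very first line.

    You can tell the reader that the CSV data has a header row by calling .has_header(true) on the builder. (In newer releases of the arrow crate this builder method appears to have been renamed to .with_header(true), so check the documentation for the version you use.)

    Here is the full code: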

    use arrow::{
        error::ArrowError,
        csv::ReaderBuilder,
        ipc::writer::FileWriter
    };
    use std::sync::Arc;
    use std::fs::File;
    
    fn main() -> Result<(), ArrowError> {
    
        let input = "my_data.csv";
        let output = "my_data.arrow";
        let delimiter: u8 = b',';
        let max_read_records: Option<usize> = Some(100);
        let has_header = true;
    
        let schema = arrow_csv::reader::infer_schema_from_files(&[input.to_string()], delimiter, max_read_records, has_header).unwrap();
    
        println!("{:?}", schema);
    
        let file = File::open(input).unwrap();
        // Tell the reader about the header row too, so it is skipped rather than parsed as data.
        let csv_reader = ReaderBuilder::new(Arc::new(schema)).has_header(has_header).build(file).unwrap();
    
        let mut writer = FileWriter::try_new(File::create(output)?, csv_reader.schema().as_ref())?;
    
        for batch in csv_reader {
            // `batch?` returns early with the ArrowError if reading a batch failed.
            writer.write(&batch?)?;
        }
    
        // Propagate any error from finishing the file instead of silently discarding it.
        writer.finish()?;
    
        Ok(())
    }
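
To see why the header row trips up the parser, here is a minimal sketch that uses only the standard library. It is not how the arrow crate is implemented; parse_first_column is a hypothetical helper made up for this illustration, and the error text merely imitates the message from the question. It treats the first column as an integer and optionally skips the first line, which is roughly the effect of the has_header setting.

```rust
/// Parse the first column of some CSV text as i64.
/// Hypothetical helper for illustration only; not part of the arrow crate.
fn parse_first_column(csv: &str, has_header: bool) -> Result<Vec<i64>, String> {
    // Skip one line when the caller says the data starts with a header row.
    let skip = if has_header { 1 } else { 0 };
    csv.lines()
        .enumerate()
        .skip(skip)
        .filter(|(_, line)| !line.trim().is_empty())
        .map(|(line_no, line)| {
            // Take the text before the first comma as the first cell.
            let cell = line.split(',').next().unwrap_or("").trim();
            cell.parse::<i64>().map_err(|_| {
                format!(
                    "Error while parsing value {} for column 0 at line {}",
                    cell, line_no
                )
            })
        })
        // Collecting into a Result stops at the first cell that fails to parse.
        .collect()
}

fn main() {
    let csv = "Date,Value\n1209920400,1413.50\n1209920580,1413.25\n";

    // Header not skipped: the text "Date" is handed to the integer parser and fails.
    println!("{:?}", parse_first_column(csv, false));

    // Header skipped: only the data rows are parsed.
    println!("{:?}", parse_first_column(csv, true));
}
```

The first call returns an Err that mentions the value Date at line 0, and the second returns the two timestamps. Schema inference and reading are configured separately, so the header flag has to be set in both places.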