I need to convert a CSV file to Apache Arrow.
Here is the structure of my CSV file (it has many more rows than this excerpt):
Date,Value,High,Low,Entry
1209920400,1413.50,1413.50,1412.75,1413.00
1209920580,1413.25,1414.00,1413.25,1413.75
1209921240,1413.75,1414.00,1413.25,1413.50
1209921300,1413.25,1413.25,1413.00,1413.00
1209921600,1413.25,1413.25,1412.75,1412.75
1209921780,1413.00,1413.00,1413.00,1413.00
1209921900,1413.00,1413.00,1412.75,1412.75
1209921960,1412.50,1412.50,1412.50,1412.50
1209922800,1412.75,1412.75,1412.75,1412.75
1209923100,1412.75,1413.50,1412.75,1413.25
1209923400,1412.75,1412.75,1412.50,1412.50
1209926940,1413.75,1414.00,1413.50,1413.50
1209930420,1413.75,1414.25,1413.75,1414.00
So far I have written this code to infer the schema and create the Arrow file:
use arrow::{
    error::ArrowError,
    csv::ReaderBuilder,
    ipc::writer::FileWriter
};
use std::sync::Arc;
use std::{fs::File};

fn main() -> Result<(), ArrowError> {
    let input = "my_data.csv";
    let output = "my_data.arrow";
    let delimiter: u8 = b',';
    let max_read_records: Option<usize> = Some(100);
    let has_header = true;
    let schema = arrow_csv::reader::infer_schema_from_files(&[input.to_string()], delimiter, max_read_records, has_header).unwrap();
    println!("{:?}", schema);
    let file = File::open(input).unwrap();
    let csv_reader = ReaderBuilder::new(Arc::new(schema)).build(file).unwrap();
    let mut writer = FileWriter::try_new(File::create(output)?, csv_reader.schema().as_ref())?;
    for batch in csv_reader {
        match batch {
            Ok(batch) => writer.write(&batch)?,
            Err(error) => return Err(error),
        }
    }
    let _ = writer.finish();
    Ok(())
}
The code compiles, and then produces 2 outputs.
1- Prints the schema to console:
Schema {
fields:[
Field { name: "Date", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} },
Field { name: "Value", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} },
Field { name: "High", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} },
Field { name: "Low", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} },
Field { name: "Entry", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }
],
metadata: {}
}
2- Prints an error to console:
Error: ParseError("Error while parsing value Date for column 0 at line 0")
At first glance the inferred schema looks correct to me, so I don't understand the error. Why can it infer a correct schema but then fail to parse a value right away?
Whatever I try, I cannot get rid of the error, and I don't understand what is going wrong. I tried reducing my CSV file to something smaller and/or with a simpler schema, and the issue remains the same.
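This happens because ReaderBuilder by default expects the CSV data only (i.e. without a header row), so it tries to parse the header text "Date" as an Int64 value, which is exactly what the error message reports. The has_header flag you pass to infer_schema_from_files only affects schema inference; it is not carried over to the reader.
You can tell the reader that the CSV data has a header row by calling .has_header(true) on the builder.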
Here is the full code:
use arrow::{
    error::ArrowError,
    csv::ReaderBuilder,
    ipc::writer::FileWriter
};
use std::sync::Arc;
use std::fs::File;

fn main() -> Result<(), ArrowError> {
    let input = "my_data.csv";
    let output = "my_data.arrow";
    let delimiter: u8 = b',';
    let max_read_records: Option<usize> = Some(100);
    let has_header = true;

    // Infer the schema from the first rows, skipping the header row.
    let schema = arrow_csv::reader::infer_schema_from_files(&[input.to_string()], delimiter, max_read_records, has_header)?;
    println!("{:?}", schema);

    // The reader must also be told about the header row; it does not
    // inherit this setting from the schema inference step.
    let file = File::open(input)?;
    let csv_reader = ReaderBuilder::new(Arc::new(schema)).has_header(true).build(file)?;

    let mut writer = FileWriter::try_new(File::create(output)?, csv_reader.schema().as_ref())?;
    for batch in csv_reader {
        // Propagate any parse error instead of silently dropping the batch.
        writer.write(&batch?)?;
    }
    // Propagate errors from writing the file footer as well.
    writer.finish()?;
    Ok(())
}
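To see why the schema can be inferred correctly while reading still fails, here is a minimal standard-library sketch of the mismatch. It does not use the arrow crate at all: parse_first_column is a hypothetical helper written only for this illustration, and the error text merely imitates the message from the question. It assumes the reader simply parses every line it is given, which is roughly what happens when the header is not skipped.

```rust
// Hypothetical helper (not an arrow API): parses the first column of each
// line as i64, optionally skipping the first line as a header row.
fn parse_first_column(csv: &str, has_header: bool) -> Result<Vec<i64>, String> {
    let skip = if has_header { 1 } else { 0 };
    csv.lines()
        .enumerate()
        .skip(skip)
        .map(|(line_no, line)| {
            // Take the text before the first comma.
            let cell = line.split(',').next().unwrap_or("");
            cell.parse::<i64>().map_err(|_| {
                format!(
                    "Error while parsing value {} for column 0 at line {}",
                    cell, line_no
                )
            })
        })
        .collect()
}

fn main() {
    let csv = "Date,Value\n1209920400,1413.50\n1209920580,1413.25\n";

    // Header not skipped: the text "Date" is parsed as an integer and fails.
    // Prints: Err("Error while parsing value Date for column 0 at line 0")
    println!("{:?}", parse_first_column(csv, false));

    // Header skipped: only numeric cells remain.
    // Prints: Ok([1209920400, 1209920580])
    println!("{:?}", parse_first_column(csv, true));
}
```

The same data and the same column type give two different outcomes depending only on whether the first line is skipped, which is why the inference step and the reader each need to be told about the header.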