In my Rust project I am loading documents from MongoDB and deserializing them into serde_json Values:
match cursor.deserialize_current() {
    Ok(d) => {
        let doc = serde_json::to_value(&d).unwrap();
        doc_vec.push(doc);
    }
    Err(e) => eprintln!("failed to deserialize document: {e}"),
}
After that I create an Arrow RecordBatch using the JSON decoder:
let mut decoder = ReaderBuilder::new(schema.clone()).build_decoder().unwrap();
if !doc_vec.is_empty() {
    decoder.serialize(&doc_vec).unwrap();
    let batch = decoder.flush().unwrap().unwrap();
    // ...
}
My schema is:
let schema = Schema::new(vec![
    Field::new("Amount", DataType::Float32, false),
    Field::new(
        "Country",
        DataType::Dictionary(Box::new(DataType::UInt16), Box::new(DataType::Utf8)),
        false,
    ),
]);
The code fails with:
called `Result::unwrap()` on an `Err` value: NotYetImplemented("Support for Dictionary(UInt16, Utf8) in JSON reader")
I want the country to be one-hot encoded when I send it to a pyarrow client via Arrow Flight, so it can be converted to a Pandas DataFrame afterwards.
Can you guide me on how to continue from here? I'm quite new to all of these technologies.
A workaround would be to read the column as Utf8 and then use the cast kernel to convert it to dictionary encoding.
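For example, here is a minimal sketch of that workaround (the helper name decode_and_dictionary_encode and the error handling are only illustrative; it assumes an arrow version that has ReaderBuilder::build_decoder and Decoder::serialize, which your snippet already uses):

use std::sync::Arc;

use arrow::compute::cast;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::ArrowError;
use arrow::json::ReaderBuilder;
use arrow::record_batch::RecordBatch;

// Decode the JSON documents with Country as plain Utf8, then cast that
// column to Dictionary(UInt16, Utf8) before rebuilding the batch.
fn decode_and_dictionary_encode(
    doc_vec: &[serde_json::Value],
) -> Result<RecordBatch, ArrowError> {
    // A schema the JSON decoder can handle: Country as plain Utf8.
    let read_schema = Arc::new(Schema::new(vec![
        Field::new("Amount", DataType::Float32, false),
        Field::new("Country", DataType::Utf8, false),
    ]));

    let mut decoder = ReaderBuilder::new(read_schema).build_decoder()?;
    decoder.serialize(doc_vec)?;
    let batch = decoder.flush()?.expect("no rows were decoded");

    // Dictionary-encode the Country column with the cast kernel.
    let dict_type =
        DataType::Dictionary(Box::new(DataType::UInt16), Box::new(DataType::Utf8));
    let country_dict = cast(batch.column(1), &dict_type)?;

    // Rebuild the batch with the schema you actually want to send over Flight.
    let out_schema = Arc::new(Schema::new(vec![
        Field::new("Amount", DataType::Float32, false),
        Field::new("Country", dict_type, false),
    ]));
    RecordBatch::try_new(out_schema, vec![batch.column(0).clone(), country_dict])
}

On the pyarrow side, a dictionary-encoded column comes back as a pandas Categorical when you call to_pandas().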
From my understanding, though, one-hot encoding is different from dictionary encoding. You could get one-hot encoded boolean columns by using the comparison kernels, comparing against the distinct "country" values, as sketched below.
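A rough sketch of that approach (assuming a recent arrow release that provides the Datum-based cmp::eq kernel; the helper name one_hot_countries and the Country_<value> column naming are just illustrative):

use std::collections::BTreeSet;
use std::sync::Arc;

use arrow::array::{ArrayRef, BooleanArray, Scalar, StringArray};
use arrow::compute::kernels::cmp::eq;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

// Build one boolean column per distinct country by comparing the Utf8
// column against each value with the `eq` comparison kernel.
fn one_hot_countries(countries: &StringArray) -> Result<RecordBatch, ArrowError> {
    // Distinct country values; a BTreeSet gives a stable column order.
    let distinct: BTreeSet<&str> = countries.iter().flatten().collect();

    let mut fields = Vec::new();
    let mut columns: Vec<ArrayRef> = Vec::new();
    for country in distinct {
        // `eq` against a length-1 Scalar yields a BooleanArray mask.
        let scalar = Scalar::new(StringArray::from(vec![country]));
        let mask: BooleanArray = eq(countries, &scalar)?;
        fields.push(Field::new(format!("Country_{country}"), DataType::Boolean, true));
        columns.push(Arc::new(mask) as ArrayRef);
    }

    RecordBatch::try_new(Arc::new(Schema::new(fields)), columns)
}

Depending on your setup, it may be simpler to send the dictionary-encoded column as-is and do the one-hot step on the Python side, for example with pandas.get_dummies on the resulting DataFrame column.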