I am looking for a performant way to count words with Apache Arrow.
I tried:
compute.count(compute.utf8_split_whitespace(table['text']))
but that only returns the length of the ChunkedArray produced by
compute.utf8_split_whitespace(table['text']).
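count counts the number of non-null values, so applied to the split result it counts the lists (one per row), not the words inside them. You need either value_counts or count_distinct, depending on what you want to do, applied to the flattened lists:
compute.value_counts(compute.list_flatten(compute.utf8_split_whitespace(table["text"])))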