python, pyarrow, apache-arrow

Counting words with pyarrow


I am looking for a performant way to count words with Apache Arrow.

I tried

compute.count(compute.utf8_split_whitespace(table['text']))

but that only returns the length of the ChunkedArray produced by compute.utf8_split_whitespace(table['text']), i.e. the number of rows, not the number of words.


Solution

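    • utf8_split_whitespace returns a ListArray with one list of words per row. You need to flatten it with list_flatten to get a single array of words.
    • count only counts the number of non-null values. Use value_counts if you want the number of occurrences of each word, or count_distinct if you want the number of distinct words.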
    compute.value_counts(compute.list_flatten(compute.utf8_split_whitespace(table["text"])))