I am new to using Polars for Python. I take a dataframe as input, convert each column to a numpy array, reassign values at certain indices in these arrays, delete specific rows from all of them, and then convert each array back to a dataframe and concatenate those dataframes horizontally with pl.concat. I know the operation works because I can print the dataframe in the terminal. However, when I try to write outputDF to a csv file, I get the error below. Any help fixing the error would be greatly appreciated.
P.S.: Here is the link to the sample input data: https://mega.nz/file/u0Z0GS6b#uSD6PDqyHXIEfWDLNQR2VgaqBcBSgeLdSL8lSjTSq3M
thread '<unnamed>' panicked at 'should not be here', /Users/runner/work/polars/polars/polars/polars-core/src/chunked_array/ops/any_value.rs:103:32
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/Users/username/Desktop/my_code/segmentation/gpt3_embeddings.py", line 89, in <module>
    result = process_data(df)
  File "/Users/username/Desktop/my_code/segmentation/gpt3_embeddings.py", line 80, in process_data
    outputDF.write_csv("PostProcessing_output.csv")
  File "/Users/username/opt/anaconda3/envs/segmentation/lib/python3.9/site-packages/polars/internals/dataframe/frame.py", line 2033, in write_csv
    self._df.write_csv(
pyo3_runtime.PanicException: should not be here
My code looks as follows:
import numpy as np
import polars as pl

# PROCESSING THE TRANSCRIBED TEXT:
def process_data(inputDF):
    # Convert relevant columns in the dataframe to numpy arrays
    phraseArray = inputDF["phrase"].to_numpy()
    actorArray = inputDF["actor"].to_numpy()
    startTimeArray = inputDF["start_time"].to_numpy()
    endTimeArray = inputDF["end_time"].to_numpy()
    # get indicators marking where two consecutive rows have the same actor
    speaker_change = inputDF.select(pl.col("actor").diff())
    speaker_change = speaker_change.rename({"actor": "change"})
    inputDF = inputDF.with_column(speaker_change.to_series(0))
    # indices where diff() gave 0
    zero_indices = inputDF.filter(pl.col("change") == 0).select("sentence_index").to_series().to_list()
    if len(zero_indices) > 0:
        for index in reversed(zero_indices):
            extract_phrase = phraseArray[index]
            extract_endTime = endTimeArray[index]
            joined_phrases = phraseArray[index - 1] + extract_phrase
            phraseArray[index - 1] = joined_phrases
            endTimeArray[index - 1] = extract_endTime
            phraseArray = np.delete(phraseArray, index)
            actorArray = np.delete(actorArray, index)
            startTimeArray = np.delete(startTimeArray, index)
            endTimeArray = np.delete(endTimeArray, index)
        outputDF = pl.concat(
            [
                pl.DataFrame(actorArray, columns=["actor"], orient="col"),
                pl.DataFrame(phraseArray, columns=["phrase"], orient="col"),
                pl.DataFrame(startTimeArray, columns=["start_time"], orient="col"),
                pl.DataFrame(endTimeArray, columns=["end_time"], orient="col"),
            ],
            rechunk=True,
            how="horizontal",
        )
        outputDF = outputDF.with_row_index(name="sentence_index")
        outputDF = outputDF[["sentence_index", "actor", "phrase", "start_time", "end_time"]]
        print(outputDF[342:348])
        outputDF.write_csv("PostProcessing_output.csv")
        return outputDF
    else:
        return inputDF
I tried using df.hstack instead of pl.concat, but that did not work. I also tried rechunking the dataframe, which did not help either. I think the issue has to do with converting the columns into numpy arrays and then back into dataframes, but I am not sure.
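For what it's worth, here is a minimal sketch of what I suspect is happening (a hypothetical toy example, not my actual data): to_numpy() on a string column gives an object-dtype array, and depending on the Polars version, rebuilding a DataFrame from an object array can produce an Object column that write_csv cannot serialize.

import numpy as np
import polars as pl

# Hypothetical stand-in for the "phrase" column:
phrases = pl.Series("phrase", ["Thank you.", "And so when we get ready..."])

arr = phrases.to_numpy()
print(arr.dtype)  # object -- numpy has no native variable-length string dtype

# Depending on the Polars version, an object array may come back as an
# Object column rather than Utf8/String:
rebuilt = pl.DataFrame({"phrase": arr})
print(rebuilt.dtypes)

# Casting to a plain numpy string dtype first removes the ambiguity:
rebuilt = pl.DataFrame({"phrase": arr.astype(str)})
print(rebuilt.dtypes)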
It looks like you're trying to group consecutive rows based on the actor column and "combine" them? Names for this include "streaks" and "runs". Polars has .rle_id() (run-length encoding) to assign an id to each run:
df.head(8).with_columns(pl.col.actor.rle_id().alias("id"))
shape: (8, 6)
┌────────────────┬───────┬─────────────────────────────────┬────────────────┬────────────────┬─────┐
│ sentence_index ┆ actor ┆ phrase ┆ start_time ┆ end_time ┆ id │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str ┆ str ┆ str ┆ u32 │
╞════════════════╪═══════╪═════════════════════════════════╪════════════════╪════════════════╪═════╡
│ 0 ┆ 1 ┆ companies. So I don't have an… ┆ 0:00:00 ┆ 0:00:28.125000 ┆ 0 │
│ 1 ┆ 0 ┆ Oh yeah, that's fine. ┆ 0:00:28.125000 ┆ 0:00:29.625000 ┆ 1 │
│ 2 ┆ 1 ┆ Okay, good. And so I have a f… ┆ 0:00:29.625000 ┆ 0:00:38.625000 ┆ 2 │
│ 3 ┆ 0 ┆ I'm in the parking lot, yeah?… ┆ 0:00:38.625000 ┆ 0:00:41.375000 ┆ 3 │
│ 4 ┆ 1 ┆ Thank you. ┆ 0:00:41.375000 ┆ 0:00:42.125000 ┆ 4 │ # <-
│ 5 ┆ 1 ┆ And so when we get ready for … ┆ 0:00:42.375000 ┆ 0:01:44.125000 ┆ 4 │ # <-
│ 6 ┆ 0 ┆ Yeah, let's just get started. ┆ 0:01:44.125000 ┆ 0:01:45.375000 ┆ 5 │
│ 7 ┆ 1 ┆ Okay, let's do it. So first o… ┆ 0:01:45.375000 ┆ 0:01:52.625000 ┆ 6 │
└────────────────┴───────┴─────────────────────────────────┴────────────────┴────────────────┴─────┘
Rows 4 and 5 end up with the same id. This id can then be used with .group_by().
It looks like you want to aggregate: .last() the end_time, .str.join() the phrase values, and .first() each remaining column.
(
    df
    .group_by(pl.col.actor.rle_id().alias("id"))
    .agg(
        pl.col("actor").first(),
        pl.col("sentence_index").first(),
        pl.col("phrase").str.join(),
        pl.col("start_time").first(),
        pl.col("end_time").last()
    )
    .sort("sentence_index")
    .drop("id", "sentence_index")
    .with_row_index("sentence_index")
)
shape: (432, 5)
┌────────────────┬───────┬─────────────────────────────────┬────────────────┬────────────────┐
│ sentence_index ┆ actor ┆ phrase ┆ start_time ┆ end_time │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ str ┆ str ┆ str │
╞════════════════╪═══════╪═════════════════════════════════╪════════════════╪════════════════╡
│ 0 ┆ 1 ┆ companies. So I don't have an… ┆ 0:00:00 ┆ 0:00:28.125000 │
│ 1 ┆ 0 ┆ Oh yeah, that's fine. ┆ 0:00:28.125000 ┆ 0:00:29.625000 │
│ 2 ┆ 1 ┆ Okay, good. And so I have a f… ┆ 0:00:29.625000 ┆ 0:00:38.625000 │
│ 3 ┆ 0 ┆ I'm in the parking lot, yeah?… ┆ 0:00:38.625000 ┆ 0:00:41.375000 │
│ 4 ┆ 1 ┆ Thank you. And so when we get… ┆ 0:00:41.375000 ┆ 0:01:44.125000 │
│ … ┆ … ┆ … ┆ … ┆ … │
│ 427 ┆ 0 ┆ Okay, so I usually during Chr… ┆ 0:50:30.545000 ┆ 0:51:03.325000 │
│ 428 ┆ 1 ┆ So it's really about the flav… ┆ 0:51:03.325000 ┆ 0:51:09.575000 │
│ 429 ┆ 0 ┆ Exactly. There's a lot of var… ┆ 0:51:09.575000 ┆ 0:51:12.825000 │
│ 430 ┆ 1 ┆ Okay. Okay. Great. All right,… ┆ 0:51:12.825000 ┆ 0:51:34.825000 │
│ 431 ┆ 0 ┆ Okay. Okay. Thank you. That's… ┆ 0:51:34.825000 ┆ 0:51:38.880000 │
└────────────────┴───────┴─────────────────────────────────┴────────────────┴────────────────┘
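Because this approach stays inside Polars from start to finish (no numpy round-trip, so no object-dtype columns), writing the result out should not hit the original panic. Assuming the expression above is bound to a variable, say out:

out.write_csv("PostProcessing_output.csv")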