Tags: python, python-polars

Polars Python write_csv error: "pyo3_runtime.PanicException: should not be here"


I am new to using Polars for Python. I am taking a dataframe as input, converting each column to a numpy array, reassigning values at certain indices in these arrays, deleting specific rows from all of the arrays, and then converting each array back to a dataframe and concatenating the dataframes horizontally with pl.concat. I know the operation is working because I can print the resulting dataframe in the terminal. However, when I try to write outputDF to a CSV file, I get the error below. Any help fixing the error would be greatly appreciated.

P.S.: Here is the link to the sample input data: https://mega.nz/file/u0Z0GS6b#uSD6PDqyHXIEfWDLNQR2VgaqBcBSgeLdSL8lSjTSq3M

thread '<unnamed>' panicked at 'should not be here', /Users/runner/work/polars/polars/polars/polars-core/src/chunked_array/ops/any_value.rs:103:32
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/Users/username/Desktop/my_code/segmentation/gpt3_embeddings.py", line 89, in <module>
    result = process_data(df)
  File "/Users/username/Desktop/my_code/segmentation/gpt3_embeddings.py", line 80, in process_data
    outputDF.write_csv("PostProcessing_output.csv")
  File "/Users/username/opt/anaconda3/envs/segmentation/lib/python3.9/site-packages/polars/internals/dataframe/frame.py", line 2033, in write_csv
    self._df.write_csv(
pyo3_runtime.PanicException: should not be here

My code looks as follows:

import numpy as np
import polars as pl

# PROCESSING THE TRANSCRIBED TEXT:
def process_data(inputDF):
    # Convert relevant columns in the dataframe to numpy arrays
    phraseArray = inputDF["phrase"].to_numpy()
    actorArray = inputDF["actor"].to_numpy()
    startTimeArray = inputDF["start_time"].to_numpy()
    endTimeArray = inputDF["end_time"].to_numpy()

    # get indicators marking where two consecutive rows have the same actor
    speaker_change = inputDF.select(pl.col("actor").diff())
    speaker_change = speaker_change.rename({"actor": "change"})
    inputDF = inputDF.with_column(speaker_change.to_series(0))
    zero_indices = inputDF.filter(pl.col("change") == 0).select("sentence_index").to_series().to_list() # indices where diff() gave 0
    if len(zero_indices) > 0:
        for index in reversed(zero_indices):
            extract_phrase = phraseArray[index]
            extract_endTime = endTimeArray[index]
            joined_phrases = phraseArray[index - 1] + extract_phrase
            phraseArray[index - 1] = joined_phrases
            endTimeArray[index - 1] = extract_endTime
            phraseArray = np.delete(phraseArray, index)
            actorArray = np.delete(actorArray, index)
            startTimeArray = np.delete(startTimeArray, index)
            endTimeArray = np.delete(endTimeArray, index)
        outputDF = pl.concat(
            [
                pl.DataFrame(actorArray, columns=["actor"], orient="col"),
                pl.DataFrame(phraseArray, columns=["phrase"], orient="col"),
                pl.DataFrame(startTimeArray, columns=["start_time"], orient="col"),
                pl.DataFrame(endTimeArray, columns=["end_time"], orient="col"),
            ],
            rechunk=True,
            how="horizontal",
        )
        outputDF = outputDF.with_row_index(name="sentence_index")
        outputDF = outputDF[["sentence_index", "actor", "phrase", "start_time", "end_time"]]
        print(outputDF[342:348])
        outputDF.write_csv("PostProcessing_output.csv")
        return outputDF
    else:
        return inputDF

I tried using df.hstack instead of concat, but that did not work. I also tried rechunking the dataframe, but that did not help either. I think the issue has to do with converting the columns into numpy arrays and then converting them back into dataframes, but I am not sure.
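For what it's worth, to_numpy() on a string column comes back as dtype=object, so one workaround I am considering (an untested sketch, not a confirmed fix) is casting back to a plain string dtype before rebuilding, so that Polars infers a proper Utf8 column rather than an Object one:

print(phraseArray.dtype)  # dtype('O') for a string column round-tripped through numpy

# untested sketch: astype(str) yields a numpy unicode array, which Polars reads as Utf8
phraseDF = pl.DataFrame(phraseArray.astype(str), columns=["phrase"], orient="col")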


Solution

  • It looks like you're trying to group consecutive rows based on the actor column and "combine" them?

    Names for these include: "streaks" and "runs".

    Polars has .rle_id() (run-length encoding) to assign an id to each run.

    df.head(8).with_columns(pl.col.actor.rle_id().alias("id"))
    
    shape: (8, 6)
    ┌────────────────┬───────┬─────────────────────────────────┬────────────────┬────────────────┬─────┐
    │ sentence_index ┆ actor ┆ phrase                          ┆ start_time     ┆ end_time       ┆ id  │
    │ ---            ┆ ---   ┆ ---                             ┆ ---            ┆ ---            ┆ --- │
    │ i64            ┆ i64   ┆ str                             ┆ str            ┆ str            ┆ u32 │
    ╞════════════════╪═══════╪═════════════════════════════════╪════════════════╪════════════════╪═════╡
    │ 0              ┆ 1     ┆  companies. So I don't have an… ┆ 0:00:00        ┆ 0:00:28.125000 ┆ 0   │
    │ 1              ┆ 0     ┆  Oh yeah, that's fine.          ┆ 0:00:28.125000 ┆ 0:00:29.625000 ┆ 1   │
    │ 2              ┆ 1     ┆  Okay, good. And so I have a f… ┆ 0:00:29.625000 ┆ 0:00:38.625000 ┆ 2   │
    │ 3              ┆ 0     ┆  I'm in the parking lot, yeah?… ┆ 0:00:38.625000 ┆ 0:00:41.375000 ┆ 3   │
    │ 4              ┆ 1     ┆  Thank you.                     ┆ 0:00:41.375000 ┆ 0:00:42.125000 ┆ 4   │ # <-
    │ 5              ┆ 1     ┆  And so when we get ready for … ┆ 0:00:42.375000 ┆ 0:01:44.125000 ┆ 4   │ # <-
    │ 6              ┆ 0     ┆  Yeah, let's just get started.  ┆ 0:01:44.125000 ┆ 0:01:45.375000 ┆ 5   │
    │ 7              ┆ 1     ┆  Okay, let's do it. So first o… ┆ 0:01:45.375000 ┆ 0:01:52.625000 ┆ 6   │
    └────────────────┴───────┴─────────────────────────────────┴────────────────┴────────────────┴─────┘
    
    • sentence_index 4 and 5 end up with the same id

    This id can then be used with .group_by().
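
    For intuition, here is what .rle_id() does on a bare series (toy values, purely for illustration):

    pl.Series("actor", [1, 0, 1, 1, 0]).rle_id()
    # -> [0, 1, 2, 2, 3]; the two consecutive 1s form one run and share id 2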

    It looks like you want to aggregate:

    • the .last() end_time
    • .str.join() the phrase values
    • the .first() of each remaining column
    (
       df
       .group_by(pl.col.actor.rle_id().alias("id"))
       .agg(
          pl.col("actor").first(),
          pl.col("sentence_index").first(),
          pl.col("phrase").str.join(),
          pl.col("start_time").first(),
          pl.col("end_time").last()
       )
       .sort("sentence_index")
       .drop("id", "sentence_index")
       .with_row_index("sentence_index")
    )
    
    shape: (432, 5)
    ┌────────────────┬───────┬─────────────────────────────────┬────────────────┬────────────────┐
    │ sentence_index ┆ actor ┆ phrase                          ┆ start_time     ┆ end_time       │
    │ ---            ┆ ---   ┆ ---                             ┆ ---            ┆ ---            │
    │ u32            ┆ i64   ┆ str                             ┆ str            ┆ str            │
    ╞════════════════╪═══════╪═════════════════════════════════╪════════════════╪════════════════╡
    │ 0              ┆ 1     ┆  companies. So I don't have an… ┆ 0:00:00        ┆ 0:00:28.125000 │
    │ 1              ┆ 0     ┆  Oh yeah, that's fine.          ┆ 0:00:28.125000 ┆ 0:00:29.625000 │
    │ 2              ┆ 1     ┆  Okay, good. And so I have a f… ┆ 0:00:29.625000 ┆ 0:00:38.625000 │
    │ 3              ┆ 0     ┆  I'm in the parking lot, yeah?… ┆ 0:00:38.625000 ┆ 0:00:41.375000 │
    │ 4              ┆ 1     ┆  Thank you. And so when we get… ┆ 0:00:41.375000 ┆ 0:01:44.125000 │
    │ …              ┆ …     ┆ …                               ┆ …              ┆ …              │
    │ 427            ┆ 0     ┆  Okay, so I usually during Chr… ┆ 0:50:30.545000 ┆ 0:51:03.325000 │
    │ 428            ┆ 1     ┆  So it's really about the flav… ┆ 0:51:03.325000 ┆ 0:51:09.575000 │
    │ 429            ┆ 0     ┆  Exactly. There's a lot of var… ┆ 0:51:09.575000 ┆ 0:51:12.825000 │
    │ 430            ┆ 1     ┆  Okay. Okay. Great. All right,… ┆ 0:51:12.825000 ┆ 0:51:34.825000 │
    │ 431            ┆ 0     ┆  Okay. Okay. Thank you. That's… ┆ 0:51:34.825000 ┆ 0:51:38.880000 │
    └────────────────┴───────┴─────────────────────────────────┴────────────────┴────────────────┘
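
    Finally, to come back to the original goal of writing a CSV, here is a minimal end-to-end sketch (the input filename is hypothetical):

    import polars as pl

    df = pl.read_csv("transcript.csv")  # hypothetical input filename

    result = (
       df
       .group_by(pl.col.actor.rle_id().alias("id"))
       .agg(
          pl.col("actor").first(),
          pl.col("sentence_index").first(),
          pl.col("phrase").str.join(),
          pl.col("start_time").first(),
          pl.col("end_time").last()
       )
       .sort("sentence_index")
       .drop("id", "sentence_index")
       .with_row_index("sentence_index")
    )

    result.write_csv("PostProcessing_output.csv")  # stays in Polars, so no numpy round-trip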