I am new to using Polars for Python. I take a dataframe as input, convert each column to a numpy array, reassign values at certain indices in these arrays, delete specific rows from all of them, and then convert each array back to a dataframe and concatenate those dataframes horizontally with pl.concat. I know the operation works because I can print the dataframe in the terminal. However, when I try to write outputDF to a csv file, I get the error below. Any help fixing the error would be greatly appreciated.
P.S.: Here is the link to the sample input data: https://mega.nz/file/u0Z0GS6b#uSD6PDqyHXIEfWDLNQR2VgaqBcBSgeLdSL8lSjTSq3M
thread '<unnamed>' panicked at 'should not be here', /Users/runner/work/polars/polars/polars/polars-core/src/chunked_array/ops/any_value.rs:103:32
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/Users/username/Desktop/my_code/segmentation/gpt3_embeddings.py", line 89, in <module>
    result = process_data(df)
  File "/Users/username/Desktop/my_code/segmentation/gpt3_embeddings.py", line 80, in process_data
    outputDF.write_csv("PostProcessing_output.csv")
  File "/Users/username/opt/anaconda3/envs/segmentation/lib/python3.9/site-packages/polars/internals/dataframe/frame.py", line 2033, in write_csv
    self._df.write_csv(
pyo3_runtime.PanicException: should not be here
My code looks as follows:
import numpy as np
import polars as pl

# PROCESSING THE TRANSCRIBED TEXT:
def process_data(inputDF):
    # Convert relevant columns in the dataframe to numpy arrays
    phraseArray = inputDF["phrase"].to_numpy()
    actorArray = inputDF["actor"].to_numpy()
    startTimeArray = inputDF["start_time"].to_numpy()
    endTimeArray = inputDF["end_time"].to_numpy()
    # get indicators marking where two consecutive rows have the same actor
    speaker_change = inputDF.select(pl.col("actor").diff())
    speaker_change = speaker_change.rename({"actor": "change"})
    inputDF = inputDF.with_column(speaker_change.to_series(0))
    # indices where diff() gave 0
    zero_indices = inputDF.filter(pl.col("change") == 0).select("sentence_index").to_series().to_list()
    if len(zero_indices) > 0:
        for index in reversed(zero_indices):
            extract_phrase = phraseArray[index]
            extract_endTime = endTimeArray[index]
            joined_phrases = phraseArray[index - 1] + extract_phrase
            phraseArray[index - 1] = joined_phrases
            endTimeArray[index - 1] = extract_endTime
            phraseArray = np.delete(phraseArray, index)
            actorArray = np.delete(actorArray, index)
            startTimeArray = np.delete(startTimeArray, index)
            endTimeArray = np.delete(endTimeArray, index)
        outputDF = pl.concat(
            [
                pl.DataFrame(actorArray, columns=["actor"], orient="col"),
                pl.DataFrame(phraseArray, columns=["phrase"], orient="col"),
                pl.DataFrame(startTimeArray, columns=["start_time"], orient="col"),
                pl.DataFrame(endTimeArray, columns=["end_time"], orient="col"),
            ],
            rechunk=True,
            how="horizontal",
        )
        outputDF = outputDF.with_row_index(name="sentence_index")
        outputDF = outputDF[["sentence_index", "actor", "phrase", "start_time", "end_time"]]
        print(outputDF[342:348])
        outputDF.write_csv("PostProcessing_output.csv")
        return outputDF
    else:
        return inputDF
I tried using df.hstack instead of pl.concat, but that did not work. I also tried rechunking the dataframe, which did not help either. I think the issue has to do with converting the columns into numpy arrays and then back into dataframes, but I am not sure.
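For what it's worth, here is a minimal sketch of what I suspect is happening (a hypothetical toy example, not my actual data): to_numpy() on a string column gives an object-dtype array, and depending on the Polars version, rebuilding a DataFrame from an object array can produce an Object column that write_csv cannot serialize.

import numpy as np
import polars as pl

# Hypothetical stand-in for the "phrase" column:
phrases = pl.Series("phrase", ["Thank you.", "And so when we get ready..."])

arr = phrases.to_numpy()
print(arr.dtype)  # object -- numpy has no native variable-length string dtype

# Depending on the Polars version, an object array may come back as an
# Object column rather than Utf8/String:
rebuilt = pl.DataFrame({"phrase": arr})
print(rebuilt.dtypes)

# Casting to a plain numpy string dtype first removes the ambiguity:
rebuilt = pl.DataFrame({"phrase": arr.astype(str)})
print(rebuilt.dtypes)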
It looks like you're trying to group consecutive rows based on the actor column and "combine" them? Names for this include "streaks" and "runs". Polars has .rle_id() (run-length encoding) to assign an id to each run:
df.head(8).with_columns(pl.col.actor.rle_id().alias("id"))
shape: (8, 6)
┌────────────────┬───────┬─────────────────────────────────┬────────────────┬────────────────┬─────┐
│ sentence_index ┆ actor ┆ phrase ┆ start_time ┆ end_time ┆ id │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str ┆ str ┆ str ┆ u32 │
╞════════════════╪═══════╪═════════════════════════════════╪════════════════╪════════════════╪═════╡
│ 0 ┆ 1 ┆ companies. So I don't have an… ┆ 0:00:00 ┆ 0:00:28.125000 ┆ 0 │
│ 1 ┆ 0 ┆ Oh yeah, that's fine. ┆ 0:00:28.125000 ┆ 0:00:29.625000 ┆ 1 │
│ 2 ┆ 1 ┆ Okay, good. And so I have a f… ┆ 0:00:29.625000 ┆ 0:00:38.625000 ┆ 2 │
│ 3 ┆ 0 ┆ I'm in the parking lot, yeah?… ┆ 0:00:38.625000 ┆ 0:00:41.375000 ┆ 3 │
│ 4 ┆ 1 ┆ Thank you. ┆ 0:00:41.375000 ┆ 0:00:42.125000 ┆ 4 │ # <-
│ 5 ┆ 1 ┆ And so when we get ready for … ┆ 0:00:42.375000 ┆ 0:01:44.125000 ┆ 4 │ # <-
│ 6 ┆ 0 ┆ Yeah, let's just get started. ┆ 0:01:44.125000 ┆ 0:01:45.375000 ┆ 5 │
│ 7 ┆ 1 ┆ Okay, let's do it. So first o… ┆ 0:01:45.375000 ┆ 0:01:52.625000 ┆ 6 │
└────────────────┴───────┴─────────────────────────────────┴────────────────┴────────────────┴─────┘
Rows 4 and 5 end up with the same id. This id can then be used with .group_by().
It looks like you want to aggregate: .last() the end_time, .str.join() the phrase values, and .first() each remaining column.
(
    df
    .group_by(pl.col.actor.rle_id().alias("id"))
    .agg(
        pl.col("actor").first(),
        pl.col("sentence_index").first(),
        pl.col("phrase").str.join(),
        pl.col("start_time").first(),
        pl.col("end_time").last()
    )
    .sort("sentence_index")
    .drop("id", "sentence_index")
    .with_row_index("sentence_index")
)
shape: (432, 5)
┌────────────────┬───────┬─────────────────────────────────┬────────────────┬────────────────┐
│ sentence_index ┆ actor ┆ phrase ┆ start_time ┆ end_time │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ str ┆ str ┆ str │
╞════════════════╪═══════╪═════════════════════════════════╪════════════════╪════════════════╡
│ 0 ┆ 1 ┆ companies. So I don't have an… ┆ 0:00:00 ┆ 0:00:28.125000 │
│ 1 ┆ 0 ┆ Oh yeah, that's fine. ┆ 0:00:28.125000 ┆ 0:00:29.625000 │
│ 2 ┆ 1 ┆ Okay, good. And so I have a f… ┆ 0:00:29.625000 ┆ 0:00:38.625000 │
│ 3 ┆ 0 ┆ I'm in the parking lot, yeah?… ┆ 0:00:38.625000 ┆ 0:00:41.375000 │
│ 4 ┆ 1 ┆ Thank you. And so when we get… ┆ 0:00:41.375000 ┆ 0:01:44.125000 │
│ … ┆ … ┆ … ┆ … ┆ … │
│ 427 ┆ 0 ┆ Okay, so I usually during Chr… ┆ 0:50:30.545000 ┆ 0:51:03.325000 │
│ 428 ┆ 1 ┆ So it's really about the flav… ┆ 0:51:03.325000 ┆ 0:51:09.575000 │
│ 429 ┆ 0 ┆ Exactly. There's a lot of var… ┆ 0:51:09.575000 ┆ 0:51:12.825000 │
│ 430 ┆ 1 ┆ Okay. Okay. Great. All right,… ┆ 0:51:12.825000 ┆ 0:51:34.825000 │
│ 431 ┆ 0 ┆ Okay. Okay. Thank you. That's… ┆ 0:51:34.825000 ┆ 0:51:38.880000 │
└────────────────┴───────┴─────────────────────────────────┴────────────────┴────────────────┘
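Because this approach stays inside Polars from start to finish (no numpy round-trip, so no object-dtype columns), writing the result out should not hit the original panic. Assuming the expression above is bound to a variable, say out:

out.write_csv("PostProcessing_output.csv")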