
Is there a more efficient way to reformat a pandas dataframe?


I'm working on an anomaly detection model for CPU data and my current dataframes are structured like this (with 10k+ servers and 168 hours):

Server Hour1 Hour2 Hour3
server1 val1.1 val1.2 val1.3
server2 val2.1 val2.2 val2.3

I need it to be in a structure like this:

Server time Value
server1 0 value0
server1 1 value1
server2 0 value0
server2 1 value1

The problem is there are 10k+ servers and 168 hourly values for each, so it's taking an eternity to iterate over. Is there a more efficient way to do this transformation?

My current attempt is creating a new df with nested for loops like so:

for index, row in df.iterrows():
    for n in range(0,167):
        new_df.loc[len(new_df.index)] = row
        new_df.iat[len(new_df.index)-1, 2] = n

for index, row in new_df.iterrows():
    for i, r in df.iterrows():
        new_df_ts = row[2]
        if(row[0] == r[0]):
            new_df.iat[index, 3] = df.iat[i, 2 + new_df_ts]



Solution

  • Use wide_to_long

    import pandas as pd
    import io
    df = pd.read_csv(io.StringIO("""
    Server  Hour1   Hour2   Hour3
    server1 val1.1  val1.2  val1.3
    server2 val2.1  val2.2  val2.3"""), sep=r"\s+")
    df = pd.wide_to_long(df, "Hour", "Server", "Time") \
        .rename(columns={"Hour": "Value"}) \
        .reset_index()
    

    which results in

        Server  Time    Value
    0  server1     1   val1.1
    1  server2     1   val2.1
    2  server1     2   val1.2
    3  server2     2   val2.2
    4  server1     3   val1.3
    5  server2     3   val2.3
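
As a rough check that wide_to_long copes with the size mentioned in the question, here is a minimal sketch on synthetic data. The server names and random values are made up for illustration; only the shape (10,000 servers by 168 hours) comes from the question. The call is the same as above, with the positional arguments written as keywords (stubnames, i, j).

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: 10,000 servers x 168 hourly readings.
# Server names and values are invented for this sketch.
n_servers, n_hours = 10_000, 168
rng = np.random.default_rng(0)
wide = pd.DataFrame(
    rng.random((n_servers, n_hours)),
    columns=[f"Hour{h}" for h in range(1, n_hours + 1)],
)
wide.insert(0, "Server", [f"server{s}" for s in range(1, n_servers + 1)])

# stubnames: shared column prefix; i: id column; j: name for the numeric suffix.
long = (
    pd.wide_to_long(wide, stubnames="Hour", i="Server", j="Time")
    .rename(columns={"Hour": "Value"})
    .reset_index()
)

print(long.shape)  # (1680000, 3)
```

No Python-level loop over rows is involved, which is where the speedup over the nested iterrows approach comes from.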
    

    Alternatively, use melt and then strip the Hour text out of the Time column.

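    # Start from the original wide df, not the wide_to_long result.
    df = df.melt("Server", var_name="Time", value_name="Value")
    df["Time"] = df["Time"].str.replace("Hour", "", regex=False).astype(int)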
    

    Then do whatever additional processing is required on the Time and Value columns, such as subtracting 1 from the time period. Use df.replace or df["Value"].str.replace to alter values if needed.
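
For instance, to get the zero-based time column and server-by-server ordering shown in the question, something along these lines should work. This is a sketch on a small placeholder frame; the numeric values are invented, and the column names are the ones produced above.

```python
import pandas as pd

# Toy long-format frame with the column names used above (values are placeholders).
df = pd.DataFrame({
    "Server": ["server1", "server2", "server1", "server2"],
    "Time": [1, 1, 2, 2],
    "Value": [0.1, 0.5, 0.2, 0.6],
})

# Shift to zero-based time: Hour1 -> 0, Hour2 -> 1, ...
df["Time"] = df["Time"] - 1

# Group each server's rows together, in time order.
# Note: server names sort as strings, so "server10" comes before "server2".
df = df.sort_values(["Server", "Time"], ignore_index=True)
print(df)
```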