Search code examples
pythonpandasrows

How to rearrange rows on Pandas based on based on string matching conditions on unnamed columns?


We have a Panda data frame as below.

   Unnamed:0               T1    T2    T3   ...  T120
 0  cheetah Running         x1    x2    x1   ...   x3
 1  Running Jaguar          x1    x10   x3   ...   x7
 2  Cougar Running          x1    x2    x1   ...   x3
 3  Bengal Tiger Running    x5    x2    x4   ...   x17
 4  Sleeping Bali Tiger     x55   x61   x11  ...   x31
 5  Javan Leopard Sleeping  x42   x67   x17  ...   x34
 6  Leopard Running         x2    x5    x2   ...   x3
 7  Bengal Tiger Running    x5    x2    x4   ...   x17
..      ...                ...   ...   ...  ...   ...
199 Florida Panther Eating  x71   x80   x101 ...   x94
200 Running Eastern Cougar  x5    x1     x2  ...   x3
201 Congo Lion Sleeping     x57   x61    x14  ...  x38

We are trying to reorganize this data frame as below. In the above data frame, the first column is unnamed. We try to check for known common behaviors such as "Running", "Sleeping", etc, in that unnamed column and try to rearrange the data frame as below.

        Unnamed:0               T1    T2    T3   ...  T120
     0  cheetah Running         x1    x2    x1   ...   x3
     1  Running Jaguar          x1    x10   x3   ...   x7
     2  Cougar Running          x1    x2    x1   ...   x3
     3  Bengal Tiger Running    x5    x2    x4   ...   x17
     4  Running Eastern Cougar  x5    x1     x2  ...   x3
     5  Bengal Tiger Running    x5    x2    x4   ...   x17
     6  Leopard Running         x2    x5    x2   ...   x3
     4  Sleeping Bali Tiger     x55   x61   x11  ...   x31
     5  Javan Leopard Sleeping  x42   x67   x17  ...   x34
     6  Congo Lion Sleeping     x57   x61    x14  ...  x38  
     ..      ...                ...   ...   ...  ...   ...
    201 Florida Panther Eating  x71   x80   x101 ...   x94 
    

I attempted it in the following way but I added a name to the column. I tried df[df.columns.str.contains('^Unnamed')] but no luck.

import pandas as pd

df = pd.read_csv('a_behav_cat.csv')

df_new = pd.DataFrame()
df_new = df_new.append(df[df["name"].str.contains("Running")])
df_new = df_new.append(df[df["name"].str.contains("Sleeping")])
print(df_new.to_string())

QUESTION 1: I think there should be a better or Pythonic way to do this. Appreciate your kind consideration on this. Also this checks for the exact match of the string which is not ideal as the dataset may have simple "running" and simple "sleeping" :), etc. I tried .lower() function which didn't work.

PURPOSE: The purpose of this is to identify for a single observation how many different x categories are there. Here T1, T2, T3, ... T120 are observations. We need to identify how many common values are there for each observation. i.e. for T1, for "Running" there are 3, 'x1's and 3 'x5's and 1 'x2'.

To do this first we thought of rearranging the data frame as described above.

Yet, we are not sure whether this rearrangement is required for the purpose. Also, the output looks multi-dimensional. That is for T1, for "Running" how many x1, x3, x5s are there need to be stored. Similarly, this needs to be applied to other behaviors such as "eating", "sleeping", etc.

QUESTION2: What is the best approach to achieve this? Any data structure that is suitable for this purpose? Are there a better way of achieving the purpose stated above without rearranging the dataframe?

Here's a sample csv if you wanna do a test.

,T1,T2,T3,T4
cheetah Running,x1,x2,x1,x3
Running Jaguar,x1,x10,x3,x7
Cougar Running,x1,x2,x1,x3
Bengal Tiger Running,x5,x2,x4,x17
Sleeping Bali Tiger,x55,x61,x11,x31
Javan Leopard Sleeping,x42,x67,x17,x34
Leopard Running,x2,x5,x2,x3
Bengal Tiger Running,x5,x2,x4,x17
Florida Panther Eating,x71,x80,x101,x94
Running Eastern Cougar,x5,x1,x2,x3
Congo Lion Sleeping,x57,x61,x14,x38

Solution

  • IIUC, you can use a dict mapping to set the right category to the row:

    # Your list of patterns
    MAPPING = {'S': ['sleep', 'sleeping'],
               'R': ['run', 'running'],
               'E': ['eat', 'eating']}
    
    # Reverse the mapping (swap keys and values)
    rev = {v: k for k, l in MAPPING.items() for v in l}
    
    # Create the regex pattern
    pat = fr"\b({'|'.join(rev)})\b"
    
    # Extract from data
    df['CAT'] = df['Unnamed: 0'].str.lower().str.extract(pat, expand=False).map(rev)
    

    Output:

    >>> df
                    Unnamed: 0   T1   T2    T3   T4 CAT
    0          cheetah Running   x1   x2    x1   x3   R
    1           Running Jaguar   x1  x10    x3   x7   R
    2           Cougar Running   x1   x2    x1   x3   R
    3     Bengal Tiger Running   x5   x2    x4  x17   R
    4      Sleeping Bali Tiger  x55  x61   x11  x31   S
    5   Javan Leopard Sleeping  x42  x67   x17  x34   S
    6          Leopard Running   x2   x5    x2   x3   R
    7     Bengal Tiger Running   x5   x2    x4  x17   R
    8   Florida Panther Eating  x71  x80  x101  x94   E
    9   Running Eastern Cougar   x5   x1    x2   x3   R
    10     Congo Lion Sleeping  x57  x61   x14  x38   S