Search code examples
pythonpandasslicedata-analysis

slicing a dataframe by text within a text


I have a dataframe which looks like this:

import pandas as pd

saf_data = {'col1': ['U1', 'U2', 'U3', 'U4'], 'col2': ['1', '2|6', '4a|6a', '6b']}

saf_df = pd.DataFrame(saf_data)

I want to do the following

  1. keep all elements in col1 that contain the "text" 6 in col2

So the expected outcome should contain ['U2', 'U3', 'U4'] or be a dataframe that looks like this:

 col1   col2
0   U2    2|6
1   U3  4a|6a
2   U4     6b

Can someone help me how I can do that? Please I am looking for the most efficient and general code, because my dataframe is quite large.


Solution

  • IIUC str.contains

    saf_df[saf_df.col2.str.contains('6')]
    Out[51]: 
      col1   col2
    1   U2    2|6
    2   U3  4a|6a
    3   U4     6b
    

    Since you mentioned your df is quit large, I will recommend using for loop here, see the timing below

    %timeit saf_df[saf_df.col2.str.contains('6')]
    10 loops, best of 3: 20.1 ms per loop
    %timeit saf_df[['6' in x for x in saf_df.col2]]
    100 loops, best of 3: 3.14 ms per loop