Search code examples
pythonpandaswhitespace

Sort a pandas df by trailing characters with variable whtespaces


I need to filter data by the end character in a list of strings. The problem I have is that the strings have a variable number of trailing whitespaces so when I use endswith('_h') to separate the groups, the names that have a couple of whitespaces after them are not captured.

How can I filter based on ends with certain characters but allow for the variable whitespace

for example, I want to sort a list of strings by the trailing characters into groupd _d, _h and dt: I have a df with the following

'''

obs_d = [  
"48973a_d   ",   
"48973a_h   ",   
"48973adt   ",   
"art_223wtb_d",
"art_223wtb_h",
"art_223wtbdt"]

values = [
1.3664,
1.02E-02,
191.84,
1.39E-04,
1.53E-02,
14.267]

d = {'obs': obs_d, 'vals': values}
df = pd.DataFrame(data=d)

hds = {}
ddns = {}
dt = {}
for r in range(len(df)):
    if df.iloc[r]['obs'].endswith('_d'):
        ddns[df.iloc[r]['obs']] = df.iloc[r]['vals']'''

Only the art_223wtb_d value is captured because it has no trailing whitespaces. How can I solve this? THanks


Solution

  • IIUC, this should work in your case:

    df.assign(postfix = lambda df: df.obs.str.rstrip().str[-2:]).set_index('postfix').sort_index()
    

    Result:

                      obs        vals
    postfix                          
    _d        48973a_d       1.366400
    _d       art_223wtb_d    0.000139
    _h        48973a_h       0.010200
    _h       art_223wtb_h    0.015300
    dt        48973adt     191.840000
    dt       art_223wtbdt   14.267000