Search code examples
pythonstringpandasstrip

Modifying strings in a pandas DataFrame by row


I have the following strings in a pandas DataFrame in Python3, column string1 and string2:

import pandas as pd

datainput = [
    { 'string1': 'TTTABCDABCDTTTTT', 'string2': 'ABABABABABABABAA' },
    { 'string1': 'AAAAAAAA', 'string2': 'TTAAAATT' },
    { 'string1': 'TTABCDTTTTT', 'string2': 'ABABABABABA' }
]

df = pd.DataFrame(datainput)

df
            string1           string2
0  TTTABCDABCDTTTTT  ABABABABABABABAA
1          AAAAAAAA          TTAAAATT
2       TTABCDTTTTT       ABABABABABA

For each row, strings in columns string1 and string2 are defined to be the same length.

For each row of the DataFrame, the strings may need to be "cleaned" of beginning/trailing letters 'T'. However, for each row, the strings need to both be stripped of the same number of characters, so as the strings remain the same length.

The correct output is as follows:

df
            string1           string2
0          ABCDABCD      BABABABA
1          AAAA          AAAA
2          ABCD          ABAB

If these were two variables, it would be straightforward to calculate this with strip(), e.g.

string1 = "TTTABCDABCDTTTTT"
string2 = "ABABABABABABABAA"

length_original = len(string1)
num_left_chars = len(string1) - len(string1.lstrip('T'))
num_right_chars = len(string1.rstrip('T'))
edited = string1[num_left_chars:num_right_chars]
## print(edited)
## 'ABCDABCD'

However, in this case, one needs to iterate through all rows and redefine two rows at once. How could one modify each these strings row by row?

EDIT: My main confusion is, given both columns could T, how do I re-define them both?


Solution

  • A bit lengthy but gets the job done..

    import re
    def count_head(s):
        head = re.findall('^T+', s)
        if head:
            return len(head[0])
        return 0
    def count_tail(s):
        tail = re.findall('T+$', s)
        if tail:
            return len(tail[0])
        return 0
    df1 = df.copy()
    df1['st1_head'] = df1['string1'].apply(count_head)
    df1['st2_head'] = df1['string2'].apply(count_head)
    df1['st1_tail'] = df1['string1'].apply(count_tail)
    df1['st2_tail'] = df1['string2'].apply(count_tail)
    df1['length'] = df1['string1'].str.len()
    
    def trim_strings(row):
        head = max(row['st1_head'], row['st2_head'])
        tail = max(row['st1_tail'], row['st2_tail'])
        l = row['length']
        return {'string1': row['string1'][head:(l-tail)],
               'string2': row['string2'][head:(l-tail)]}
    new_df = pd.DataFrame(list(df1.apply(trim_strings, axis=1)))
    print(new_df)
    

    output:

        string1   string2
    0  ABCDABCD  BABABABA
    1      AAAA      AAAA
    2      ABCD      ABAB
    

    A more compact version:

    def trim(st1, st2):
        l = len(st1)
        head = max(len(st1) - len(st1.lstrip('T')), 
                  len(st2) - len(st2.lstrip('T')))
        tail = max(len(st1) - len(st1.rstrip('T')), 
                  len(st2) - len(st2.rstrip('T')))
        return (st1[head:(l-tail)],
               st2[head:(l-tail)])
    
    new_df = pd.DataFrame(list(
        df.apply(lambda r: trim(r['string1'], r['string2']), 
             axis=1)), columns=['string1', 'string2'])
    print(new_df)
    

    The main thing to notice is the df.apply(<your function>, axis=1), which lets you do any function (in this case acting on both columns at once), on each row.