I have the following strings in a pandas DataFrame in Python3, column string1
and string2
:
import pandas as pd
datainput = [
{ 'string1': 'TTTABCDABCDTTTTT', 'string2': 'ABABABABABABABAA' },
{ 'string1': 'AAAAAAAA', 'string2': 'TTAAAATT' },
{ 'string1': 'TTABCDTTTTT', 'string2': 'ABABABABABA' }
]
df = pd.DataFrame(datainput)
df
string1 string2
0 TTTABCDABCDTTTTT ABABABABABABABAA
1 AAAAAAAA TTAAAATT
2 TTABCDTTTTT ABABABABABA
For each row, strings in columns string1
and string2
are defined to be the same length.
For each row of the DataFrame, the strings may need to be "cleaned" of beginning/trailing letters 'T'. However, for each row, the strings need to both be stripped of the same number of characters, so as the strings remain the same length.
The correct output is as follows:
df
string1 string2
0 ABCDABCD BABABABA
1 AAAA AAAA
2 ABCD ABAB
If these were two variables, it would be straightforward to calculate this with strip()
, e.g.
string1 = "TTTABCDABCDTTTTT"
string2 = "ABABABABABABABAA"
length_original = len(string1)
num_left_chars = len(string1) - len(string1.lstrip('T'))
num_right_chars = len(string1.rstrip('T'))
edited = string1[num_left_chars:num_right_chars]
## print(edited)
## 'ABCDABCD'
However, in this case, one needs to iterate through all rows and redefine two rows at once. How could one modify each these strings row by row?
EDIT: My main confusion is, given both columns could T
, how do I re-define them both?
A bit lengthy but gets the job done..
import re
def count_head(s):
head = re.findall('^T+', s)
if head:
return len(head[0])
return 0
def count_tail(s):
tail = re.findall('T+$', s)
if tail:
return len(tail[0])
return 0
df1 = df.copy()
df1['st1_head'] = df1['string1'].apply(count_head)
df1['st2_head'] = df1['string2'].apply(count_head)
df1['st1_tail'] = df1['string1'].apply(count_tail)
df1['st2_tail'] = df1['string2'].apply(count_tail)
df1['length'] = df1['string1'].str.len()
def trim_strings(row):
head = max(row['st1_head'], row['st2_head'])
tail = max(row['st1_tail'], row['st2_tail'])
l = row['length']
return {'string1': row['string1'][head:(l-tail)],
'string2': row['string2'][head:(l-tail)]}
new_df = pd.DataFrame(list(df1.apply(trim_strings, axis=1)))
print(new_df)
output:
string1 string2
0 ABCDABCD BABABABA
1 AAAA AAAA
2 ABCD ABAB
A more compact version:
def trim(st1, st2):
l = len(st1)
head = max(len(st1) - len(st1.lstrip('T')),
len(st2) - len(st2.lstrip('T')))
tail = max(len(st1) - len(st1.rstrip('T')),
len(st2) - len(st2.rstrip('T')))
return (st1[head:(l-tail)],
st2[head:(l-tail)])
new_df = pd.DataFrame(list(
df.apply(lambda r: trim(r['string1'], r['string2']),
axis=1)), columns=['string1', 'string2'])
print(new_df)
The main thing to notice is the df.apply(<your function>, axis=1)
, which lets you do any function (in this case acting on both columns at once), on each row.