python, python-3.x, replace, export-to-csv

Python: How to replace a lot of strings


I'm trying to replace a large number of strings with the replacements defined in "replaceWord". The example below uses only a few strings, but I actually have thousands.

  • The entries in "replaceWord" follow no regular pattern.

However, the code I wrote does not work as I expected.

After running the script, the output is as follows:

     before     after
0  test1234  test1234
1  test1234  test1234
2  test1234      1349
3  test1234  test1234
4  test1234  test1234

I need the output to look like this:

  before     after
1 test1234   1349
2 test9012   te1210st
3 test5678   8579
4 april      I was born August
5 mcdonalds  i like chicken

Script:

import os.path, time, re
import pandas as pd
import csv


body01_before="test1234"
body02_before="test9012"
body03_before="test5678"
body04_before="i like mcdonalds"
body05_before="I was born april"

replaceWord = [
                ["test9012","te1210st"],
                ["test5678","8579"],
                ["test1234","1349"],
                ["april","August"],
                ["mcdonalds","chicken"],

]

cols = ['before','after']
df = pd.DataFrame(index=[], columns=cols)

for word in replaceWord:
    
    body01_after = re.sub(word[0], word[1], body01_before)
    body02_after = re.sub(word[0], word[1], body02_before)
    body03_after = re.sub(word[0], word[1], body03_before)
    body04_after = re.sub(word[0], word[1], body04_before)
    body05_after = re.sub(word[0], word[1], body05_before)

    df=df.append({'before':body01_before,'after':body01_after}, ignore_index=True)
    
#df.head()
print(df)

df.to_csv('test_replace.csv')

Solution
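
As for why the script in the question prints test1234 five times: each pass through the loop recomputes every body0N_after from the unmodified body0N_before using only the current pair, so earlier results are discarded, and only body01 is ever appended to df. That yields one row per entry of replaceWord, and only the row for the ["test1234","1349"] pair shows a change. Below is a minimal sketch of a corrected loop, assuming the inputs are kept in a list instead of separate variables. The names bodies and apply_all are mine, not from the question, and I use plain str.replace instead of re.sub because the search strings are literal text.

```python
replace_word = [
    ["test9012", "te1210st"],
    ["test5678", "8579"],
    ["test1234", "1349"],
    ["april", "August"],
    ["mcdonalds", "chicken"],
]

# Assumption: the inputs live in a list rather than in body0N_before variables.
bodies = [
    "test1234",
    "test9012",
    "test5678",
    "i like mcdonalds",
    "I was born april",
]


def apply_all(text, pairs):
    """Apply every (old, new) pair in order, feeding each result into the next."""
    for old, new in pairs:
        text = text.replace(old, new)  # literal replacement, no regex involved
    return text


# One row per input string, not one row per replacement pair.
rows = [{"before": body, "after": apply_all(body, replace_word)} for body in bodies]

for row in rows:
    print(row["before"], "->", row["after"])
```

If a DataFrame is needed afterwards, pd.DataFrame(rows) builds it in one step, which also avoids calling df.append inside the loop. Note that chained replacements can cascade: the output of one pair may be matched by a later pair.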

  • Use a regular expression that captures the non-digits (\D+) as the first group and the digits (\d+) as the second group. Then build the replacement from the second group (\2) followed by the first group (\1):

    df['after'] = df['before'].str.replace(r'(\D+)(\d+)', r'\2\1', regex=True)
    
    df
         before     after
    1  test1234  1234test
    2  test9012  9012test
    3  test5678  5678test
    

    Edit

    It seems that you do not have a DataFrame yet, only separate variables:

    body01_before="test1234"
    body02_before="test9012"
    body03_before="test5678"
    body04_before="i like mcdonalds"
    body05_before="I was born april"
    
    replaceWord = [
                    ["test9012","te1210st"],
                    ["test5678","8579"],
                    ["test1234","1349"],
                    ["april","August"],
                    ["mcdonalds","chicken"],
    
    ]
    
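    # Assumes `import re` and `import pandas as pd`, as in the question's script
    # Gather the names of the body0N_before variables in a list
    var_names = re.findall(r'body0\d_before', ','.join(globals().keys()))
    df = pd.DataFrame(var_names, columns=['before_1'])
    # Look up the value of each variable by name (no eval needed)
    df['before'] = df['before_1'].apply(lambda name: globals()[name])
    
    # Build the lookup table once instead of on every match
    mapping = dict(replaceWord)
    # Replacement function: return the mapped word, or the matched word itself if it has no entry
    repl = lambda m: mapping.get(m[0], m[0])
    
    # Do the replacement, one word (\w+) at a time
    df['after'] = df['before'].str.replace(r'(\w+)', repl, regex=True)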
    
    df
            before_1            before              after
    0  body01_before          test1234               1349
    1  body02_before          test9012           te1210st
    2  body03_before          test5678               8579
    3  body04_before  i like mcdonalds     i like chicken
    4  body05_before  I was born april  I was born August
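
Since the question mentions thousands of replacement strings, here is a hedged sketch of a variant that avoids globals() entirely and compiles all search strings into a single alternation pattern, so each input is scanned once. It assumes the inputs are held in a list; the names mapping, pattern, replace_all and bodies are illustrative and not part of the original code. Unlike the \w+ lookup above, this matches the keys as substrings, so wrap the pattern in \b...\b if only whole words should be replaced.

```python
import re

replace_word = [
    ["test9012", "te1210st"],
    ["test5678", "8579"],
    ["test1234", "1349"],
    ["april", "August"],
    ["mcdonalds", "chicken"],
]

mapping = dict(replace_word)

# Try longer keys first so a long key wins over a shorter key that is its prefix.
keys = sorted(mapping, key=len, reverse=True)
# re.escape makes every key match as literal text, even if it contains regex metacharacters.
pattern = re.compile("|".join(re.escape(key) for key in keys))


def replace_all(text):
    """Replace every key found in `text` in a single pass over the string."""
    return pattern.sub(lambda m: mapping[m.group(0)], text)


# Assumption: the inputs live in a list rather than in body0N_before variables.
bodies = [
    "test1234",
    "test9012",
    "test5678",
    "i like mcdonalds",
    "I was born april",
]
results = [replace_all(body) for body in bodies]

for before, after in zip(bodies, results):
    print(before, "->", after)
```

Because the pattern is applied in one pass, replaced text is never scanned again, so one replacement cannot trigger another. With a DataFrame, the same function can be applied with df['after'] = df['before'].map(replace_all).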