Search code examples
pythonpandasdataframelarge-data

Replacing punctuation in a data frame based on punctuation list


Using Canopy and Pandas, I have data frame a which is defined by:

a=pd.read_csv('text.txt')

df=pd.DataFrame(a)

df.columns=["test"]

test.txt is a single column file that contains a list of string that contains text, numerical and punctuation.

Assuming df looks like:


test

%hgh&12

abc123!!!

porkyfries


I want my results to be:


test

hgh12

abc123

porkyfries


Effort so far:

from string import punctuation /-- import punctuation list from python itself

a=pd.read_csv('text.txt')

df=pd.DataFrame(a)

df.columns=["test"] /-- define the dataframe


for p in list(punctuation):

     ...:     df2=df.med.str.replace(p,'')

     ...:     df2=pd.DataFrame(df2);

     ...:     df2

The command above basically just returns me with the same data set. Appreciate any leads.

Edit: Reason why I am using Pandas is because data is huge, spanning to bout 1M rows, and future usage of the coding will be applied to list that go up to 30M rows. Long story short, I need to clean data in a very efficient manner for big data sets.


Solution

  • Use replace with correct regex would be easier:

    In [41]:
    
    import pandas as pd
    pd.set_option('display.notebook_repr_html', False)
    df = pd.DataFrame({'text':['test','%hgh&12','abc123!!!','porkyfries']})
    df
    Out[41]:
             text
    0        test
    1     %hgh&12
    2   abc123!!!
    3  porkyfries
    
    [4 rows x 1 columns]
    

    use regex with the pattern which means not alphanumeric/whitespace

    In [49]:
    
    df['text'] = df['text'].str.replace('[^\w\s]','')
    df
    Out[49]:
             text
    0        test
    1       hgh12
    2      abc123
    3  porkyfries
    
    [4 rows x 1 columns]