Search code examples
pythonpandasdataframemultiple-columns

Clean column in data frame


I'm trying to clean one column which contains the ID number which is starting from S and 7 numbers, e.g.: 'S1234567' and save only this number into new column. I started with this column named Remarks, this is an example of the data inside:

         Remarks
0   S0252508 Shipment UK
1   S0255111 Shipment UK
2   S0256352 Shipment UK
3   S0259138 Shipment UK
4   S0260425 Shipment US

I've managed to separate those rows which has the format S1234567 + text using this code:

merged_out['Remarks'] = merged_out['Remarks'].replace("\t", "\r")
merged_out['Remarks'] = merged_out['Remarks'].replace("\n", "\r")
s = merged_out['Remarks'].str.split("\r").apply(pd.Series, 1).stack()
s.index = s.index.droplevel(-1)
s.name = 'Remarks'
del merged_out['Remarks']
merged_out = merged_out.join(s)
merged_out[['Number','Remarks']] = merged_out.Remarks.str.split(" ", 1, expand=True)

After creating a data frame I found that there are a lot of mistakes inside of that column because the data are written there manually, so there are some examples of those wrong records:

    Number
0. Pallets:
1. S0246734/S0246735/S0246736
3. delivery
4. S0258780 31  cok
5. S0246732-
6. 2
7. ok
8. nan

And this is only the wrong data which are in the Number column, I will need to clear this and save only those which has the correct number, if there is sth. like that: S0246732/S0246736/S0246738, then I need to have separated row for each number with the same data as it was for this record. For the other one I need to save those which contains the number, the other should have the null value.


Solution

  • Here is a regex approach that will do what I think your question asks:

    import pandas as pd
    merged_out = pd.DataFrame({
    'Remarks':[
    'S0252508 Shipment UK',
    'S0255111 Shipment UK',
    'S0256352 Shipment UK',
    'S0259138/S0259139 Shipment UK',
    'S12345678 Shipment UK',
    'S0260425 Shipment US']
    })
    
    pat = r'(?:(\bS\d{7})/)*(\bS\d{7}\b)'
    df = merged_out.Remarks.str.extractall(pat)
    df = ( pd.concat([
        pd.DataFrame(df.unstack().apply(lambda row: row.dropna().tolist(), axis=1), columns=['Number']), 
        merged_out], 
        axis=1).explode('Number') )
    df.Remarks = df.Remarks.str.replace(pat + r'\s*', '', regex=True)
    

    Input:

                              Remarks
    0           S0252508 Shipment UK
    1           S0255111 Shipment UK
    2           S0256352 Shipment UK
    3  S0259138/S0259139 Shipment UK
    4          S12345678 Shipment UK
    5           S0260425 Shipment US
    

    Output:

         Number                 Remarks
    0  S0252508            Shipment UK
    1  S0255111            Shipment UK
    2  S0256352            Shipment UK
    3  S0259138            Shipment UK
    3  S0259139            Shipment UK
    5  S0260425            Shipment US
    4       NaN  S12345678 Shipment UK
    

    Explanation:

    • with Series.str.extractall(), use a pattern to obtain 0 or more occurrences of word boundary \b followed by S followed by 7 digits and a 1 occurrence of S followed by 7 digits (flanked by word boundaries \b)
    • use unstack() to eliminate multiple index levels
    • use apply() with dropna() and tolist() to create a new dataframe with a Number column containing a list of numbers for each row
    • use explode() to add new rows for lists with more than one Number item
    • with Series.str.replace(), filter out the number matches using the previous pattern, plus r'\s*' to match trailing whitespace characters, to obtain the residual Remarks

    Notes:

    • all rows in the sample input contain one valid Number except that one row contains multiple Number values separated by / delimiters, and another row contains no valid Number (it has S followed by 8 digits, more than the 7 that make a valid Number)