python pandas dataframe multiple-columns

Clean column in data frame

I'm trying to clean one column which contains the ID number which is starting from S and 7 numbers, e.g.: 'S1234567' and save only this number into new column. I started with this column named Remarks, this is an example of the data inside:

         Remarks
0   S0252508 Shipment UK
1   S0255111 Shipment UK
2   S0256352 Shipment UK
3   S0259138 Shipment UK
4   S0260425 Shipment US

I've managed to separate those rows which has the format S1234567 + text using this code:

merged_out['Remarks'] = merged_out['Remarks'].replace("\t", "\r")
merged_out['Remarks'] = merged_out['Remarks'].replace("\n", "\r")
s = merged_out['Remarks'].str.split("\r").apply(pd.Series, 1).stack()
s.index = s.index.droplevel(-1)
s.name = 'Remarks'
del merged_out['Remarks']
merged_out = merged_out.join(s)
merged_out[['Number','Remarks']] = merged_out.Remarks.str.split(" ", 1, expand=True)

After creating a data frame I found that there are a lot of mistakes inside of that column because the data are written there manually, so there are some examples of those wrong records:

    Number
0. Pallets:
1. S0246734/S0246735/S0246736
3. delivery
4. S0258780 31  cok
5. S0246732-
6. 2
7. ok
8. nan

And this is only the wrong data which are in the Number column, I will need to clear this and save only those which has the correct number, if there is sth. like that: S0246732/S0246736/S0246738, then I need to have separated row for each number with the same data as it was for this record. For the other one I need to save those which contains the number, the other should have the null value.

Solution

Here is a regex approach that will do what I think your question asks:

import pandas as pd
merged_out = pd.DataFrame({
'Remarks':[
'S0252508 Shipment UK',
'S0255111 Shipment UK',
'S0256352 Shipment UK',
'S0259138/S0259139 Shipment UK',
'S12345678 Shipment UK',
'S0260425 Shipment US']
})

pat = r'(?:(\bS\d{7})/)*(\bS\d{7}\b)'
df = merged_out.Remarks.str.extractall(pat)
df = ( pd.concat([
    pd.DataFrame(df.unstack().apply(lambda row: row.dropna().tolist(), axis=1), columns=['Number']), 
    merged_out], 
    axis=1).explode('Number') )
df.Remarks = df.Remarks.str.replace(pat + r'\s*', '', regex=True)

Input:

                          Remarks
0           S0252508 Shipment UK
1           S0255111 Shipment UK
2           S0256352 Shipment UK
3  S0259138/S0259139 Shipment UK
4          S12345678 Shipment UK
5           S0260425 Shipment US

Output:

     Number                 Remarks
0  S0252508            Shipment UK
1  S0255111            Shipment UK
2  S0256352            Shipment UK
3  S0259138            Shipment UK
3  S0259139            Shipment UK
5  S0260425            Shipment US
4       NaN  S12345678 Shipment UK

Explanation:

with Series.str.extractall(), use a pattern to obtain 0 or more occurrences of word boundary \b followed by S followed by 7 digits and a 1 occurrence of S followed by 7 digits (flanked by word boundaries \b)
use unstack() to eliminate multiple index levels
use apply() with dropna() and tolist() to create a new dataframe with a Number column containing a list of numbers for each row
use explode() to add new rows for lists with more than one Number item
with Series.str.replace(), filter out the number matches using the previous pattern, plus r'\s*' to match trailing whitespace characters, to obtain the residual Remarks

Notes:

all rows in the sample input contain one valid Number except that one row contains multiple Number values separated by / delimiters, and another row contains no valid Number (it has S followed by 8 digits, more than the 7 that make a valid Number)