I'm trying to clean one column which contains the ID number which is starting from S and 7 numbers, e.g.: 'S1234567' and save only this number into new column. I started with this column named Remarks, this is an example of the data inside:
Remarks
0 S0252508 Shipment UK
1 S0255111 Shipment UK
2 S0256352 Shipment UK
3 S0259138 Shipment UK
4 S0260425 Shipment US
I've managed to separate those rows which has the format S1234567 + text using this code:
merged_out['Remarks'] = merged_out['Remarks'].replace("\t", "\r")
merged_out['Remarks'] = merged_out['Remarks'].replace("\n", "\r")
s = merged_out['Remarks'].str.split("\r").apply(pd.Series, 1).stack()
s.index = s.index.droplevel(-1)
s.name = 'Remarks'
del merged_out['Remarks']
merged_out = merged_out.join(s)
merged_out[['Number','Remarks']] = merged_out.Remarks.str.split(" ", 1, expand=True)
After creating a data frame I found that there are a lot of mistakes inside of that column because the data are written there manually, so there are some examples of those wrong records:
Number
0. Pallets:
1. S0246734/S0246735/S0246736
3. delivery
4. S0258780 31 cok
5. S0246732-
6. 2
7. ok
8. nan
And this is only the wrong data which are in the Number column, I will need to clear this and save only those which has the correct number, if there is sth. like that: S0246732/S0246736/S0246738, then I need to have separated row for each number with the same data as it was for this record. For the other one I need to save those which contains the number, the other should have the null value.
Here is a regex approach that will do what I think your question asks:
import pandas as pd
merged_out = pd.DataFrame({
'Remarks':[
'S0252508 Shipment UK',
'S0255111 Shipment UK',
'S0256352 Shipment UK',
'S0259138/S0259139 Shipment UK',
'S12345678 Shipment UK',
'S0260425 Shipment US']
})
pat = r'(?:(\bS\d{7})/)*(\bS\d{7}\b)'
df = merged_out.Remarks.str.extractall(pat)
df = ( pd.concat([
pd.DataFrame(df.unstack().apply(lambda row: row.dropna().tolist(), axis=1), columns=['Number']),
merged_out],
axis=1).explode('Number') )
df.Remarks = df.Remarks.str.replace(pat + r'\s*', '', regex=True)
Input:
Remarks
0 S0252508 Shipment UK
1 S0255111 Shipment UK
2 S0256352 Shipment UK
3 S0259138/S0259139 Shipment UK
4 S12345678 Shipment UK
5 S0260425 Shipment US
Output:
Number Remarks
0 S0252508 Shipment UK
1 S0255111 Shipment UK
2 S0256352 Shipment UK
3 S0259138 Shipment UK
3 S0259139 Shipment UK
5 S0260425 Shipment US
4 NaN S12345678 Shipment UK
Explanation:
Series.str.extractall()
, use a pattern to obtain 0 or more occurrences of word boundary \b
followed by S
followed by 7 digits and a 1 occurrence of S
followed by 7 digits (flanked by word boundaries \b
)unstack()
to eliminate multiple index levelsapply()
with dropna()
and tolist()
to create a new dataframe with a Number
column containing a list of numbers for each rowexplode()
to add new rows for lists with more than one Number
itemSeries.str.replace()
, filter out the number matches using the previous pattern, plus r'\s*'
to match trailing whitespace characters, to obtain the residual Remarks
Notes:
Number
except that one row contains multiple Number
values separated by /
delimiters, and another row contains no valid Number
(it has S
followed by 8 digits, more than the 7 that make a valid Number
)