I'm fairly new to python and am a bit stuck on a problem I'm having extracting data that is quite messy, apologies if it's unclear.
I am trying to extract information from a large dataset (>200million rows) and need to base them on a few things.
Firstly, each row has an ID, all in the same format, 3 letters then a range of numbers afterwards, with no extra information in the cell.
Secondly, the CODE cells contain either a 4 letter code (CODE, DIFF, RAND etc) with some other information in the cell. A problem however, is the 4 letter code can sometimes be attached to 'SS', making them look like (CODESS, SSDIFF, RANDSS).
Finally, the DESCRIPTION column contains the data that I want to extract, where it has been entered as a free text description. The data ranges between XX12345 and XX12345678, where XX is a set of 2 letters, in the combinations of (XX | XY | XZ | YZ).
For example, my columns look like this:
ID | CODE | DESCRIPTION |
---|---|---|
ID123 | rdm CODE | Some free text and the data required XX123 |
ID124 | RAND Rdm | Some text, no required data here and more text |
ID125 | DIFFSS... | XY12345 and some text, plus more text within this box, and more desired data XY123456 and also YZ1234567 |
What I'm looking to do is extract this information to a new data frame in this format:
ID | CODE | EXTRACT |
---|---|---|
ID123 | CODE | XX123 |
ID125 | DIFF | XY12345 |
ID125 | DIFF | XY123456 |
ID125 | DIFF | YZ1234567 |
One of the big problems that I'm having is that most of the commands I've looked for and some similar problems, is using str.match or str.extract only seems to work if every row has one of these values, and that isn't the case for this dataset.
I've tried quite a few different ways but keep getting stuck on all the different parts of this and was hoping someone would be able to set me in the right direction with this?
Thank you!
Based on your description, you should be able to use:
df['CODE'] = df['CODE'].str.extract(r'\b([A-Z]{4})(?:SS)?\b')
df['EXTRACT'] = df.pop('DESCRIPTION').str.findall(r'\b((?:X[XYZ]|YZ)\d{5,8})\b')
df = df.explode('EXTRACT').dropna().reset_index(drop=True)
Output for your sample data (note there is no output for ID123
as the XX
value has only 3 digits, and you say in the question there should be at least 5)
ID CODE EXTRACT
0 ID125 DIFF XY12345
1 ID125 DIFF XY123456
2 ID125 DIFF YZ1234567