Search code examples
pythonpandasdataframeextractdata-cleaning

Extracting data based on multiple conditions and unknowns for a data cleaning problem


I'm fairly new to python and am a bit stuck on a problem I'm having extracting data that is quite messy, apologies if it's unclear.

I am trying to extract information from a large dataset (>200million rows) and need to base them on a few things.

Firstly, each row has an ID, all in the same format, 3 letters then a range of numbers afterwards, with no extra information in the cell.

Secondly, the CODE cells contain either a 4 letter code (CODE, DIFF, RAND etc) with some other information in the cell. A problem however, is the 4 letter code can sometimes be attached to 'SS', making them look like (CODESS, SSDIFF, RANDSS).

Finally, the DESCRIPTION column contains the data that I want to extract, where it has been entered as a free text description. The data ranges between XX12345 and XX12345678, where XX is a set of 2 letters, in the combinations of (XX | XY | XZ | YZ).

For example, my columns look like this:

ID CODE DESCRIPTION
ID123 rdm CODE Some free text and the data required XX123
ID124 RAND Rdm Some text, no required data here and more text
ID125 DIFFSS... XY12345 and some text, plus more text within this box, and more desired data XY123456 and also YZ1234567

What I'm looking to do is extract this information to a new data frame in this format:

ID CODE EXTRACT
ID123 CODE XX123
ID125 DIFF XY12345
ID125 DIFF XY123456
ID125 DIFF YZ1234567

One of the big problems that I'm having is that most of the commands I've looked for and some similar problems, is using str.match or str.extract only seems to work if every row has one of these values, and that isn't the case for this dataset.

I've tried quite a few different ways but keep getting stuck on all the different parts of this and was hoping someone would be able to set me in the right direction with this?

Thank you!


Solution

  • Based on your description, you should be able to use:

    df['CODE'] = df['CODE'].str.extract(r'\b([A-Z]{4})(?:SS)?\b')
    df['EXTRACT'] = df.pop('DESCRIPTION').str.findall(r'\b((?:X[XYZ]|YZ)\d{5,8})\b')
    df = df.explode('EXTRACT').dropna().reset_index(drop=True)
    

    Output for your sample data (note there is no output for ID123 as the XX value has only 3 digits, and you say in the question there should be at least 5)

          ID  CODE    EXTRACT
    0  ID125  DIFF    XY12345
    1  ID125  DIFF   XY123456
    2  ID125  DIFF  YZ1234567