Search code examples
pythonpandasfilteringpython-reisin

How to modify multiple values in one column, but skip others in pandas python


Going on two months in python and I am focusing hard on Pandas right now. In my current position I use VBA on data frames, so learning this to slowly replace it and further my career. As of now I believe my true problem is the lack of understanding a key concept(s). Any help would be greatly appreciated.

That said here is my problem:

Where could I go to learn more on how to do stuff like this for more precise filtering. I'm very close but there is one key aspect I need.

Goal(s)

Main goal I need to skip certain values in my ID column. The below code takes out the Dashes "-" and only reads up to 9 digits. Yet, I need to skip certain IDs because they are unique.

After that I'll start to work on comparing multiple sheets.

  • Main data frame IDs is formatted as 000-000-000-000
  • The other data frames that I will compare it to have it with no dashes "-" as 000000000 and three less 000's totaling nine digits.

The unique IDs that I need skipped are the same in both data frames, but are formatted completely different ranging from 000-000-000_#12, 000-000-000_35, or 000-000-000_z.

My code that I will use on each ID except the unique ones:

 dfSS["ID"] = dfSS["ID"].str.replace("-", "").str[:9]

but I want to use an if statement like (This does not work)

lst = ["000-000-000_#69B", "000-000-000_a", "etc.. random IDs", ]

if ~dfSS["ID"].isin(lst ).any()
    dfSS["ID"] = dfSS["ID"].str.replace("-", "").str[:9]
else:
    pass

For more clarification my input DataFrame is this:

            ID               Street #   Street Name 
0   004-330-002-000         2272        Narnia  
1   021-521-410-000_128     2311        Narnia  
2   001-243-313-000         2235        Narnia  
3   002-730-032-000         2149        Narnia
4   000-000-000_a           1234        Narnia

And I am looking to do this as the output:

            ID               Street #   Street Name 
0   004330002               2272        Narnia  
1   021-521-410-000_128     2311        Narnia  
2   001243313000            2235        Narnia  
3   002730032000            2149        Narnia
4   000-000-000_a           1234        Narnia

Notes:

  • dfSS is my Dataframe variable name aka the excel I am using. "ID" is my column heading. Will make this an index after the fact
  • My Data frame on this job is small with # of (rows, columns) as (2500, 125)
  • I do not get an error message so I am guessing maybe I need a loop of some kind. Starting to test for loops with this as well. no luck there... yet.

Here is where I have been to research this:


Solution

  • There are a number of ways to do this. The first way here doesn't involve writing a function.

    # Create a placeholder column with all transformed IDs
    dfSS["ID_trans"] = dfSS["ID"].str.replace("-", "").str[:9]
    dfSS.loc[~dfSS["ID"].isin(lst), "ID"] = dfSS.loc[~dfSS["ID"].isin(lst), "ID_trans"] # conditional indexing
    

    The second way is to write a function that conditionally converts the IDs, and it's not as fast as the first method.

    def transform_ID(ID_val):
        if ID_val not in lst:
            return ID_val.replace("-", "")[:9]
    
    dfSS['ID_trans'] = dfSS['ID'].apply(transform_ID)