Search code examples
pythonregexpandasnormalization

How to find actual unique values in pandas DataFrame that match a regex


I have a pandas DataFrame with over a million rows and I need to find all unique values (for a given column) in the process of trying to normalise the data. The data in the given column is of type string - representing city names - and I have come a long way already in normalising the data, by taking certain steps as lower-casing all values in the column, striping and used pandas.core.frame.DataFrame.replace() using regular expressions that seemed obvious for the data.

A sample of what I have ( and I want to achieve) found below:

In [1018]: sample
Out[1018]: 
       0
0     warsaw   ## -> warsaw
1     krakow   ## -> krakow
2   warszawa   ## -> warsaw
3   cracovie   ## -> krakow
4    warsawa   ## -> warsaw
5      krkow   ## -> krakow
6     krąków   ## -> krakow
7   krakowie   ## -> krakow
8     kraków   ## -> krakow
9   varşovia   ## -> warsaw
10  warschau   ## -> warsaw

only much bigger set of data than this, such that I need to search for variations of different city names using reg expressions in order to find all existing in the dataset versions and proceed with the normalisation.

In [1023]: df.column_a.unique()
Out[1023]: 
array(['warsaw', 'bydgoszcz', 'null', ..., 'kłodawa', 'kościelna wieś',
   'poznań-jeżyce'], dtype=object)

In [1024]: len(df.column_a.unique())
Out[1024]: 3798

I have tried with .str.contains() but I am only getting a boolean for those indices that have - under the defined column - values that match the given regex:

In [1029]: df.column_a.str.contains(r"\bwar.*")
Out[1029]: 
0       True
1      False
2      False
3       True
4      False
5      False
6      False
7      False
8      False
9      False
10     False
   ...  

However, what I'm looking for is to get the actual values that match the given regex. For the above example I would like to be able to get something like:

['warsaw','warszawa','warsawa','warschau']

Solution

  • Use boolean indexing -- refer to document

       In [143] df[df.column_a.str.contains(r'\bwar.*')]
       Out [143]
       0     warsaw
       2   warszawa
       4    warsawa
       10  warschau
    

    If there are null values then do this:

    df[pd.notnull(df.column_a) & df.column_a.str.contains(r'\bwar.*')]