I have a pandas DataFrame with over a million rows, and I need to find all unique values in a given column as part of normalising the data. The column holds strings representing city names. I have already come a long way in normalising it: I lower-cased all values, stripped whitespace, and used pandas.core.frame.DataFrame.replace()
with regular expressions that seemed obvious for the data.
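The steps described above can be sketched roughly like this (the column name column_a and the replacement patterns are assumptions for illustration, not my actual data):

```python
import pandas as pd

# Hypothetical sample; the real DataFrame has over a million rows.
df = pd.DataFrame({"column_a": ["  Warszawa ", "Kraków", "WARSAW  "]})

# Lower-case, strip surrounding whitespace, then apply regex replacements
# that map known spelling variants onto a canonical name.
df["column_a"] = (
    df["column_a"]
    .str.lower()
    .str.strip()
    .replace({r"\bwarszawa\b": "warsaw", r"\bkraków\b": "krakow"}, regex=True)
)
```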
A sample of what I have (and what I want to achieve) is shown below:
In [1018]: sample
Out[1018]:
0
0 warsaw ## -> warsaw
1 krakow ## -> krakow
2 warszawa ## -> warsaw
3 cracovie ## -> krakow
4 warsawa ## -> warsaw
5 krkow ## -> krakow
6 krąków ## -> krakow
7 krakowie ## -> krakow
8 kraków ## -> krakow
9 varşovia ## -> warsaw
10 warschau ## -> warsaw
This is only a much bigger set of data, so I need to search for variations of the different city names with regular expressions, find all versions existing in the dataset, and proceed with the normalisation.
In [1023]: df.column_a.unique()
Out[1023]:
array(['warsaw', 'bydgoszcz', 'null', ..., 'kłodawa', 'kościelna wieś',
'poznań-jeżyce'], dtype=object)
In [1024]: len(df.column_a.unique())
Out[1024]: 3798
I have tried .str.contains(),
but that only gives me a boolean for each index, indicating whether the value in the given column matches the regex:
In [1029]: df.column_a.str.contains(r"\bwar.*")
Out[1029]:
0 True
1 False
2 False
3 True
4 False
5 False
6 False
7 False
8 False
9 False
10 False
...
However, what I'm looking for is to get the actual values that match the given regex. For the above example I would like to be able to get something like:
['warsaw','warszawa','warsawa','warschau']
Use boolean indexing (see the pandas documentation on indexing with a boolean mask):
In [143]: df[df.column_a.str.contains(r'\bwar.*')]
Out[143]:
0 warsaw
2 warszawa
4 warsawa
10 warschau
If there are null values, either filter them out first:
df[pd.notnull(df.column_a) & df.column_a.str.contains(r'\bwar.*')]
or, more concisely, pass na=False so that missing values count as non-matches:
df[df.column_a.str.contains(r'\bwar.*', na=False)]
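To get the matching values as a plain list, as asked for in the question, chain .unique() and .tolist() onto the filtered column. A minimal sketch, assuming the column is named column_a:

```python
import pandas as pd

# Hypothetical data including a missing value.
df = pd.DataFrame({"column_a": ["warsaw", "krakow", "warszawa", None, "warschau"]})

# na=False treats NaN/None as non-matches, so no separate null filter is needed.
mask = df.column_a.str.contains(r"\bwar.*", na=False)
matches = df.loc[mask, "column_a"].unique().tolist()
# matches == ['warsaw', 'warszawa', 'warschau']
```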