Search code examples
python-3.xpandasdataframedata-sciencetext-extraction

Extract only numbers and only string from pandas dataframe


I am trying to extract only numbers and only strings in two different dataframes. I am using regular expression to extract numbers and string.

import pandas as pd

df_num = pd.DataFrame({
        'Colors': ['lila1.5', 'rosa2.5', 'gelb3.5', 'grün4', 'rot5', 'schwarz6', 'grau7', 'weiß8', 'braun9', 'hellblau10'],
        'Animals': ['hu11nd', '12welpe', '13katze', 's14chlange', 'vo15gel', '16papagei', 'ku17h', '18ziege', '19pferd',
                    'esel20']
    })

for column in df_num.columns:
    df_num[column] = df_num[column].str.extract('(\d+)').astype(float)

print(df_num)

I have also tried using '([\d+][\d+\.\d+])' and '([\d+\.\d+])'

Here I am getting output but not what I am expecting. Though I am expecting float numbers I am not getting 1.5 or 2.5.

I am getting something like below image:

enter image description here

df_str = pd.DataFrame({
        'Colors': ['lila1.5', 'rosa2.5', 'gelb3', 'grün4', 'rot5', 'schwarz6', 'grau7', 'weiß8', 'braun9', 'hellblau10'],
        'Animals': ['hu11nd', '12welpe', '13katze', 's14chlange', 'vo15gel', '16papagei', 'ku17h', '18ziege', '19pferd',
                    'esel20']
    })

for column in df_str.columns:
    df_str[column] = df_str[column].str.extract('([a-zA-Z]+)')

print(df_str)

In this case when the number is at the end or in the beginning then I am getting the string but if the number placed in the middle or any other place then the result which I expect I am not getting. Current output is like below image:

enter image description here

I think my regular expression is not correct. Which will be the right regular expression to solve these problems? Or is there any other way to extract only numbers and only strings in pandas dataframe?


Solution

  • Your code is on the right track, you just need to account for the decimals and the possibility of integers :

    df_num['colors_num'] = df_num.Colors.str.extract(r'(\d+[.\d]*)')
    df_num['animals_num'] = df_num.Animals.str.extract(r'(\d+[.\d]*)')
    df_num['colors_str'] = df_num.Colors.str.replace(r'(\d+[.\d]*)','')
    df_num['animals_text'] = df_num.Animals.str.replace(r'(\d+[.\d]*)','')
    
    
        Colors  Animals colors_num  animals_num colors_str  animals_text
    0   lila1.5 hu11nd  1.5 11  lila    hund
    1   rosa2.5 12welpe 2.5 12  rosa    welpe
    2   gelb3.5 13katze 3.5 13  gelb    katze
    3   grün4   s14chlange  4   14  grün    schlange
    4   rot5    vo15gel 5   15  rot vogel
    5   schwarz6    16papagei   6   16  schwarz papagei
    6   grau7   ku17h   7   17  grau    kuh
    7   weiß8   18ziege 8   18  weiß    ziege
    8   braun9  19pferd 9   19  braun   pferd
    9   hellblau10  esel20  10  20  hellblau    esel