python regex pandas regex-group

How to select only numbers/digits from a given string and skip text using Python regex?


Given Strings:

57 years, 67 daysApr 30, 1789

61 years, 125 daysMar 4, 1797

57 years, 325 daysMar 4, 1801

57 years, 353 daysMar 4, 1809

58 years, 310 daysMar 4, 1817

In regex101:

Pattern = (?P<Years>[\d]{1,2}) years, (?P<Days>[\d]{1,3}) days(?P<Month>[\w]{3} [\d]{1,2}), (?P<Year>[\d]{4})

Output: [image: output of the regex pattern]

In Python (IDE: Jupyter Notebook): [image: Python output]

The DataFrame shows only NaN values. How can I solve this?


Solution

  • FYI, your code ran perfectly for me; maybe you have some whitespace issues in your DataFrame:

    import pandas as pd
    import numpy as np
    
    from io import StringIO
    
    st = StringIO("""57 years, 67 daysApr 30, 1789
    
    61 years, 125 daysMar 4, 1797
    
    57 years, 325 daysMar 4, 1801
    
    57 years, 353 daysMar 4, 1809
    
    58 years, 310 daysMar 4, 1817""")
    
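    # No line contains a run of three or more whitespace characters,
    # so each line is read as a single column (column 0).
    df = pd.read_csv(st, sep=r'\s\s\s+', header=None, engine='python')
    
    # Raw string, so the regex escapes (\d, \w) are not treated as string escapes.
    Pattern = r'(?P<Years>\d{1,2}) years, (?P<Days>\d{1,3}) days(?P<Month>\w{3} \d{1,2}), (?P<Year>\d{4})'
    
    # One output column per named group; rows that do not match become NaN.
    df[0].str.extract(Pattern)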
    

    Output:

      Years Days   Month  Year
    0    57   67  Apr 30  1789
    1    61  125   Mar 4  1797
    2    57  325   Mar 4  1801
    3    57  353   Mar 4  1809
    4    58  310   Mar 4  1817
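
If whitespace really is the culprit, one way to check is to test a single cell value with the standard `re` module. The sketch below is only an illustration under an assumption: the `dirty` string is a made-up value containing non-breaking spaces (`\xa0`), which is a common result of copying or scraping from web pages, and `normalize_spaces` is a hypothetical helper. Your actual data may be affected by something else.

```python
import re

# Same pattern as in the question, written as a raw string.
PATTERN = re.compile(
    r"(?P<Years>\d{1,2}) years, (?P<Days>\d{1,3}) days"
    r"(?P<Month>\w{3} \d{1,2}), (?P<Year>\d{4})"
)


def normalize_spaces(text):
    """Collapse each run of whitespace (non-breaking spaces too) into one space."""
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical cell value: looks the same on screen, but contains \xa0.
dirty = "57\xa0years, 67\xa0daysApr\xa030, 1789"

print(repr(dirty))            # repr() makes the hidden \xa0 characters visible
print(PATTERN.search(dirty))  # None: a literal space does not match \xa0

match = PATTERN.search(normalize_spaces(dirty))
print(match.groupdict())
# {'Years': '57', 'Days': '67', 'Month': 'Apr 30', 'Year': '1789'}
```

If `repr()` on one of your own cells shows characters like `\xa0`, cleaning the column first (for example with `df[0].str.replace("\xa0", " ", regex=False)`) before calling `str.extract` should make the pattern match.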