Search code examples
pythonregexpandasregex-negation

Match everything except a complex regex pattern and replace it in Pandas


I have a complex regex pattern to match mixed dates for a csv column in pandas df. I would like to replace everything except the regex pattern match with "" . I have tried pretty much all the negation cases (^ ?! and others). But I keep replacing the regex match with "" (empty string). My Code:

import pandas as pd 
df.read_csv('path')
df=DataFrame(df)
df.columns=['Date'] 
Date=df.Date
df['Date']=df['Date'].str.replace(r'^((\b(0?[1-9]|[12]\d|30|31)[^\w\d\r\n:](0?[1-9]|1[0-2])[^\w\d\r\n:](\d{4}|\d{2})\b)|(\b(0?[1-9]|1[0-2])[^\w\d\r\n:](0?[1-9]|[12]\d|30|31)[^\w\d\r\n:](\d{4}|\d{2})\b))','')

Some examples of my data:

Date
21/04/2004
[N/F]
6/07/2004
{}
[N/F]
6/10/2004
16/06/2004
{}
21/06/2004
[N/F]
1/03/2018
23/03/17
{}
{}
4/04/2006
19/05/2006
"**3/04/2006/-2/06
2006**"

Expected Output

21/04/2004

6/07/2004


6/10/2004
16/06/2004

21/06/2004

1/03/2018
23/03/17


4/04/2006
19/05/2006
3/04/2006

I would appreciate your help. Many thanks.


Solution

  • I have simplified your regex a little and am extracting rather than replacing:

    Loading your data to a DataFrame:

    import pandas as pd
    import numpy as np
    
    df = pd.read_csv('data.csv')
    print(df)
    

    Gives:

                             Date
    0                  21/04/2004
    1                       [N/F]
    2                   6/07/2004
    3                          {}
    4                       [N/F]
    5                   6/10/2004
    6                  16/06/2004
    7                          {}
    8                  21/06/2004
    9                       [N/F]
    10                  1/03/2018
    11                   23/03/17
    12                         {}
    13                         {}
    14                  4/04/2006
    15                 19/05/2006
    16  **3/04/2006/-2/06\n2006**
    

    Now extract anything that can be parsed as a date:

    pattern = r'(([1-9]|[12][0-9]|3[01])\/(0[1-9]|1[012])\/(20[01][0-9]|[0-9]{2}))'
    df['extracted_date'] = df['Date'].astype(str).str.extract(pattern)[0]
    df = df.fillna('')
    print(df)
    

    Which returns:

                             Date extracted_date
    0                  21/04/2004     21/04/2004
    1                       [N/F]               
    2                   6/07/2004      6/07/2004
    3                          {}               
    4                       [N/F]               
    5                   6/10/2004      6/10/2004
    6                  16/06/2004     16/06/2004
    7                          {}               
    8                  21/06/2004     21/06/2004
    9                       [N/F]               
    10                  1/03/2018      1/03/2018
    11                   23/03/17       23/03/17  
    12                         {}               
    13                         {}               
    14                  4/04/2006      4/04/2006
    15                 19/05/2006     19/05/2006
    16  **3/04/2006/-2/06\n2006**      3/04/2006