python pandas database dataframe regex-group

Pandas Dataframe extraction from columns with mixed entries

At the moment I am using an Excel sheet as an example for evaluation. It is a copy of an import where the program tends to fill a column with mixed entries.

In Excel it looks like this: (screenshot: Excel sheet)

My code looks like this in VS Code:


    # 03.01.2024
    # Regex: regular expressions for filtering, here applied to the frame from the evaluation

    import pandas as pd
    import numpy as np
    from datetime import datetime

    # Read the Excel file
    excel_file = 'F:\\Google Drive\\Matthias\\Arbeit\\ISQM\\08 - Tool Menü - Skripte DD                                                  19.12.2023\\Testskripte\\value_cnts_2.xlsx'

    xlsx = pd.ExcelFile(excel_file)
    excel_Tabelle = xlsx.sheet_names[0]    # name of the first sheet
    df = xlsx.parse(excel_Tabelle)

    # df['Vorgänger'] = df['Vorgänger'].fillna(0)
    # df = df['Nachfolger'].str.replace('----', '')

    df['Verknüpfung1'] = df['Nachfolger'].str.extract('(?P<EE>EE)')
    # df['Verknüpfung3'] = df['Nachfolger'].str.extract('[EA+,AA,EA-]')
    df['Verknüpfung2'] = df['Nachfolger'].str.extract('(?P<AA>AA)')

    # Rows without a match are NaN after extract; replace them with 0
    df['Verknüpfung1'] = df['Verknüpfung1'].fillna(0)
    df['Verknüpfung2'] = df['Verknüpfung2'].fillna(0)

    df

I read the Excel sheet into a pandas DataFrame. As you can see, I am trying regular expressions in the extract function.

With the regex101 editor I found some regular expressions using named capturing groups, which give me this output:

    Nachfolger           Verknüpfung1  Verknüpfung2
0   54;20                0             0
1   ----                 0             0
2   ----                 0             0
3   ----                 0             0
4   ----                 0             0
5   ----                 0             0
6   ----                 0             0
7   ----                 0             0
8   ----                 0             0
9   ----                 0             0
10  ----                 0             0
11  ----                 0             0
12  ----                 0             0
13  ----                 0             0
14  ----                 0             0
15  ----                 0             0
16  21;4;49;50;51        0             0
17  52                   0             0
18  ----                 0             0
19  52;128AA;207;22;223  0             AA
20  ----                 0             0
21  52;24                0             0
22  28                   0             0
23  ----                 0             0
24  52;227;27            0             0
25  30                   0             0
26  227                  0             0
27  ----                 0             0
28  52;31                0             0
29  ----                 0             0
30  138EE;34             EE            0
31  ----                 0             0
32  139EE;36             EE            0
33  ----                 0             0
34  140EE;38             EE            0
35  ----                 0             0
36  141EE;40             EE            0
37  ----                 0             0
37  ----    0   0

I want to filter the entries in the column that contain letters (such as EE or AA), and I want to count them.

My question now: is there a way to do this with a single regular expression that writes into only one column? I tried another expression ([EA+,AA,EA-]), where I just get an error.

Thank you!

Solution

It is not clear what output you expect. However, you can try:

df['Verknüpfung3'] = df['Nachfolger'].str.extract(r'(EA[+-]|AA)')

# OR

df['Verknüpfung'] = df['Nachfolger'].str.extract(r'(EE|EA[+-]|AA)')
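Since the question also asks about filtering and counting, here is a minimal sketch of how the extracted column could be used for that. The sample frame is hypothetical and only mirrors a few rows of the Nachfolger column from the question; whether you want the first match per row or every match depends on your data, so both variants are shown.

```python
import pandas as pd

# Hypothetical sample mirroring a few rows of the 'Nachfolger' column from the question
df = pd.DataFrame(
    {"Nachfolger": ["54;20", "----", "52;128AA;207;22;223", "138EE;34", "139EE;36"]}
)

pattern = r"(EE|EA[+-]|AA)"

# First matching code per row; rows without a match become NaN
df["Verknüpfung"] = df["Nachfolger"].str.extract(pattern, expand=False)

# Filter: keep only the rows that contain one of the codes
rows_with_code = df[df["Verknüpfung"].notna()]

# Count: occurrences per code (NaN is excluded by default)
counts = df["Verknüpfung"].value_counts()

# Alternative: count every match, in case one cell holds several codes
all_counts = df["Nachfolger"].str.extractall(pattern)[0].value_counts()

print(rows_with_code)
print(counts)
```

str.extract only returns the first match in each cell, while str.extractall returns one row per match, so the second count is the safer choice if a cell can contain more than one code.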

Regex101
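As for the error with [EA+,AA,EA-]: there are probably two separate problems. Square brackets define a character class, so that expression matches one single character out of E, A, +, comma and -, not one of the alternatives EA+, AA or EA-. In addition, str.extract needs at least one capture group, i.e. parentheses, and raises a ValueError without them. The sketch below illustrates both points on a hypothetical sample series (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical sample values, including an EA+ code that the question's data excerpt does not show
s = pd.Series(["128AA;207", "138EE;34", "12EA+;7", "----"])

# 1) No capture group: str.extract rejects the pattern with a ValueError
try:
    s.str.extract(r"[EA+,AA,EA-]")
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# 2) With a group added, the character class still matches only ONE character
single_char = s.str.extract(r"([EA+,AA,EA-])", expand=False)

# 3) Alternation with | matches the whole codes
codes = s.str.extract(r"(EE|EA[+-]|AA)", expand=False)

print(error_message)
print(single_char.tolist())
print(codes.tolist())
```

Note that in the second variant even the placeholder ---- produces a match, because - is part of the character class.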