Search code examples
pythonstringlistsparse-matrixrepeat

Using the number in front of a string to repeat that string as many times as the number says


I am trying to create a spare matrix assigning 0 and 1 to different roles in order to match two different dataframes. However, I am struggling with the format of the data:

    data = {'Location': {0: 'Madrid',
  1: 'Barcelona',
  2: 'Paris',
  3: 'London ',
  4: 'New York',
  5: 'Berlin',
  6: 'Birminham',
  7: 'Tanzania'},
 'Description': {0: 'M3',
  1: 'P5',
  2: 'M3P5',
  3: 'M3',
  4: 'M3P5T8',
  5: 'P5T8',
  6: '',
  7: 'FT7 M3'},
 'Branch_A': {0: 'Auditor or Auditor(S), Accountant or Accountant(S), PayRoll_Manager, 2 Brand_Manager, 3 IT_Support, Business_analyst, Developer, Cyber_security',
  1: 'Accountant or Accountant(S), PayRoll_Manager, Brand_Manager, 2 Developer, Cyber_security',
  2: 'Auditor or Auditor(S), 2 Accountant, Business_analyst, Developer, Cyber_security',
  3: "Auditor or Auditor(S), Accountant, PayRoll_Manager, 3 Brand_Manager, 2 IT_Support, Business_analyst, Developer, Cyber_security'",
  4: 'Auditor or Auditor(S), Accountant or Accountant(S), PayRoll_Manager, Brand_Manager, IT_Support, Business_analyst, Developer, Cyber_security',
  5: 'Auditor or Auditor(S), 2 PayRoll_Manager, Brand_Manager, 2 Business_analyst, Developer, Cyber_security',
  6: '----',
  7: 'Auditor or Auditor(S), IT_Support, Business_analyst, Developer, Cyber_security'},
 'Branch_B': {0: 'Accountant or Accountant(S), PayRoll_Manager, Brand_Manager',
  1: 'Accountant or Accountant(S), PayRoll_Manager, Brand_Manager',
  2: 'Accountant or Accountant(S), PayRoll_Manager, Brand_Manager',
  3: 'Accountant or Accountant(S), PayRoll_Manager, Brand_Manager',
  4: '',
  5: 'Accountant or Accountant(S), PayRoll_Manager, Brand_Manager, Developer',
  6: '',
  7: ''},
 'Branch_C': {0: 'IT_Support, Business_analyst, Developer, Cyber_security',
  1: 'IT_Support, Business_analyst, Developer, Cyber_security',
  2: 'IT_Support, Business_analyst, Developer, Cyber_security',
  3: '',
  4: 'IT_Support, Business_analyst, Developer, Cyber_security',
  5: 'IT_Support, Business_analyst, Developer, Cyber_security',
  6: '----',
  7: ''}}

I managed to create a list of dictionaries so I can split every single role:

def extract_data(row):
    positions = row['Branch_A'].split(',')
    result = []
    for pos in positions:
        result.append({pos})
    return result

df['Branch_A'] = df.apply(extract_data, axis=1)

Obtaining:

df['Branch_A']
0    [{Auditor or Auditor(S)}, { Accountant or Acco...
1    [{Accountant or Accountant(S)}, { PayRoll_Mana...
2    [{Auditor or Auditor(S)}, { 2 Accountant}, { B...
3    [{Auditor or Auditor(S)}, { Accountant}, { Pay...
4    [{Auditor or Auditor(S)}, { Accountant or Acco...
5    [{Auditor or Auditor(S)}, { 2 PayRoll_Manager}...
6                                             [{----}]
7    [{Auditor or Auditor(S)}, { IT_Support}, { Bus...

What I am trying to do is, where I have a number in front of the role, repeat that role as many time as the number says. The idea is to fill those roles with workers that I have in another dataframe, however I do not think that I can find a code that would understand that I want 2 PayRoll Managers for example. Also, in the workers dataframe I have some workers that are Auditor and others who are Auditor(S). Is there any way where I can write a code that will use an "or" operator?

Thank youuu


Solution

  • What you expect it's not really clear but you can use the following code as starting point to extract number of roles per location

    pat = r'(?P<Number>\d*)?\s*(?P<Role>[^,]+),?\s*'
    out = df2.melt(['Location', 'Description'], var_name='Branch', value_name='Positions')
    out = out.join(out.pop('Positions').str.extractall(pat).fillna(1).droplevel(1))
    

    Output:

    >>> out
         Location Description    Branch Number                         Role
    0      Madrid          M3  Branch_A      1        Auditor or Auditor(S)
    0      Madrid          M3  Branch_A      1  Accountant or Accountant(S)
    0      Madrid          M3  Branch_A      1              PayRoll_Manager
    0      Madrid          M3  Branch_A      2                Brand_Manager
    0      Madrid          M3  Branch_A      3                   IT_Support
    ..        ...         ...       ...    ...                          ...
    21     Berlin        P5T8  Branch_C      1             Business_analyst
    21     Berlin        P5T8  Branch_C      1                    Developer
    21     Berlin        P5T8  Branch_C      1               Cyber_security
    22  Birminham              Branch_C      1                         ----
    23   Tanzania      FT7 M3  Branch_C    NaN                          NaN
    
    [88 rows x 5 columns]