Search code examples
pythonpandaslambdapivot-table

Create multiple date columns from a date column, where row is not empty or null


Im trying to use the method pivot_table() to convert test analysis from rows to columns, I got a test analysis, a test result and a test date, I need to convert the test date to column and drop this column used in the conversion. This is the actual table result from a sql query in long format

lot_nbr batch_nbr sku variety factory_nbr code_nbr date_test test result inspector
01983244324 45A6545 764H max 3456 45099 10/02/22 physical_purity 0.87 jhon_martens
01983244324 45A6545 764H max 3456 45099 11/02/22 genetic_contamination 0.98 julian_doe
01983244324 45A6545 764H max 3456 45099 10/02/22 chemical_treatment_1 0.88 martha_doe
01983244324 45A6545 764H max 3456 45099 09/02/22 chemical_treatment_2 0.91 jhon_martens
01983244324 45A6545 764H max 3456 45099 10/02/22 chemical_treatment_3 0.93 julian_doe
01983244324 45A6545 764H max 3456 45099 11/02/22 total_x_content 0.77 martha_doe
01983244324 45A6545 764H max 3456 45099 10/02/22 total_y_content 0.75 jhon_martens
01983244324 45A6545 764H max 3456 45099 09/02/22 total_z_content 0.79 julian_doe
01983244324 45A6545 764H max 3456 45099 10/02/22 oc_count 1 martha_doe
01983244324 45A6545 764H max 3456 45099 11/02/22 tx_count 0.1 jhon_martens
01983244324 45A6545 764H max 3456 45099 10/02/22 optic_total 0.57 julian_doe
01983244324 45A6545 764H max 3456 45099 09/02/22 optic_total_1 0.68 martha_doe
01983244325 45A6546 56H optimum 3456 45099 10/02/22 physical_purity 0.87 jhon_martens
01983244325 45A6546 56H optimum 3456 45099 11/02/22 genetic_contamination 0.98 julian_doe
01983244325 45A6546 56H optimum 3456 45099 10/02/22 oc_count 1 martha_doe
01983244325 45A6546 56H optimum 3456 45099 11/02/22 tx_count 0.1 jhon_martens
01983244325 45A6546 56H optimum 3456 45099 10/02/22 optic_total 0.57 julian_doe
01983244325 45A6546 56H optimum 3456 45099 09/02/22 optic_total_1 0.68 martha_doe

Desired output:

lot_nbr batch_nbr sku variety factory_nbr code_nbr date_test physical_purity date_test genetic_contamination date_test chemical_treatment_1 date_test chemical_treatment_2 date_test chemical_treatment_3 date_test total_x_content date_test total_y_content date_test total_z_content date_test oc_count date_test tx_count date_test optic_total date_test optic_total_1
01983244324 45A6545 764H max 3456 45099 10/02/22 0.87 11/02/22 0.98 10/02/22 0.88 09/02/22 0.91 10/02/22 0.93 11/02/22 0.77 10/02/22 0.75 09/02/22 0.79 10/02/22 1 11/02/22 0.1 10/02/22 0.57 09/02/22 0.68
01983244325 45A6546 56H optimum 3456 45099 11/02/22 0.87 10/02/22 0.98 NA NA NA NA NA NA NA NA NA NA NA NA NA 1 11/02/22 0.1 10/02/22 0.57 09/02/22 0.68

what is the most efficient method to get this I was thinking on using a lambda after using pivot_table, this is the pivot_table code:

#convert date to datetime
df.date_test = pd.to_datetime(df['date_test'])
df1 = df.pivot_table(index =['lot_nbr', 'batch_nbr','sku', 'variety', 'factory_nbr',    'code_nbr','date_test'],columns=['test'], values=['result']).reset_index()
#flatten MultiIndex in Pandas
df1.columns = df1.columns.map('_'.join)
df1

I got this table:

lot_nbr_ batch_nbr_ sku_ variety_ factory_nbr_ code_nbr_ date_test_ result_chemical_treatment_1 result_chemical_treatment_2 result_chemical_treatment_3 result_genetic_contamination result_oc_count result_optic_total result_optic_total_1 result_physical_purity result_total_x_content result_total_y_content result_total_z_content result_tx_count
1983244324 45A6545 764H max 3456 45099 09/02/22 0.91 0.68 0.79
1983244324 45A6545 764H max 3456 45099 10/02/22 0.88 0.93 1 0.57 0.87 0.75
1983244324 45A6545 764H max 3456 45099 11/02/22 0.98 0.77 0.1
1983244325 45A6546 56H optimum 3456 45099 09/02/22 0.68
1983244325 45A6546 56H optimum 3456 45099 10/02/22 1 0.57 0.87
1983244325 45A6546 56H optimum 3456 45099 11/02/22 0.98 0.1

a lambda function will be the best option after this transformation to append the date column per each test? or pivot only test, test result and date, using lot_nbr as index and then create a date column containing this dates per each result. Then group by the lot_nbr,batch_nbr,sku, variety,factory_nbr,code_nbr, and merge this tables using lot_nbr? I Know I can subset the columns like:

columns_dates= df1[['physical_purity','genetic_contamination','chemical_treatment_1',   'chemical_treatment_2','chemical_treatment_3','total_x_content','total_y_content',  'total_z_content','oc_count','tx_count','optic_total','optic_total_1']]

Then apply lambda or function to this columns.

I attach the initial Dataframe in dict format:

{'lot_nbr': {0: 1983244324,
  1: 1983244324,
  2: 1983244324,
  3: 1983244324,
  4: 1983244324,
  5: 1983244324,
  6: 1983244324,
  7: 1983244324,
  8: 1983244324,
  9: 1983244324,
  10: 1983244324,
  11: 1983244324,
  12: 1983244325,
  13: 1983244325,
  14: 1983244325,
  15: 1983244325,
  16: 1983244325,
  17: 1983244325},
 'batch_nbr': {0: '45A6545',
  1: '45A6545',
  2: '45A6545',
  3: '45A6545',
  4: '45A6545',
  5: '45A6545',
  6: '45A6545',
  7: '45A6545',
  8: '45A6545',
  9: '45A6545',
  10: '45A6545',
  11: '45A6545',
  12: '45A6546',
  13: '45A6546',
  14: '45A6546',
  15: '45A6546',
  16: '45A6546',
  17: '45A6546'},
 'sku': {0: '764H',
  1: '764H',
  2: '764H',
  3: '764H',
  4: '764H',
  5: '764H',
  6: '764H',
  7: '764H',
  8: '764H',
  9: '764H',
  10: '764H',
  11: '764H',
  12: '56H',
  13: '56H',
  14: '56H',
  15: '56H',
  16: '56H',
  17: '56H'},
 'variety': {0: 'max',
  1: 'max',
  2: 'max',
  3: 'max',
  4: 'max',
  5: 'max',
  6: 'max',
  7: 'max',
  8: 'max',
  9: 'max',
  10: 'max',
  11: 'max',
  12: 'optimum',
  13: 'optimum',
  14: 'optimum',
  15: 'optimum',
  16: 'optimum',
  17: 'optimum'},
 'factory_nbr': {0: 3456,
  1: 3456,
  2: 3456,
  3: 3456,
  4: 3456,
  5: 3456,
  6: 3456,
  7: 3456,
  8: 3456,
  9: 3456,
  10: 3456,
  11: 3456,
  12: 3456,
  13: 3456,
  14: 3456,
  15: 3456,
  16: 3456,
  17: 3456},
 'code_nbr': {0: 45099,
  1: 45099,
  2: 45099,
  3: 45099,
  4: 45099,
  5: 45099,
  6: 45099,
  7: 45099,
  8: 45099,
  9: 45099,
  10: 45099,
  11: 45099,
  12: 45099,
  13: 45099,
  14: 45099,
  15: 45099,
  16: 45099,
  17: 45099},
 'date_test': {0: Timestamp('2022-02-10 00:00:00'),
  1: Timestamp('2022-02-11 00:00:00'),
  2: Timestamp('2022-02-10 00:00:00'),
  3: Timestamp('2022-02-09 00:00:00'),
  4: Timestamp('2022-02-10 00:00:00'),
  5: Timestamp('2022-02-11 00:00:00'),
  6: Timestamp('2022-02-10 00:00:00'),
  7: Timestamp('2022-02-09 00:00:00'),
  8: Timestamp('2022-02-10 00:00:00'),
  9: Timestamp('2022-02-11 00:00:00'),
  10: Timestamp('2022-02-10 00:00:00'),
  11: Timestamp('2022-02-09 00:00:00'),
  12: Timestamp('2022-02-10 00:00:00'),
  13: Timestamp('2022-02-11 00:00:00'),
  14: Timestamp('2022-02-10 00:00:00'),
  15: Timestamp('2022-02-11 00:00:00'),
  16: Timestamp('2022-02-10 00:00:00'),
  17: Timestamp('2022-02-09 00:00:00')},
 'test': {0: 'physical_purity',
  1: 'genetic_contamination',
  2: 'chemical_treatment_1',
  3: 'chemical_treatment_2',
  4: 'chemical_treatment_3',
  5: 'total_x_content',
  6: 'total_y_content',
  7: 'total_z_content',
  8: 'oc_count',
  9: 'tx_count',
  10: 'optic_total',
  11: 'optic_total_1',
  12: 'physical_purity',
  13: 'genetic_contamination',
  14: 'oc_count',
  15: 'tx_count',
  16: 'optic_total',
  17: 'optic_total_1'},
 'result': {0: 0.87,
  1: 0.98,
  2: 0.88,
  3: 0.91,
  4: 0.93,
  5: 0.77,
  6: 0.75,
  7: 0.79,
  8: 1.0,
  9: 0.1,
  10: 0.57,
  11: 0.68,
  12: 0.87,
  13: 0.98,
  14: 1.0,
  15: 0.1,
  16: 0.57,
  17: 0.68},
 'inspector': {0: 'jhon_martens',
  1: 'julian_doe',
  2: 'martha_doe',
  3: 'jhon_martens',
  4: 'julian_doe',
  5: 'martha_doe',
  6: 'jhon_martens',
  7: 'julian_doe',
  8: 'martha_doe',
  9: 'jhon_martens',
  10: 'julian_doe',
  11: 'martha_doe',
  12: 'jhon_martens',
  13: 'julian_doe',
  14: 'martha_doe',
  15: 'jhon_martens',
  16: 'julian_doe',
  17: 'martha_doe'}} 

Solution

  • So, I think I've managed to achieve the desired result. You can use df.pivot. This is in fact the easy part:

    import pandas as pd
    from pandas import Timestamp
    import re
    
    # importing your dict
    df = pd.DataFrame(data)
    
    df_pivot = df.pivot(
        index=['lot_nbr','batch_nbr','sku','variety','factory_nbr','code_nbr'], 
        columns='test', 
        values=['date_test','result']
        )
    
    df_pivot.reset_index(drop=False, inplace=True)
    
    cols = [re.sub(r'_$','',x) for x in list(df_pivot.columns.map('_'.join))]
    df_pivot.columns = cols
    

    Unfortunately, pivot does not maintain order in the way you want it to be, so we need to do a little reordering:

    # remain the same
    first_cols = cols[:6]
    
    # reorder 'date_test_|result_'-cols based on order in df['test']
    date_cols = [f'date_test_{i}' for i in df.test.unique()]
    test_cols = [f'result_{i}' for i in df.test.unique()]
    
    # now group 'date_test_|result_'-pairs
    comb_cols = list()
    for i,j in zip(date_cols,test_cols):
        comb_cols.extend([i,j])
    
    # rebuild entire df_pivot.columns in correct order
    all_cols = list()
    all_cols.extend(first_cols+comb_cols)
    
    # assign df_pivot to a .loc-selection on those cols
    df_pivot = df_pivot.loc[:, all_cols]
    

    The col names are rather long now. You can of course make them a bit shorter, e.g. by replacing 'date_test_' with 'dt_' or something, but I trust you get the idea there. Result:

    with pd.option_context('display.max_rows', None, 'display.max_columns', None):
        print(df_pivot)
    
          lot_nbr batch_nbr   sku  variety  factory_nbr  code_nbr  \
    0  1983244324   45A6545  764H      max         3456     45099   
    1  1983244325   45A6546   56H  optimum         3456     45099   
    
      date_test_physical_purity result_physical_purity  \
    0                2022-02-10                   0.87   
    1                2022-02-10                   0.87   
    
      date_test_genetic_contamination result_genetic_contamination  \
    0                      2022-02-11                         0.98   
    1                      2022-02-11                         0.98   
    
      date_test_chemical_treatment_1 result_chemical_treatment_1  \
    0                     2022-02-10                        0.88   
    1                            NaT                         NaN   
    
      date_test_chemical_treatment_2 result_chemical_treatment_2  \
    0                     2022-02-09                        0.91   
    1                            NaT                         NaN   
    
      date_test_chemical_treatment_3 result_chemical_treatment_3  \
    0                     2022-02-10                        0.93   
    1                            NaT                         NaN   
    
      date_test_total_x_content result_total_x_content date_test_total_y_content  \
    0                2022-02-11                   0.77                2022-02-10   
    1                       NaT                    NaN                       NaT   
    
      result_total_y_content date_test_total_z_content result_total_z_content  \
    0                   0.75                2022-02-09                   0.79   
    1                    NaN                       NaT                    NaN   
    
      date_test_oc_count result_oc_count date_test_tx_count result_tx_count  \
    0         2022-02-10             1.0         2022-02-11             0.1   
    1         2022-02-10             1.0         2022-02-11             0.1   
    
      date_test_optic_total result_optic_total date_test_optic_total_1  \
    0            2022-02-10               0.57              2022-02-09   
    1            2022-02-10               0.57              2022-02-09   
    
      result_optic_total_1  
    0                 0.68  
    1                 0.68  
    

    N.B. While comparing with your 'desired output', I noted a couple of differences. These seem to be mistakes in your version. E.g. your 2nd 'date_test' column has 12/02/22 for 2nd row, a date that does not actually occur in the original df.