Search code examples
python-3.xpandaspandasql

How to compare one value against multiple values in Python with Pandas library?


Currently, I have the following scenario.

Excel Data Frame =            SQL Data Frame =
________                          ________ _______ ___________ _________
|sector|                          |sector| | hour| | value_cs| value_ps|
--------                          -------- ------- ----------- ---------
 AXYZ                              AXYZ      0        78.90      87.10
 BYYT                              RACH      0        87.12      13.90
 IOPL                              IOPL      0        93.10      13.87
 XFTR                              AXYZ      1        27.90      12.87
 MANU                              IOPL      1        23.09      90.09
                                   FRES      2        34.09      12.34
                                   YYYT      2        12.43      32.98
                                   REWT      3        98.09      99.99

I have one Excel file and a set of SQL results and I want to compare each value of the Sector column from the Excel file against all the values from the Sector column in the SQL results, in consequence, if there is a match between the values of these two columns then add the columns hour, value_cs and value_ps from the SQL results into new data frames. Notice: The data of the SQL results are not the same size as the data of the Excel file.

Desired results

 New data frame 1 for value cs
  ________ ____    ___    ___    ___    ___    ___    ___        ____                     
  |sector|  |0|    |1|    |2|    |3|    |4|    |5|    |6|   .... |23|
  -------- ----    ---    ----   ---    ---    ---    ----       ----                            
   AXYZ    78.90   27.90  78.89  54.90  98.23  85.0   45.90      68.23
   BYYT    18.94   67.10  65.69  76.32  76.56  56.03  56.23      87.65
   IOPL    93.10   23.09  34.29  97.34  34.34  14.54  34.91      23.21
   ...      ...

 New data frame 2 for value ps
  ________ ____    ___    ___    ___    ___    ___    ___        ____                     
  |sector|  |0|    |1|    |2|    |3|    |4|    |5|    |6|   .... |23|
  -------- ----    ---    ----   ---    ---    ---    ----       ----                            
   AXYZ    87.10   12.87  49.89  84.90  76.23  15.01  12.90      68.23
   BYYT    28.43   27.11  54.69  57.12  19.56  45.12  45.23      47.15
   IOPL    13.87   90.09  24.19  47.34  18.34  21.54  67.11      13.61
   ...      ...

The approach I followed was to convert the SQL results into a data frame as well as the data from the Excel file, but I do not know how to perform the comparison without a for loop, but only using Pandas (the for loop would take too much time to perform the calculations).

import pandas as pd
import pypyodbc
from datetime import date

def get_and_compare():

    start_date = date.today()

    retrieve_values = "[DEV].[CS].[QA_Export] @start_date='{start_date:%Y-%m-%d}'".format(start_date=start_date)

    # Connect to the database
    db_connection = pypyodbc.connect(driver="{SQL Server}", server="xxx.xxx.xxx.xxx", uid="xxx",
                                         pwd="xxx", Trusted_Connection="No")

    # Get the sql result into dataframe
    data_frame_sql = pd.read_sql(retrieve_values,db_connection)


    #declare new data frames
    new_df_one = pd.DataFrame(columns=['sector', 'value cs', 'hour 0', 'hour 1', 'hour 2', 'hour 3', 'hour 4',
                                   'hour 5', 'hour 6', 'hour 7', 'hour 8', 'hour 9', 'hour 10', 'hour 11',
                                   'hour 12', 'hour 13', 'hour 14', 'hour 15', 'hour 16', 'hour 17', 'hour 18',
                                   'hour 19', 'hour 20', 'hour 21', 'hour 22', 'hour 23'])

    new_df_two = pd.DataFrame(columns=['sector', 'value ps', 'hour 0', 'hour 1', 'hour 2', 'hour 3', 'hour 4',
                                   'hour 5', 'hour 6', 'hour 7', 'hour 8', 'hour 9', 'hour 10', 'hour 11',
                                   'hour 12', 'hour 13', 'hour 14', 'hour 15', 'hour 16', 'hour 17', 'hour 18',
                                   'hour 19', 'hour 20', 'hour 21', 'hour 22', 'hour 23'])


    # Read the Excel file
    current_wb = pd.ExcelFile \
    ("C:\\U\\dev\\testing\\Main values to compare.xlsx")

    # Get the specific sheet to compare
    working_values = current_wb.parse("Main values")

    #Get the column from Excel
    sector_from_excel  = working_values['sector']

    #Comparison to perform
    #.... unknown part

All suggestions, comments will be appreciated to help me to complete this part of the code.


Solution

  • Try this:

    def get_and_compare():
    
        start_date = date.today()
    
        retrieve_values = "[DEV].[CS].[QA_Export] @start_date='{start_date:%Y-%m-%d}'".format(start_date=start_date)
    
        # Connect to the database
        db_connection = pypyodbc.connect(driver="{SQL Server}", server="xxx.xxx.xxx.xxx", uid="xxx",
                                             pwd="xxx", Trusted_Connection="No")
    
        # Get the sql result into dataframe
        data_frame_sql = pd.read_sql(retrieve_values,db_connection)
    
    
        # Read the Excel file
        current_wb = pd.ExcelFile \
        ("C:\\U\\dev\\testing\\Main values to compare.xlsx")
    
        # Get the specific sheet to compare
        working_values = current_wb.parse("Main values")
    
        #Get the column from Excel
        sector_from_excel  = working_values['sector']
    
        # perform inner join between DataFrames
        # note: this requires that "sector" is a column (and not an index)
        # in both DataFrames, and that it is also named as "sector" in each
        merged_df = data_frame_sql.merge(sector_from_excel, how="inner", on="sector")
    
        # use "pivot" to reshape data from wide to long
        # first with value_cs
        cs_value_df = merged_df.pivot(index="sector", columns="hour", values="value_cs")
    
        # and then with value_ps
        ps_value_df = merged_df.pivot(index="sector", columns="hour", values="value_ps")
    
        # I'd suggest returning both DataFrames in a single object;
        # in this case I'm using a dict
        return {"value cs": cs_value_df, "value ps": ps_value_df}
    

    For what it's worth, I'd recommend splitting this function into multiple functions, perhaps one to generate your SQL query, one to read your Excel file, and one to perform the Pandas operations. It's not a good practice to write so many actions into one function -- it will be tedious to debug, if it's ever necessary to do so.