Tags: sql, dataframe, correlation

How to calculate the correlation of the same column's values between different groups defined by another column


I have a dataframe containing historical stock data that looks like this:

stock_name price time_stamp
stock_a 35.6 2024-01-01
stock_a 34.7 2024-01-02
stock_b 56.7 2024-01-01
stock_b 55 2024-01-02

I have up to 500 stocks in the database, and I want to calculate the correlation between stock_a and every other stock (e.g. stock_b, stock_c, stock_d, ...) using their price values over the time interval.

I expect a result like:

Stock_A correlation

stock_name correlation
stock_b 0.87
stock_c 0.76

Right now I can only calculate the correlation by specifying each stock other than stock_a one at a time (PostgreSQL example):

select 'stock_b' as stock_name, corr(d1.close_price, d2.close_price) as correlation
from stock_dmt d1
inner join stock_dmt d2
on d1.time_stamp = d2.time_stamp
where d1.stock_code = 'stock_a' and d2.stock_code = 'stock_b';

However, doing that manually for over 500 stocks is impractical.

Is there an efficient way to complete this task? Either SQL or a Python script is welcome.

Thank you


Solution

  • Here is one way to do it in Python with pandas. I hope it answers your question.

    import pandas as pd
    
    # Creating a sample data
    data = {
        "stock_name": ["stock_a", "stock_a", "stock_b", "stock_b"],
        "price": [35.6, 34.7, 56.7, 55],
        "time_stamp": ["2024-01-01", "2024-01-02", "2024-01-01", "2024-01-02"]
    }
    
    # Create the DataFrame from the dictionary
    df = pd.DataFrame(data)
    
    # Pivot to wide format: one row per time_stamp, one column per stock
    df_pivot = df.pivot(index="time_stamp", columns="stock_name", values="price")
    
    # time_stamp is now the index, so it does not need to be dropped;
    # corr() returns the pairwise correlation matrix of all stock columns
    df_pivot.corr()
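
    If only the correlations of stock_a against the other stocks are needed (the two-column result shown in the question), the full matrix is more than is required. The following is a minimal sketch of one way to get that shape with `DataFrame.corrwith`. The helper name `correlation_with` and the three-stock sample prices are made up for illustration; it assumes each (time_stamp, stock_name) pair appears at most once, since `pivot` raises a `ValueError` on duplicates.

```python
import pandas as pd

# Hypothetical sample data: three stocks over four days (prices are made up)
df = pd.DataFrame({
    "stock_name": ["stock_a"] * 4 + ["stock_b"] * 4 + ["stock_c"] * 4,
    "price": [35.6, 34.7, 36.1, 35.9,
              56.7, 55.0, 57.3, 57.0,
              12.0, 12.4, 11.8, 11.9],
    "time_stamp": ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"] * 3,
})


def correlation_with(df, target):
    """Correlate the target stock's prices with every other stock's prices.

    Illustrative helper (not from the original answer). Returns a DataFrame
    with the columns stock_name and correlation.
    """
    # One row per time_stamp, one column per stock
    wide = df.pivot(index="time_stamp", columns="stock_name", values="price")
    # Every stock except the target
    others = wide.drop(columns=target)
    # Pearson correlation of each remaining column with the target column
    result = others.corrwith(wide[target])
    return result.rename("correlation").rename_axis("stock_name").reset_index()


result = correlation_with(df, "stock_a")
print(result)
```

    Dates on which a stock has no price become NaN after the pivot, and the correlation for that stock is then computed only over the dates it shares with stock_a.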