Search code examples
pythonpandasdataframestock

Extract column names from unstacked correlation matrix


I have a dataset with stock price history that I have run a correlation matrix on in pandas. I have unstacked the correlation matrix and filtered it to remove duplicates/redundant values and sorted it like so.

corr_pairs = corr_matrix.unstack().drop_duplicates().sort_values(kind='quicksort')
corr_pairs = corr_pairs[corr_pairs < 1]
print('Sorted Pairs: ')
print(corr_pairs)


Sorted Pairs: 
tsla   meta     0.709581
googl  meta     0.816701
tsla   googl    0.916947

Now I wish to extract the names of the stocks that had the highest correlation in the set and assign them to variables such as stock1 and stock2. If I do corr_pairs[-1] this will produce the value for the highest correlation, I need the names of the stocks that have the highest correlation, in this case, tsla and googl.

Thank you to anyone who knows what to do! Appreciate the help!


Solution

  • Your "corr_pairs" will in fact be a pd.Series with a pd.MultiIndex:

    import pandas as pd
    
    d = {('tsla', 'meta'): 0.709581,
     ('googl', 'meta'): 0.816701,
     ('tsla', 'googl'): 0.916947}
    
    corr_pairs = pd.Series(d)
    
    print(corr_pairs.index)
    
    MultiIndex([( 'tsla',  'meta'),
                ('googl',  'meta'),
                ( 'tsla', 'googl')],
               )
    

    So, the MultiIndex is basically an array with tuples. We can access the tuple for the max value by using pd.Series.idxmax, and then simply assign the result to two variables:

    stock1, stock2 = corr_pairs.idxmax()
    
    print(f'stock1 = {stock1}, stock2 = {stock2}')
    # stock1 = tsla, stock2 = googl