python pandas scikit-learn logistic-regression crosstab

Logistic Regression Model (binary) crosstab error = shape of passed values issue

I am trying to run logistic regression on a data set. I dummy-encoded my categorical variables, normalized my continuous variables, and filled null values with -1 (which works for my dataset). I get no errors until I run my crosstab, where pandas complains about the shape of the values passed. I get the same error for logistic regression with and without CV. My code is below. I left out the encoding because it does not seem to be the issue, and I left out the code for logistic regression without CV because it is essentially identical apart from the CV.

# read in the df w/ encoded variables
allyrs=pd.read_csv("C:/Users/cyrra/OneDrive/Documents/Pythonread/HDS805/CS1W1/modelready_working.csv")

# Find locations of where I need to trim the data down selecting only the encoded variables
allyrs.columns.get_loc("BMI_C__-1.0")
23
allyrs.columns.get_loc("N_BMIR")
152

# Finding the location of the Y col
allyrs.columns.get_loc("CM")
23

#create new X and y for binary LR
y_bi = allyrs[["CM"]]
X_bi = allyrs.iloc[0:1305720, 23:152]

I then checked the lengths of both variables and confirmed that all the expected columns were in the X set. The shapes are as follows: y_bi = 1305720 rows × 1 column, X_bi = 1305720 rows × 129 columns.

# Create test/train for bi column
from sklearn.model_selection import train_test_split
Xbi_train, Xbi_test, ybi_train, ybi_test = train_test_split(X_bi, y_bi,
                                                    train_size=0.8,test_size = 0.2)

Again I checked the sizes of Xbi_train and ybi_train: Xbi_train = 1044576 rows × 129 columns, ybi_train = 1044576 rows × 1 column.

# LRw/CV for the binary col
from sklearn.linear_model import LogisticRegressionCV
logitbi_cv = LogisticRegressionCV(cv=2, random_state=0).fit(Xbi_train, ybi_train)

# Set predicted (checking to see if its an array)
logitbi_cv.predict(Xbi_train)
array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

# Set predicted to its own variable 
pred_logitbi_cv = logitbi_cv.predict(Xbi_train)

# Cross tab LR w/0ut
from sklearn.metrics import confusion_matrix
ct_bi_cv=pd.crosstab(ybi_train, pred_logitbi_cv)

The error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in create_block_manager_from_arrays(arrays, names, axes)
   1701         blocks = _form_blocks(arrays, names, axes)
-> 1702         mgr = BlockManager(blocks, axes)
   1703         mgr._consolidate_inplace()

~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in __init__(self, blocks, axes, do_integrity_check)
    142         if do_integrity_check:
--> 143             self._verify_integrity()
    144 

~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in _verify_integrity(self)
    322             if block.shape[1:] != mgr_shape[1:]:
--> 323                 raise construction_error(tot_items, block.shape[1:], self.axes)
    324         if len(self.items) != tot_items:

ValueError: Shape of passed values is (1, 2), indices imply (1044576, 2)

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-121-c669b17c171f> in <module>
      1 # LR W/ CV
      2 # Cross tab LR w/0ut
----> 3 ct_bi_cv=pd.crosstab(ybi_train, pred_logitbi_cv)

~\anaconda3\lib\site-packages\pandas\core\reshape\pivot.py in crosstab(index, columns, values, rownames, colnames, aggfunc, margins, margins_name, dropna, normalize)
    596         **dict(zip(unique_colnames, columns)),
    597     }
--> 598     df = DataFrame(data, index=common_idx)
    599     original_df_cols = df.columns
    600 

~\anaconda3\lib\site-packages\pandas\core\frame.py in __init__(self, data, index, columns, dtype, copy)
    527 
    528         elif isinstance(data, dict):
--> 529             mgr = init_dict(data, index, columns, dtype=dtype)
    530         elif isinstance(data, ma.MaskedArray):
    531             import numpy.ma.mrecords as mrecords

~\anaconda3\lib\site-packages\pandas\core\internals\construction.py in init_dict(data, index, columns, dtype)
    285             arr if not is_datetime64tz_dtype(arr) else arr.copy() for arr in arrays
    286         ]
--> 287     return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
    288 
    289 

~\anaconda3\lib\site-packages\pandas\core\internals\construction.py in arrays_to_mgr(arrays, arr_names, index, columns, dtype, verify_integrity)
     93     axes = [columns, index]
     94 
---> 95     return create_block_manager_from_arrays(arrays, arr_names, axes)
     96 
     97 

~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in create_block_manager_from_arrays(arrays, names, axes)
   1704         return mgr
   1705     except ValueError as e:
-> 1706         raise construction_error(len(arrays), arrays[0].shape, axes, e)
   1707 
   1708 

ValueError: Shape of passed values is (1, 2), indices imply (1044576, 2)

I realize this is saying that the number of rows being passed in to the crosstab doesn't match, but can someone tell me why this is happening or where I am going wrong? I am copying the example code exactly as it was provided in the book I am working from, just with my own data.

Thank you so much!

Solution

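Your target variable should have shape (n,), not (n, 1), which is what you get when you call y_bi = allyrs[["CM"]]. See the relevant help page. scikit-learn still accepts the column vector for the fit (it typically emits a DataConversionWarning about it, which I guess was missed somehow), but pd.crosstab expects 1-D inputs and fails on the single-column DataFrame.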
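To make the shape difference concrete, here is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration and are not taken from the question). It also shows two ways to flatten a single-column DataFrame you already have.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for allyrs
toy = pd.DataFrame({"x1": [0, 1, 1, 0], "CM": [0, 1, 0, 1]})

y_2d = toy[["CM"]]  # double brackets -> DataFrame with shape (4, 1)
y_1d = toy["CM"]    # single brackets -> Series with shape (4,)

print(type(y_2d).__name__, y_2d.shape)
print(type(y_1d).__name__, y_1d.shape)

# Two ways to flatten an existing single-column DataFrame
y_from_squeeze = y_2d.squeeze("columns")  # Series, keeps the original index
y_from_ravel = y_2d.to_numpy().ravel()    # plain NumPy array, drops the index
```

Either flattened form can be passed to train_test_split and fit in place of the (n, 1) DataFrame.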

If you instead call y_bi = allyrs["CM"], the crosstab works. For example, with some dummy data:

import numpy as np
import pandas as pd

np.random.seed(111)
allyrs = pd.DataFrame(np.random.binomial(1,0.5,(100,4)),columns=['x1','x2','x3','CM'])
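# Note: columns 0-3 include the target "CM" itself, hence the perfect table below;
# use allyrs.iloc[:, :3] to keep only x1..x3 as predictors
X_bi = allyrs.iloc[:,:4]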
y_bi = allyrs["CM"]

Then run the train test split followed by the fit:

from sklearn.model_selection import train_test_split
Xbi_train, Xbi_test, ybi_train, ybi_test = train_test_split(X_bi, y_bi,
                                                    train_size=0.8,test_size = 0.2)

from sklearn.linear_model import LogisticRegressionCV
logitbi_cv = LogisticRegressionCV(cv=2, random_state=0).fit(Xbi_train, ybi_train)

pred_logitbi_cv = logitbi_cv.predict(Xbi_train)
pd.crosstab(ybi_train, pred_logitbi_cv)

col_0   0   1
CM           
0      39   0
1       0  41
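If the split has already been run with the (n, 1) DataFrame and you would rather not redo it, a possible alternative is to select the column at the crosstab call. The sketch below uses hypothetical stand-in data (the labels, the shuffled index and the `pred` array are invented for illustration), so the counts say nothing about the real model.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: a single-column DataFrame of labels, as produced by
# allyrs[["CM"]] after a shuffled split, and an array of predictions
ybi_train = pd.DataFrame({"CM": [0, 1, 1, 0, 1]}, index=[7, 2, 9, 4, 0])
pred = np.array([0, 1, 0, 0, 1])

# Selecting the column hands crosstab a 1-D Series instead of a DataFrame
ct = pd.crosstab(ybi_train["CM"], pred,
                 rownames=["actual"], colnames=["predicted"])
print(ct)
```

Given 1-D labels, sklearn.metrics.confusion_matrix(y_true, y_pred), which the question imports but never uses, should report the same counts as a plain array.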