python pandas scikit-learn one-hot-encoding redundancy

Fixed redundancy in one-hot encoding but still having error when trying to invert it


I'm doing one-hot encoding and using the normal equation $\hat{\theta} = (X^T X)^{-1} X^T y$ to estimate theta. I was getting an error because some of the encoded columns were redundant, so I decided to drop those columns.

This is prior to dropping columns:

[image]

This is my code, in which I try to drop the columns that have redundancies:

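import pandas as pd
from sklearn.feature_extraction import DictVectorizer

def one_hot_encode_revised(data):
    all_columns = data.columns

    # One-hot encode every column by passing row dicts to DictVectorizer
    records = data[all_columns].to_dict(orient='records')
    encoder = DictVectorizer(sparse=False)
    encoded_X = encoder.fit_transform(records)
    df = pd.DataFrame(data=encoded_X, columns=encoder.feature_names_)

    # Drop one category per categorical feature to remove the redundancy
    return df.drop(['day=Fri', 'sex=Male', 'smoker=No', 'time=Dinner'], axis=1)

one_hot_X_revised = one_hot_encode_revised(X)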

which outputs this: [image]

Then, I use this function to estimate theta from the above equation:

def get_analytical_sol(X, y):
    """
    Computes the analytical solution to our least squares problem

    Parameters
    -----------
    X: a 2D dataframe of numeric features (one-hot encoded)
    y: a 1D vector of tip amounts

    Returns
    -----------
    The estimate for theta
    """
    return np.linalg.inv(X.T * X) * (X.T * y)

I run it like this:

revised_analytical_thetas = get_analytical_sol(one_hot_X_revised, tips)

My error is: ValueError: Unable to coerce to DataFrame, shape must be (8, 244): given (252, 252)

For reference, tips is this:

[image]

Did I get rid of the redundancies correctly, and if so, why do I still get the error?

Thanks!


Solution

  • The error is in this line: return np.linalg.inv(X.T * X) * (X.T * y). What you want is matrix multiplication, but on pandas DataFrames the * operator multiplies element by element after aligning row and column labels; it does not perform matrix multiplication. Use the @ operator or the DataFrame's dot() method instead.
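
As a minimal sketch of that fix, the function from the question can be rewritten with @. The shapes in the error message are consistent with label alignment: X.T * X appears to align the 8 feature names with the 244 row labels, giving a 252 x 252 frame rather than an 8 x 8 product. The version below converts the inputs to NumPy arrays first, which is one way (an assumption here, not the only option) to keep labels out of the computation. The names X_demo, y_demo and theta_demo and their values are made up for illustration and are not the tips data.

```python
import numpy as np
import pandas as pd


def get_analytical_sol(X, y):
    """Least-squares estimate of theta from the normal equation."""
    # Work on plain NumPy arrays so no pandas label alignment takes place.
    X_arr = np.asarray(X, dtype=float)
    y_arr = np.asarray(y, dtype=float)
    # @ is matrix multiplication; * would multiply element by element.
    return np.linalg.inv(X_arr.T @ X_arr) @ (X_arr.T @ y_arr)


# Hypothetical stand-in for the one-hot encoded features and the tips.
X_demo = pd.DataFrame({
    "intercept": [1.0, 1.0, 1.0, 1.0],
    "size": [1.0, 2.0, 3.0, 4.0],
    "sex=Female": [0.0, 1.0, 0.0, 1.0],
})
# Built as 1*intercept + 2*size + 1*sex=Female, so theta should be [1, 2, 1].
y_demo = pd.Series([3.0, 6.0, 7.0, 10.0])

theta_demo = get_analytical_sol(X_demo, y_demo)
print(theta_demo)
```

If inverting the matrix explicitly turns out to be numerically fragile, np.linalg.solve(X_arr.T @ X_arr, X_arr.T @ y_arr) computes the same estimate without forming the inverse.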