Tags: python, machine-learning, scikit-learn, snowflake-cloud-data-platform, logistic-regression

Why am I getting a TypeError in logistic regression using Python on Snowflake?


I am creating a logistic regression model on Snowflake using Python. I did the same logistic regression in R locally, but I want to move it to my Snowflake data warehouse. I'm having some success, but I'm not nearly as familiar with Python as I am with R.

I believe that the regression is fitting and giving a model. I don't really know what the predicted probabilities look like, but that is genuinely a secondary concern at this point.

I just want to return a snowflake DataFrame from a pandas DataFrame. I can't get it to happen.

Below is a snippet of my code.

import snowflake.snowpark as snowpark
import snowflake.snowpark.functions as F
from sklearn.linear_model import LogisticRegression
from snowflake.snowpark.functions import col
import pandas as pd

def main(session: snowpark.Session): 
#
# EVERYTHING BEFORE WHAT'S BELOW IS DATA TRANSFORMATION, ALL OF IT WORKS JUST FINE
# AS FAR AS I KNOW

# ind_cols and dep_cols are arrays of column names 
# defining which columns are independent variables and which are dependent.
# Here I split the sample into independent and dependent columns, 
# and use LogisticRegression from scikit-learn.

    X = full_sample[ind_cols].to_pandas()
    y = full_sample[dep_col].to_pandas()

# ret_df is the snowflake DataFrame that I'm interested in predicting probabilities for.
    ret_df_lm = ret_df[ind_cols].to_pandas()

    lm = LogisticRegression()

    lm.fit(X, y)

    y_pred = lm.predict_proba(ret_df_lm)

    y_final = session.table(y_pred)

    #retention_pred = lm.predict(ret_df)

    return y_final

When I try to return y_final I get the error: TypeError: sequence item 0: expected str instance, numpy.ndarray found. I've got to be missing something. I've tried other things, like Snowflake's session.write_pandas(), but I'm not sure it's what I need.

How do I get y_final to be a snowflake DataFrame?


Solution

  • I fixed your code with the following observations:

    • I had to generate random data, because I don't have access to your tables.
    • The original error came from session.table(y_pred): session.table expects a table name (a string), but y_pred is the NumPy array returned by predict_proba.
    • To return a Snowpark DataFrame, wrap the predictions in a pandas DataFrame and convert it: return session.create_dataframe(y_final).
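The exact error message can be reproduced without Snowflake. This is a minimal sketch, assuming that session.table joins the parts of a non-string name with "." (that matches the wording of the error, but I have not confirmed it against the Snowpark source). The y_pred array here is a made-up stand-in for the output of predict_proba.

```python
import numpy as np

# Made-up stand-in for the array returned by lm.predict_proba(...)
y_pred = np.array([[0.8, 0.2], [0.3, 0.7]])

try:
    # Assumption: roughly what happens when a non-string "table name" is joined
    ".".join(y_pred)
    message = None
except TypeError as exc:
    message = str(exc)

print(message)
```

Iterating over a 2-D array yields its rows, which are arrays rather than strings, so the join fails on the first item.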
    # The Snowpark package is required for Python Worksheets. 
    # You can add more packages by selecting them using the Packages control and then importing them.
    
    import snowflake.snowpark as snowpark
    import snowflake.snowpark.functions as F
    from sklearn.linear_model import LogisticRegression
    from snowflake.snowpark.functions import col
    import pandas as pd
    import numpy as np
    
    
    def main(session: snowpark.Session): 
        #X = full_sample[ind_cols].to_pandas()
        #y = full_sample[dep_col].to_pandas()
    
        # Number of samples and features
        n_samples = 100  # for example, 100 samples
        n_features = 5   # for example, 5 features
        
        # Generate random data for X
        np.random.seed(0)  # for reproducibility
        X_data = np.random.rand(n_samples, n_features)
        X = pd.DataFrame(X_data, columns=[f'feature_{i}' for i in range(n_features)])
        
        # Generate random binary data for y (a 1-D Series, which is the shape fit() expects)
        y_data = np.random.randint(2, size=n_samples)
        y = pd.Series(y_data, name='target')
    
        # In the original code, ret_df is the Snowflake DataFrame to predict probabilities for;
        # random pandas data stands in for it here.
        # ret_df_lm = ret_df[ind_cols].to_pandas()
        ret_df_data = np.random.rand(n_samples, n_features)
        ret_df = pd.DataFrame(ret_df_data, columns=[f'feature_{i}' for i in range(n_features)])
    
        lm = LogisticRegression()
    
        lm.fit(X, y)
    
        y_pred = lm.predict_proba(ret_df)
    
        # y_final = session.table(y_pred)
    
        #retention_pred = lm.predict(ret_df)
        # predict_proba returns a NumPy array; wrap it in a pandas DataFrame with named columns
        y_final = pd.DataFrame(y_pred, columns=['Prob_0', 'Prob_1'])
    
        # return a Snowpark DataFrame instead of a Pandas one
        return session.create_dataframe(y_final)
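With the real full_sample and ret_df tables instead of random data, the only new step is the conversion at the end. The sketch below isolates that step. The helper names proba_to_pandas and to_snowpark and the PROB_ prefix are hypothetical, and fake_pred is made-up data standing in for lm.predict_proba(ret_df_lm). Passing lm.classes_ as classes should keep the column names in line with the model, since predict_proba orders its columns the same way as classes_.

```python
import numpy as np
import pandas as pd


def proba_to_pandas(y_pred, classes, prefix="PROB_"):
    """Wrap a predict_proba-style array in a pandas DataFrame, one column per class."""
    columns = [f"{prefix}{c}" for c in classes]
    return pd.DataFrame(np.asarray(y_pred), columns=columns)


def to_snowpark(session, y_pred, classes):
    """Convert predictions to a Snowpark DataFrame.

    Assumption: session is a snowflake.snowpark.Session, as passed to main().
    """
    return session.create_dataframe(proba_to_pandas(y_pred, classes))


# Made-up stand-in for lm.predict_proba(ret_df_lm); classes would be lm.classes_
fake_pred = np.array([[0.8, 0.2], [0.3, 0.7]])
frame = proba_to_pandas(fake_pred, classes=[0, 1])
print(list(frame.columns))
```

Inside main, the last line would then be return to_snowpark(session, y_pred, lm.classes_).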