python machine-learning scikit-learn snowflake-cloud-data-platform logistic-regression

Why do I get a TypeError when returning logistic regression predictions as a DataFrame in Python on Snowflake?

I am creating a logistic regression model on Snowflake using Python. I built the same logistic regression locally in R, but want to move it to my Snowflake data warehouse. I'm having some success, but I'm not nearly as familiar with Python as I am with R.

I believe that the regression is fitting and giving a model. I don't really know what the predicted probabilities look like, but that is genuinely a secondary concern at this point.

All I want is to turn the pandas output into a Snowflake DataFrame and return it, but I can't get that to work.

Below is a snippet of my code.

import snowflake.snowpark as snowpark
import snowflake.snowpark.functions as F
from sklearn.linear_model import LogisticRegression
from snowflake.snowpark.functions import col
import pandas as pd

def main(session: snowpark.Session):
    #
    # EVERYTHING BEFORE WHAT'S BELOW IS DATA TRANSFORMATION, ALL OF IT WORKS JUST FINE
    # AS FAR AS I KNOW

    # ind_cols and dep_col hold the column names
    # defining which columns are independent variables and which are dependent.
    # Here I split the sample into independent and dependent columns,
    # and use LogisticRegression from scikit-learn.

    X = full_sample[ind_cols].to_pandas()
    y = full_sample[dep_col].to_pandas()

    # ret_df is the Snowflake DataFrame that I'm interested in predicting probabilities for.
    ret_df_lm = ret_df[ind_cols].to_pandas()

    lm = LogisticRegression()

    lm.fit(X, y)

    y_pred = lm.predict_proba(ret_df_lm)

    y_final = session.table(y_pred)

    #retention_pred = lm.predict(ret_df)

    return y_final

When I try to return y_final I get this error: TypeError: sequence item 0: expected str instance, numpy.ndarray found. I must be missing something. I've tried other things, like Snowflake's session.write_pandas(), but I'm not sure it's what I need.

How do I get y_final to be a snowflake DataFrame?

Solution

I fixed your code with the following observations:

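I had to generate random data, since I don't have access to your tables.
The original error came from session.table(y_pred): session.table() expects a table name (a string), not the NumPy array returned by predict_proba().
To return a Snowpark DataFrame, first wrap the predictions in a pandas DataFrame and then convert it: return session.create_dataframe(y_final).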
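
The session.table(y_pred) error can be reproduced without Snowflake. The message has exactly the shape that Python's str.join produces when it is handed non-string items, which suggests (this is my assumption, not something I checked in the Snowpark source) that the name parts passed to session.table() get joined with "." somewhere along the way. The join_name_parts helper and the fake_probabilities data below are hypothetical stand-ins for illustration only:

```python
# Hypothetical stand-in for predict_proba output: rows of floats, not strings.
fake_probabilities = [[0.4, 0.6], [0.7, 0.3]]


def join_name_parts(parts):
    """Join the parts of a qualified table name, e.g. ["DB", "SCHEMA", "T"]."""
    return ".".join(parts)


# A sequence of strings joins fine and yields a qualified table name.
valid_name = join_name_parts(["MY_DB", "MY_SCHEMA", "MY_TABLE"])

# A sequence of non-strings fails with the same kind of message as in the question.
try:
    join_name_parts(fake_probabilities)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(valid_name)
print(error_message)
```

With a real NumPy array the last word of the message is numpy.ndarray instead of list, which matches the error in the question.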

# The Snowpark package is required for Python Worksheets. 
# You can add more packages by selecting them using the Packages control and then importing them.

import snowflake.snowpark as snowpark
import snowflake.snowpark.functions as F
from sklearn.linear_model import LogisticRegression
from snowflake.snowpark.functions import col
import pandas as pd
import numpy as np


def main(session: snowpark.Session): 
    #X = full_sample[ind_cols].to_pandas()
    #y = full_sample[dep_col].to_pandas()

    # Number of samples and features
    n_samples = 100  # for example, 100 samples
    n_features = 5   # for example, 5 features
    
    # Generate random data for X
    np.random.seed(0)  # for reproducibility
    X_data = np.random.rand(n_samples, n_features)
    X = pd.DataFrame(X_data, columns=[f'feature_{i}' for i in range(n_features)])
    
    # Generate random binary data for y (a 1-D Series, which avoids
    # scikit-learn's column-vector DataConversionWarning in fit())
    y_data = np.random.randint(2, size=n_samples)
    y = pd.Series(y_data, name='target')

    # ret_df is the snowflake DataFrame that I'm interested in predicting probabilities for.
    # ret_df_lm = ret_df[ind_cols].to_pandas()
    ret_df_data = np.random.rand(n_samples, n_features)
    ret_df = pd.DataFrame(ret_df_data, columns=[f'feature_{i}' for i in range(n_features)])

    lm = LogisticRegression()

    lm.fit(X, y)

    y_pred = lm.predict_proba(ret_df)

    # y_final = session.table(y_pred)

    #retention_pred = lm.predict(ret_df)
    y_final = pd.DataFrame(y_pred, columns=['Prob_0', 'Prob_1'])

    # return a Snowpark DataFrame instead of a Pandas one
    return session.create_dataframe(y_final)
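
If you also want each row's probabilities to stay next to the features they were computed from, here is a minimal sketch of the pandas side only. It assumes the same kind of random data as above; the names to_score, prob_cols and scored, and the uppercase FEATURE_/PROB_ column names, are my own choices for illustration. The columns of predict_proba() follow the order of lm.classes_, so the labels are derived from it rather than hard-coded:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Random stand-in data (assumption: the real data comes from Snowflake tables).
rng = np.random.default_rng(0)
feature_names = [f"FEATURE_{i}" for i in range(3)]
X = pd.DataFrame(rng.random((50, 3)), columns=feature_names)
y = pd.Series(rng.integers(0, 2, size=50), name="TARGET")
to_score = pd.DataFrame(rng.random((10, 3)), columns=feature_names)

lm = LogisticRegression().fit(X, y)

# predict_proba returns an array of shape (n_rows, n_classes);
# column j holds the probability of class lm.classes_[j].
proba = lm.predict_proba(to_score)
prob_cols = [f"PROB_{c}" for c in lm.classes_]

# Reuse the index of to_score so the probabilities line up row by row.
proba_df = pd.DataFrame(proba, columns=prob_cols, index=to_score.index)
scored = pd.concat([to_score, proba_df], axis=1)

print(scored.head())
```

Inside main() you would then end with return session.create_dataframe(scored). I used uppercase column names because unquoted identifiers in Snowflake resolve to uppercase, so mixed-case names such as Prob_0 may need double quotes when you query the result in SQL.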