I have a Snowflake table with an ARRAY column containing custom embeddings (arrays with more than 1,000 elements).
These arrays are sparse, and I would like to reduce their dimensionality with SVD (or one of the
snowflake.ml.modeling.decomposition methods from Snowpark ML).
A toy example of the dataframe would be:
df = session.sql("""
select 'doc1' as doc_id, array_construct(0.1, 0.3, 0.5, 0.7) as doc_vec
union
select 'doc2' as doc_id, array_construct(0.2, 0.4, 0.6, 0.8) as doc_vec
""")
df.show()
# DOC_ID | DOC_VEC
# doc1   | [ 0.1, 0.3, 0.5, 0.7 ]
# doc2   | [ 0.2, 0.4, 0.6, 0.8 ]
However, when I try to fit this dataframe:
from snowflake.ml.modeling.decomposition import TruncatedSVD
tsvd = TruncatedSVD(input_cols='doc_vec', output_cols='out_svd')
print(tsvd)
out = tsvd.fit(df)
I get:
File "snowflake/ml/modeling/_internal/snowpark_trainer.py", line 218, in fit_wrapper_function
    args = {"X": df[input_cols]}
                 ~~^^^^^^^^^^^^
File "pandas/core/frame.py", line 3767, in __getitem__
    indexer = self.columns._get_indexer_strict(key, "columns")[1]
<...snip...>
KeyError: "None of [Index(['doc_vec'], dtype='object')] are in the [columns]"
Based on the information in the tutorial text_embedding_as_snowpark_python_udf,
I suspect the Snowpark ARRAY needs to be converted to a np.ndarray
before being fed to the underlying sklearn.decomposition.TruncatedSVD.
Can someone point me to an example using Snowflake arrays as inputs to the Snowpark models, please?
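For what it's worth, here is a minimal local sketch (plain sklearn, not Snowpark) of the input shape I believe the underlying estimator ultimately needs: a 2-D numeric matrix with one column per embedding dimension, rather than a single column of arrays. The n_components value is arbitrary:

import numpy as np
from sklearn.decomposition import TruncatedSVD as SkTruncatedSVD

# Rows = documents, columns = embedding dimensions.
X = np.array([[0.1, 0.3, 0.5, 0.7],
              [0.2, 0.4, 0.6, 0.8]])
sk_svd = SkTruncatedSVD(n_components=2)
print(sk_svd.fit_transform(X))  # shape (2, 2)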
The problem right now is that Snowflake ML currently doesn't support sparse matrices (but it will).
A teammate wrote this sample code showing the ARRAY-based syntax that will be supported in the future, along with a workaround that works today:
from snowflake.ml.modeling.decomposition import TruncatedSVD
from snowflake.ml.utils.connection_params import SnowflakeLoginOptions
from snowflake.snowpark import Session, functions as F, types as T

session = Session.builder.configs(SnowflakeLoginOptions()).getOrCreate()

# This cannot work right now because snowflake ml doesn't accept ARRAY-typed
# input columns so far... We'll support it in the future!
t = session.range(5).with_column(
    "doc_vec",
    F.array_construct(
        F.lit(0.1),
        F.lit(0.2),
        F.lit(0.3),
    ),
).with_column("doc_vec", F.col("doc_vec").cast(T.ArrayType(T.FloatType())))
tsvd = TruncatedSVD(input_cols="DOC_VEC", output_cols="DOC_VEC")

# Workaround that works today: create a dataframe with one numeric column per element.
t = session.create_dataframe([[0.1, 0.2, 0.3] for _ in range(5)], schema=["A", "B", "C"])
tsvd = TruncatedSVD(input_cols=["A", "B", "C"], output_cols=["OUTPUT"])
t.show()
tsvd.fit(t)

# Show the results.
tsvd.transform(t).show()
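If you need to apply today's workaround to an existing ARRAY column such as DOC_VEC in the question's df, one option is to flatten the array into one FLOAT column per element before fitting. The sketch below assumes the vector length is known up front; the column names, n_components, and output_cols naming are illustrative only and may need adjusting:

from snowflake.snowpark import functions as F, types as T
from snowflake.ml.modeling.decomposition import TruncatedSVD

VEC_LEN = 4                                    # assumed, known length of DOC_VEC
flat_cols = [f"V{i}" for i in range(VEC_LEN)]  # hypothetical column names

# Flatten the ARRAY into one FLOAT column per element.
df_flat = df.select(
    F.col("DOC_ID"),
    *[
        F.get(F.col("DOC_VEC"), F.lit(i)).cast(T.FloatType()).alias(c)
        for i, c in enumerate(flat_cols)
    ],
)

# One output column per component.
tsvd = TruncatedSVD(n_components=2, input_cols=flat_cols, output_cols=["SVD_0", "SVD_1"])
tsvd.fit(df_flat)
tsvd.transform(df_flat).show()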