I'm trying to extract an array from a Snowpark DataFrame, however, when I actually get it out of the DataFrame using .collect()
it has been converted to a string. Is there any way to get it out as a list[str]
? This also seems to happen when converting to pandas using .to_pandas()
.
Example:
dataframe = session.create_dataframe([[["some", "data", "here"]]], ["data"])
collected_array = dataframe.collect()[0][0]
dataframe.printSchema()
print(type(collected_array), "\n\n", collected_array, "\n\n", collected_array[0])
Output:
root
|-- "DATA": ArrayType(StringType()) (nullable = True)
<class 'str'>
[
"some",
"data",
"here"
]
[
I would expect collected_array
to be of type list[str]
since it is ArrayType(StringType())
when in the DataFrame.
I asked the same question on the Snowflake discourse and this is the answer a Snowflake employee gave me:
DataFrame API doesn’t currently support returning arrays or structs as Python lists or dictionaries. The returned data is a Row object, and the array is represented as a string.
As an alternative to json.loads(). I tried using the ast.literal_eval() function as a less hacky way to convert the string to a list.
import ast
dataframe = session.create_dataframe([[["some", "data", "here"]]], ["data"])
collected_array = dataframe.collect()[0][0]
dataframe.printSchema()
# Convert the string representation of list to a list
list_data = ast.literal_eval(collected_array)
print(type(list_data), "\n\n", list_data)
Output:
root
|-- "DATA": ArrayType(StringType()) (nullable = True)
<class 'list'>
['some', 'data', 'here']