Search code examples
pythonsnowflake-cloud-data-platform

How do you extract an array from a Snowpark DataFrame without converting it to a string?


I'm trying to extract an array from a Snowpark DataFrame, however, when I actually get it out of the DataFrame using .collect() it has been converted to a string. Is there any way to get it out as a list[str]? This also seems to happen when converting to pandas using .to_pandas().

Example:

dataframe = session.create_dataframe([[["some", "data", "here"]]], ["data"])
collected_array = dataframe.collect()[0][0]

dataframe.printSchema()

print(type(collected_array), "\n\n", collected_array, "\n\n", collected_array[0])

Output:

root
 |-- "DATA": ArrayType(StringType()) (nullable = True)
<class 'str'> 

[
  "some",
  "data",
  "here"
]

[

I would expect collected_array to be of type list[str] since it is ArrayType(StringType()) when in the DataFrame.


Solution

  • I asked the same question on the Snowflake discourse and this is the answer a Snowflake employee gave me:

    DataFrame API doesn’t currently support returning arrays or structs as Python lists or dictionaries. The returned data is a Row object, and the array is represented as a string.

    As an alternative to json.loads(). I tried using the ast.literal_eval() function as a less hacky way to convert the string to a list.

    import ast
    
    dataframe = session.create_dataframe([[["some", "data", "here"]]], ["data"])
    collected_array = dataframe.collect()[0][0]
    dataframe.printSchema()
    # Convert the string representation of list to a list
    list_data = ast.literal_eval(collected_array)
    
    print(type(list_data), "\n\n", list_data)
    

    Output:

    root
    |-- "DATA": ArrayType(StringType()) (nullable = True)
    <class 'list'> 
        
    ['some', 'data', 'here']