Tags: python, dataframe, apache-spark, pyspark, aws-glue

How to convert a list into multiple columns and a dataframe?


I have a challenge today: given a list of S3 paths, split each path and build a DataFrame with one column for the full path and a new column with just the name of the last folder.

My list has the following content:

raw/ingest_date=20240918/eventos/
raw/ingest_date=20240918/llamadas/
raw/ingest_date=20240918/campanhas/
raw/ingest_date=20240918/miembros/
raw/ingest_date=20240918/objetivos/

I tried this code:

new_dict = []
for folder in subfolders:
    new_dict.append(folder)
    name = folder.split("/", -1)
    new_dict.append(name[2])
    #print(name)

print(type(new_dict))
for elem in new_dict:
    print(elem) 

df = spark.createDataFrame(new_dict, ["s3_prefix", "table_name"])
df.show()

The printed result is a flat list like:

raw/ingest_date=20240918/eventos/
eventos
raw/ingest_date=20240918/llamadas/
llamadas
raw/ingest_date=20240918/campanhas/
campanhas
...
...

but when I try to create the DataFrame I see this:

TypeError: Can not infer schema for type: <class 'str'>

The idea is to have a DataFrame like:

s3_prefix                            | table_name
------------------------------------------------------
raw/ingest_date=20240918/eventos/    | eventos
raw/ingest_date=20240918/llamadas/   | llamadas
raw/ingest_date=20240918/campanhas/  | campanhas
raw/ingest_date=20240918/miembros/   | miembros

Can somebody give me a hand to resolve this?

Regards


Solution

  • Just use a list of tuples in this case: the first element is the full path (s3_prefix) and the second element is the last folder name, which in your case is the table name. The TypeError happens because createDataFrame cannot infer a two-column schema from a flat list of plain strings; each row must be a tuple (or Row) holding one value per column.

    Here data_T is the list of tuples (s3_prefix, table_name):

    # split('/')[-2] picks the segment just before the trailing slash,
    # e.g. 'eventos' from 'raw/ingest_date=20240918/eventos/'
    data_T = [(folder, folder.split('/')[-2]) for folder in subfolders]

    and then

    df = spark.createDataFrame(data_T, ["s3_prefix", "table_name"])
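
    Putting it together, here is a minimal end-to-end sketch. It assumes the
    subfolders list from the question; in an AWS Glue job that list would
    typically come from listing the S3 prefixes.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Sample input taken from the question
    subfolders = [
        "raw/ingest_date=20240918/eventos/",
        "raw/ingest_date=20240918/llamadas/",
        "raw/ingest_date=20240918/campanhas/",
        "raw/ingest_date=20240918/miembros/",
        "raw/ingest_date=20240918/objetivos/",
    ]

    # One tuple per row: (full path, last folder name)
    data_T = [(folder, folder.split('/')[-2]) for folder in subfolders]

    df = spark.createDataFrame(data_T, ["s3_prefix", "table_name"])
    df.show(truncate=False)

    Alternatively, the split can be done inside Spark instead of in Python:
    build a one-column DataFrame of the paths and derive table_name with the
    built-in split and element_at functions (element_at accepts a negative
    index, so -2 picks the segment just before the trailing slash):

    from pyspark.sql import functions as F

    # Each row must still be a tuple, even with a single column
    df = spark.createDataFrame([(p,) for p in subfolders], ["s3_prefix"])
    df = df.withColumn(
        "table_name",
        F.element_at(F.split(F.col("s3_prefix"), "/"), -2),
    )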