I have a challenge today: given a list of S3 paths, I need to split each one and build a DataFrame with one column containing the full path and a new column with just the name of the last folder.
My list has the following content:
raw/ingest_date=20240918/eventos/
raw/ingest_date=20240918/llamadas/
raw/ingest_date=20240918/campanhas/
raw/ingest_date=20240918/miembros/
raw/ingest_date=20240918/objetivos/
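In Python terms (using the name subfolders, as in the code below), that is:

subfolders = [
    "raw/ingest_date=20240918/eventos/",
    "raw/ingest_date=20240918/llamadas/",
    "raw/ingest_date=20240918/campanhas/",
    "raw/ingest_date=20240918/miembros/",
    "raw/ingest_date=20240918/objetivos/",
]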
I tried this code:
new_dict = []
for folder in subfolders:
    new_dict.append(folder)
    name = folder.split("/", -1)
    new_dict.append(name[2])
    # print(name)
print(type(new_dict))
for elem in new_dict:
    print(elem)
df = spark.createDataFrame(new_dict, ["s3_prefix", "table_name"])
df.show()
But the result is a flat list like:
raw/ingest_date=20240918/eventos/
eventos
raw/ingest_date=20240918/llamadas/
llamadas
raw/ingest_date=20240918/campanhas/
campanhas
...
...
But when I try to create and show my DataFrame I see this:
TypeError: Can not infer schema for type: <class 'str'>
The idea is to have a DataFrame like:
s3_prefix                           | table_name
------------------------------------------------
raw/ingest_date=20240918/eventos/   | eventos
raw/ingest_date=20240918/llamadas/  | llamadas
raw/ingest_date=20240918/campanhas/ | campanhas
raw/ingest_date=20240918/miembros/  | miembros
Can somebody give me a hand resolving this?
Regards
Your loop builds one flat list of alternating strings, so each "row" that createDataFrame receives is a bare str, and Spark cannot infer a two-column schema from that; hence the TypeError. Just use a list of tuples instead: the first element is the full path (s3_prefix) and the second element is the last folder name, which in your case is the table name.
Here, data_T is the list of tuples (s3_prefix, table_name):
data_T = [(folder, folder.split('/')[-2]) for folder in subfolders]
and then
df = spark.createDataFrame(data_T, ["s3_prefix", "table_name"])
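Putting it together, here is a minimal end-to-end sketch. It reuses the subfolders list from the question and assumes spark is a SparkSession (builder.getOrCreate() will reuse an existing session if your environment already provides one):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each row is a (s3_prefix, table_name) tuple. Every prefix ends with "/",
# so split("/")[-1] is an empty string and [-2] is the last folder name.
data_T = [(folder, folder.split("/")[-2]) for folder in subfolders]

df = spark.createDataFrame(data_T, ["s3_prefix", "table_name"])
df.show(truncate=False)

This prints exactly the two-column DataFrame from your example, with eventos, llamadas, campanhas, miembros and objetivos in table_name.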