Tags: dataframe, apache-spark, stringio

How to use StringIO(file.read()) to create a Spark dataframe


I have a very simple CSV file. Loading its records into a pandas dataframe is easy, as shown below. However, what I really need is a Spark dataframe.

How can I use StringIO(f.read()) to load the records into a Spark dataframe directly, instead of converting df_pandas to df_spark?

Thank you very much!

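import pandas as pd
from io import StringIO

# Read the whole file into memory, then let pandas parse the text
with open("C:\\myfolder\\test.csv", "r") as f:
    df_pandas = pd.read_csv(StringIO(f.read()), sep=";")
    # df_spark = spark.read.csv(StringIO(f.read()))  # this doesn't work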

Solution

  • You could convert the pandas dataframe to a spark dataframe:

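    import pandas as pd
    from io import StringIO

    with open("C:\\myfolder\\test.csv", "r") as f:
        df_pandas = pd.read_csv(StringIO(f.read()), sep=";")
    # Convert the pandas dataframe; assumes an existing SparkSession named spark
    df_spark = spark.createDataFrame(df_pandas)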
    

    This is pointless if you create your StringIO object from a local file, because Spark can load the file directly with spark.read.csv("C:\\myfolder\\test.csv", sep=";"). spark.read.csv expects a path (or an RDD of CSV rows), not a file-like object, which is why passing a StringIO does not work. Going through pandas does make sense if your StringIO object comes from another string (e.g. a FileUpload ipython widget).
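
If you want to skip pandas when the CSV text only exists in memory, one option is to hand Spark the rows as an RDD of strings, which spark.read.csv also accepts. The following is only a rough sketch: the helper name csv_text_to_spark_df is made up for illustration, it assumes you already have a SparkSession with an RDD-capable sparkContext, it assumes the first line is a header, and it splits on line breaks, so quoted fields that contain newlines would not survive.

```python
def csv_text_to_spark_df(spark, csv_text, sep=";"):
    """Hypothetical helper: build a Spark dataframe from CSV text held in memory.

    `spark` is assumed to be an existing SparkSession.
    """
    # Split the text into rows and drop blank lines
    lines = [line for line in csv_text.splitlines() if line.strip()]
    # Distribute the rows as an RDD of strings
    rdd = spark.sparkContext.parallelize(lines)
    # DataFrameReader.csv accepts an RDD of CSV rows in place of a path;
    # header=True assumes the first row holds the column names
    return spark.read.csv(rdd, sep=sep, header=True)
```

With a StringIO object named string_io, the call would look like csv_text_to_spark_df(spark, string_io.getvalue()). All columns come back as strings unless you also pass a schema or enable schema inference yourself.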