python, dataframe, csv, pyspark, microsoft-fabric

Write CSV-like text response from Python request to PySpark Dataframe


I'm working in a Microsoft Fabric notebook, which uses PySpark. The API response has the header 'Content-Type: text/csv', and response.text looks like '"ColA","ColB","ColC"\r"Row1A","Row1B","Row1C", ...', i.e. rows separated by '\r'. It's formatted such that I can copy the raw Postman output, paste it into Notepad, save it as .csv, and it looks fine when opened in Excel.

The response.text is about 45 MB and 16K-17K rows. I would like to put the data into a DataFrame, so I can use it to overwrite my Fabric Lakehouse table.

I've attempted df = spark.read.csv(responseText, header=False, schema=mySchema). The data and my schema have 64 fields. This ran for over 15 minutes before I cancelled it, since I wasn't sure whether I'd done something wrong. I plan to try again with header=True and inferSchema=True tomorrow, but first: is there preprocessing I have to do to response.text before calling spark.read.csv() on it? Is it supposed to take this long? Is there a more efficient way to convert it to a DataFrame? (For example: can I save the response as a .csv file within the notebook, then read.csv() from that?)

(For the Fabric-adept, is it possible to save the text as a file.csv and upload it to the Lakehouse 'Files' section, and turn it into a table there? In a way where I can do this programmatically by running the notebook?)

Thank you!


Solution

  • spark.read.csv expects a path to a file (or a list of paths), not the contents of the file, so passing the raw response text won't work.

    So write the data to a lakehouse file first, like this:

    # The backslash after the opening quotes avoids writing a leading blank line.
    text = """\
    "ColA","ColB","ColC"
    "Row1A","Row1B","Row1C"
    "Row2A","Row2B","Row2C"
    """
    
    # /lakehouse/default is the notebook's mount point for the default Lakehouse.
    with open('/lakehouse/default/Files/foo.csv', 'w') as f:
        f.write(text)
    
    # Relative "Files/..." paths resolve against the attached Lakehouse.
    df = spark.read.csv("Files/foo.csv", header=True)
    display(df)
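To answer the follow-up about doing all of this programmatically from the notebook, here is a minimal sketch. The function and path names are illustrative, the '\r' normalization assumes the row separators really are bare carriage returns as the Postman output suggests, and saveAsTable assumes a default Lakehouse is attached to the notebook:

```python
def normalize_newlines(text):
    # Postman shows '\r' between rows; normalize '\r\n' and bare '\r'
    # to '\n' so the CSV reader splits rows consistently.
    return text.replace("\r\n", "\n").replace("\r", "\n")

def save_response_as_table(spark, response_text, table_name,
                           file_path="/lakehouse/default/Files/response.csv",
                           spark_path="Files/response.csv"):
    # 1. Write the raw API text into the Lakehouse Files area.
    with open(file_path, "w", newline="") as f:
        f.write(normalize_newlines(response_text))
    # 2. Read it back as a DataFrame; the first row becomes the column names.
    df = spark.read.csv(spark_path, header=True)
    # 3. Overwrite the Lakehouse table with the result.
    df.write.mode("overwrite").saveAsTable(table_name)
    return df
```

Called as save_response_as_table(spark, response.text, "my_table"), this replaces the manual Files-upload step entirely; you could also pass your 64-field schema to spark.read.csv instead of relying on the header.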