Tags: scala, csv, apache-spark, amazon-s3, amazon-emr

Exporting Spark DataFrame to S3


After certain operations I have some data in a Spark DataFrame; to be specific, org.apache.spark.sql.DataFrame = [_1: string, _2: string ... 1 more field]

Now when I do df.show(), I get the following output, which is expected.

+--------------------+--------------------+--------------------+
|                  _1|                  _2|                  _3|
+--------------------+--------------------+--------------------+
|industry_name_ANZSIC|'industry_name_AN...|.isComplete("indu...|
|industry_name_ANZSIC|'industry_name_AN...|.isContainedIn("i...|
|industry_name_ANZSIC|'industry_name_AN...|.isContainedIn("i...|
|        rme_size_grp|'rme_size_grp' is...|.isComplete("rme_...|
|        rme_size_grp|'rme_size_grp' ha...|.isContainedIn("r...|
|        rme_size_grp|'rme_size_grp' ha...|.isContainedIn("r...|
|                year|  'year' is not null| .isComplete("year")|
|                year|'year' has type I...|.hasDataType("yea...|
|                year|'year' has no neg...|.isNonNegative("y...|
|industry_code_ANZSIC|'industry_code_AN...|.isComplete("indu...|
|industry_code_ANZSIC|'industry_code_AN...|.isContainedIn("i...|
|industry_code_ANZSIC|'industry_code_AN...|.isContainedIn("i...|
|            variable|'variable' is not...|.isComplete("vari...|
|            variable|'variable' has va...|.isContainedIn("v...|
|                unit|  'unit' is not null| .isComplete("unit")|
|                unit|'unit' has value ...|.isContainedIn("u...|
|               value| 'value' is not null|.isComplete("value")|
+--------------------+--------------------+--------------------+

The problem occurs when I try exporting the dataframe as a csv to my S3 bucket.

The code I have is: df.coalesce(1).write.mode("Append").csv("s3://<my path>")

But the CSV generated in my S3 path is full of gibberish or rich text. Also, the Spark prompt doesn't reappear after execution (meaning execution didn't finish?). Here's a sample screenshot of the generated CSV in my S3 bucket:

[Image: screenshot of the generated CSV in S3]

What am I doing wrong and how do I rectify this?


Solution

  • S3 URI schemes: a short description.

    Changing the letter in the URI scheme makes a big difference, because each scheme causes different software to be used to interface with S3.

    The three schemes (s3, s3n, s3a) differ as follows:

    s3 is a block-based overlay on top of Amazon S3, whereas s3n and s3a are not; they are object-based.

    s3n supports objects up to 5 GB, while s3a supports objects up to 5 TB and has higher performance. Note that s3a is the successor to s3n.
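
To make the scheme difference concrete, here is a minimal sketch of the write from the question addressed through s3a instead of s3. It assumes the s3a connector (the hadoop-aws module) and credentials are already configured in the Spark environment; whether this alone fixes the output depends on that setup. The object name S3aExport, the helpers s3aPath and exportCsv, and the bucket and prefix arguments are hypothetical names introduced only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

object S3aExport {
  // Hypothetical helper: build an s3a:// URI from a bucket name and a key prefix.
  def s3aPath(bucket: String, prefix: String): String = {
    val key = prefix.stripPrefix("/") // avoid a double slash after the bucket
    s"s3a://$bucket/$key"
  }

  // Same write as in the question, but using the s3a scheme.
  // Spark treats the target as a directory: coalesce(1) produces a single
  // part-*.csv file inside it, not one file named after the path itself.
  def exportCsv(df: DataFrame, bucket: String, prefix: String): Unit =
    df.coalesce(1)
      .write
      .mode(SaveMode.Append)
      .csv(s3aPath(bucket, prefix))
}
```

With the DataFrame from the question this would be called as S3aExport.exportCsv(df, "<bucket>", "<prefix>"), and the resulting CSV is the part-*.csv object under that prefix.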