So after certain operations I have some data in a Spark DataFrame, to be specific, org.apache.spark.sql.DataFrame = [_1: string, _2: string ... 1 more field]
Now when I do df.show(), I get the following output, which is expected.
+--------------------+--------------------+--------------------+
|                  _1|                  _2|                  _3|
+--------------------+--------------------+--------------------+
|industry_name_ANZSIC|'industry_name_AN...|.isComplete("indu...|
|industry_name_ANZSIC|'industry_name_AN...|.isContainedIn("i...|
|industry_name_ANZSIC|'industry_name_AN...|.isContainedIn("i...|
|        rme_size_grp|'rme_size_grp' is...|.isComplete("rme_...|
|        rme_size_grp|'rme_size_grp' ha...|.isContainedIn("r...|
|        rme_size_grp|'rme_size_grp' ha...|.isContainedIn("r...|
|                year|  'year' is not null| .isComplete("year")|
|                year|'year' has type I...|.hasDataType("yea...|
|                year|'year' has no neg...|.isNonNegative("y...|
|industry_code_ANZSIC|'industry_code_AN...|.isComplete("indu...|
|industry_code_ANZSIC|'industry_code_AN...|.isContainedIn("i...|
|industry_code_ANZSIC|'industry_code_AN...|.isContainedIn("i...|
|            variable|'variable' is not...|.isComplete("vari...|
|            variable|'variable' has va...|.isContainedIn("v...|
|                unit|  'unit' is not null| .isComplete("unit")|
|                unit|'unit' has value ...|.isContainedIn("u...|
|               value| 'value' is not null|.isComplete("value")|
+--------------------+--------------------+--------------------+
The problem occurs when I try exporting the dataframe as a csv to my S3 bucket.
The code I have is: df.coalesce(1).write.mode("Append").csv("s3://<my path>")
But the CSV generated in my S3 path is full of gibberish or rich text. Also, the Spark prompt doesn't reappear after execution (does that mean the execution didn't finish?). Here's a sample screenshot of the generated CSV in my S3 bucket:
What am I doing wrong and how do I rectify this?
S3: short description.
Changing the letters in the URI scheme makes a big difference, because each scheme causes different software to be used to interface with S3.
This is the difference between the three:
s3 is a block-based overlay on top of Amazon S3, whereas s3n and s3a are not; they are object-based.
s3n supports objects up to 5 GB, while s3a supports objects up to 5 TB and has higher performance. Note that s3a is the successor to s3n.
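If the block-based s3 scheme is indeed what produces the unreadable files, one thing to try is writing through s3a instead. The snippet below is only a minimal sketch: `S3Paths` and `withScheme` are hypothetical names made up for illustration (they are not part of Spark or Hadoop), and it assumes the cluster has the s3a connector (the hadoop-aws module) and credentials configured.

```scala
// Hypothetical helper, not part of Spark or Hadoop: swaps the URI scheme
// of an S3 path so that a different connector handles the same location.
object S3Paths {
  private val knownPrefixes = Seq("s3://", "s3n://", "s3a://")

  def withScheme(path: String, scheme: String): String =
    knownPrefixes.find(p => path.startsWith(p)) match {
      case Some(prefix) => s"$scheme://" + path.stripPrefix(prefix)
      case None         => path // not an S3 path, leave it unchanged
    }
}

// In spark-shell, the write from the question could then be tried as
// (assumes `df` is the DataFrame shown above):
//   df.coalesce(1).write.mode("append").csv(S3Paths.withScheme("s3://<my path>", "s3a"))
```

Writing "s3a://<my path>" directly in the csv(...) call has the same effect; the helper only makes the scheme swap explicit.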