I have the following PySpark script, sample.py, which contains print statements:
import sys
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import pyspark.sql.functions as f
from datetime import datetime
from time import time
if __name__ == '__main__':
    spark = SparkSession.builder.appName("Test").enableHiveSupport().getOrCreate()
    print("Print statement-1")
    schema = StructType([
        StructField("author", StringType(), False),
        StructField("title", StringType(), False),
        StructField("pages", IntegerType(), False),
        StructField("email", StringType(), False)
    ])
    data = [
        ["author1", "title1", 1, "[email protected]"],
        ["author2", "title2", 2, "[email protected]"],
        ["author3", "title3", 3, "[email protected]"],
        ["author4", "title4", 4, "[email protected]"]
    ]
    df = spark.createDataFrame(data, schema)
    print("Number of records", df.count())
    sys.exit(0)
The spark-submit command below redirects its output to sample.log, but the print statements do not appear in that file:
spark-submit --master yarn --deploy-mode cluster sample.py > sample.log
We want to write some information to a log file so that, once the Spark job completes, we can trigger follow-up actions based on what was printed.
Any help would be appreciated.
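The print statements will not appear in the spark-submit output; they are in the YARN logs. With --deploy-mode cluster the driver runs in a container on the cluster, so its stdout is captured in that container's log rather than on the machine where you ran spark-submit. When you run spark-submit you get an application ID that looks like application_1234567890123_12345.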
Once the Spark job has completed, run the following command with that application ID to fetch the aggregated YARN logs:
yarn logs -applicationId <applicationId>
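If you want to automate the two steps above (submit, then read the driver's print output), something along these lines might work. This is only a sketch under a few assumptions: the client output of spark-submit contains the application ID (it is normally logged to stderr), log aggregation is enabled on the cluster, and both spark-submit and yarn are on the PATH. The helper names extract_application_id, find_marker_lines and run_and_collect are hypothetical, made up for this example.

```python
import re
import subprocess

# YARN application IDs look like application_<cluster timestamp>_<sequence>.
APP_ID_PATTERN = re.compile(r"application_\d+_\d+")


def extract_application_id(submit_output):
    """Return the first YARN application ID found in the text, or None."""
    match = APP_ID_PATTERN.search(submit_output)
    return match.group(0) if match else None


def find_marker_lines(log_text, marker):
    """Return the log lines that contain the marker string."""
    return [line for line in log_text.splitlines() if marker in line]


def run_and_collect(script="sample.py", marker="Number of records"):
    """Submit the job, then search the aggregated YARN logs for the marker."""
    submit = subprocess.run(
        ["spark-submit", "--master", "yarn", "--deploy-mode", "cluster", script],
        capture_output=True,
        text=True,
    )
    # Assumption: the ID shows up somewhere in the client output (usually stderr).
    app_id = extract_application_id(submit.stderr + submit.stdout)
    if app_id is None:
        raise RuntimeError("no application ID found in spark-submit output")

    logs = subprocess.run(
        ["yarn", "logs", "-applicationId", app_id],
        capture_output=True,
        text=True,
        check=True,
    )
    return find_marker_lines(logs.stdout, marker)
```

Calling run_and_collect() after the cluster is reachable should return the matching lines from the driver's stdout, which your follow-up logic can then inspect. Aggregated logs can take a short while to become available after the job finishes, so a retry around the yarn logs call may be needed.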