I have a JSON file which I loaded into my program using textFile. I want to count the number of distinct users in my JSON data. I cannot convert to a DataFrame or Dataset. I tried the following code, and it gave me a Java EOFException.
jsonFile = sc.textFile('some.json')
dd = jsonFile.filter(lambda x: x[1]).distinct().count()
# 2nd column is the user ID column
Sample data:
{"review_id":"Q1sbwvVQXV2734tPgoKj4Q","user_id":"hG7b0MtEbXx5QzbzE6C_VA","business_id":"ujmEBvifdJM6h6RLv4wQIg","stars":1.0,"text":"Total bill for this horrible service? Over $8Gs","date":"2013-05-07 04:34:36"}
Use:
spark.read.json(Json_File, multiLine=True)
to read the JSON directly into a DataFrame.
Try multiLine=True or False depending on your file's layout (one JSON object spread across multiple lines vs. one object per line).
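If DataFrames really are off the table, note that with sc.textFile each element is a whole line of text, so x[1] is the second *character* of the line, not the user ID field. You need to parse each line as JSON and extract user_id before calling distinct().count(). A minimal sketch of that logic in plain Python (the sample lines and field names are taken from the question; the equivalent RDD chain is shown in the comment):

```python
import json

# Lines as they would come from sc.textFile('some.json'):
# one JSON object per line.
lines = [
    '{"review_id":"r1","user_id":"hG7b0MtEbXx5QzbzE6C_VA","stars":1.0}',
    '{"review_id":"r2","user_id":"hG7b0MtEbXx5QzbzE6C_VA","stars":4.0}',
    '{"review_id":"r3","user_id":"another_user","stars":3.0}',
]

# RDD equivalent (what the question's code was trying to do):
# sc.textFile('some.json') \
#   .map(lambda line: json.loads(line)['user_id']) \
#   .distinct() \
#   .count()
distinct_users = len({json.loads(line)['user_id'] for line in lines})
print(distinct_users)  # 2
```

This only works if each line is a complete JSON object; a pretty-printed (multi-line) file would need spark.read.json with multiLine=True instead, since json.loads on a fragment of an object is exactly what produces EOF-style parse errors.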