Tags: pyspark, rdd, apache-spark-sql

Count distinct users from RDD


I have a JSON file which I loaded into my program using textFile. I want to count the number of distinct users in my JSON data. I cannot convert it to a DataFrame or Dataset. I tried the following code, but it gave me a Java EOF error.

jsonFile = sc.textFile('some.json')
dd = jsonFile.filter(lambda x: x[1]).distinct().count()
# 2nd column is the user ID column

Sample data

{"review_id":"Q1sbwvVQXV2734tPgoKj4Q","user_id":"hG7b0MtEbXx5QzbzE6C_VA","business_id":"ujmEBvifdJM6h6RLv4wQIg","stars":1.0,"text":"Total bill for this horrible service? Over $8Gs","date":"2013-05-07 04:34:36"}
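For context on why the attempt fails: `sc.textFile` yields an RDD of raw strings, so `x[1]` is the second *character* of each line, not a column. If the data really must stay an RDD, a sketch like the following (the file name is the one from the question; `extract_user_id` is a hypothetical helper) parses each line first:

```python
import json

# Each RDD element is a raw JSON string; parse it before extracting fields.
# (Hypothetical helper -- not from the original post.)
def extract_user_id(line):
    return json.loads(line)["user_id"]

# Sketch of the RDD-only pipeline; requires a live SparkContext `sc`:
# jsonFile = sc.textFile('some.json')
# n_users = jsonFile.map(extract_user_id).distinct().count()
```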

Solution

  • Use:

    spark.read.json(Json_File, multiLine=True)
    

    to read the JSON directly into a DataFrame.

    Try multiLine as True or False depending on your file: True when each record spans multiple lines, False (the default) when the file has one JSON object per line.