Tags: pyspark, rdd, apache-spark-sql

Count distinct users from RDD


I have a JSON file which I loaded into my program using textFile. I want to count the number of distinct users in my JSON data. I cannot convert it to a DataFrame or Dataset. I tried the following code, but it gave me a Java EOF error.

jsonFile = sc.textFile('some.json')
dd = jsonFile.filter(lambda x: x[1]).distinct().count()
# 2nd column is the user ID column

Sample data

{"review_id":"Q1sbwvVQXV2734tPgoKj4Q","user_id":"hG7b0MtEbXx5QzbzE6C_VA","business_id":"ujmEBvifdJM6h6RLv4wQIg","stars":1.0,"text":"Total bill for this horrible service? Over $8Gs","date":"2013-05-07 04:34:36"}
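For context on why the attempt fails: `sc.textFile` yields an RDD of raw strings, so `x[1]` is the second *character* of each line, not a column. If the data really must stay an RDD, a sketch like the following (the file name is the one from the question; `extract_user_id` is a hypothetical helper) parses each line first:

```python
import json

# Each RDD element is a raw JSON string; parse it before extracting fields.
# (Hypothetical helper -- not from the original post.)
def extract_user_id(line):
    return json.loads(line)["user_id"]

# Sketch of the RDD-only pipeline; requires a live SparkContext `sc`:
# jsonFile = sc.textFile('some.json')
# n_users = jsonFile.map(extract_user_id).distinct().count()
```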

Solution

  • Use:

    spark.read.json(Json_File, multiLine=True)
    

    to read the JSON directly into a DataFrame.

    Try multiLine as True or False depending on your file: True when each record spans multiple lines, False (the default) when the file has one JSON object per line.