I can't manage to read a JSON file in Python with PySpark, because the file contains multiple records with each field on its own line.
Example:
{
"id" : "id001",
"name" : "NAME001",
"firstname" : "FIRSTNAME001"
}
{
"id" : "NNI002",
"name" : "NAME002",
"firstname" : "FIRSTNAME002"
}
{
"id" : "NNI003",
"name" : "NAME003",
"firstname" : "FIRSTNAME003"
}
I want to load it like this:
+------------+------+-------+
| firstname| id| name|
+------------+------+-------+
|FIRSTNAME001| id001|NAME001|
|FIRSTNAME002|NNI002|NAME002|
|FIRSTNAME003|NNI003|NAME003|
+------------+------+-------+
I get errors if I try spark.read.json("file.json").
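For context, here is a minimal sketch of the failing read; the SparkSession setup is my addition, everything else matches the call above. With the default line-delimited reader, no single line is a complete JSON document, so Spark can only infer the internal _corrupt_record column:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-read").getOrCreate()

# Each input line is only a fragment of a JSON object, so the default
# (one document per line) reader cannot parse any record.
df = spark.read.json("file.json")
df.printSchema()  # in my case, only the _corrupt_record column
df.show()         # fails with an AnalysisException about the corrupt record column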
And when I use spark.read.option("multiline","true").json("file.json")
I get only the first record, because with multiline enabled Spark parses the whole file as a single JSON document and stops after the first top-level object:
+------------+-----+-------+
| firstname| id| name|
+------------+-----+-------+
|FIRSTNAME001|id001|NAME001|
+------------+-----+-------+
I can read it with spark.read.json("file.json")
if I put every record on its own line:
{ "id" : "id001", "name" : "NAME001", "firstname" : "FIRSTNAME001" }
{ "id" : "NNI002", "name" : "NAME002", "firstname" : "FIRSTNAME002" }
{ "id" : "NNI003", "name" : "NAME003", "firstname" : "FIRSTNAME003" }
But as the file has 10M lines, that's not really an option.
If anyone has ideas to help me, I would really appreciate it.
Thanks a lot.
@Niveditha S and @blackhorse0101 pointing out the bad JSON formatting made me search for a way to correct the formatting of a very large file.
Here is the code I used to correct the file in a few seconds.
(Note that I use nest_asyncio because I'm in a Jupyter notebook; for a plain Python script, aiofiles alone is enough.)
import aiofiles
import nest_asyncio

nest_asyncio.apply()

async def process_file(input_file, output_file):
    async with aiofiles.open(input_file, mode='r') as f:
        content = await f.read()
    # Add ',' after each closing '}' if there is none
    lines = content.split('\n')
    for i in range(len(lines) - 1):
        if lines[i].strip().endswith('}') and not lines[i + 1].strip().startswith(','):
            lines[i] = lines[i].rstrip() + ','
    # Remove the trailing ',' after the last object
    # (lines[-1] is an empty string when the file ends with a newline)
    last_line = lines[-2]
    if last_line.endswith('},'):
        lines[-2] = last_line.rstrip(',')
    # Wrap the whole content in a JSON array
    content = '\n'.join(lines)
    content = '[' + content + ']'
    async with aiofiles.open(output_file, mode='w') as f:
        await f.write(content)

input_file = 'file.json'
output_file = 'file_corrected.json'
await process_file(input_file, output_file)
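If you run this outside a notebook, a sketch of the same call from a plain script, where asyncio.run replaces the top-level await and nest_asyncio is not needed:
import asyncio

# process_file is the coroutine defined above; asyncio.run starts the
# event loop that Jupyter would otherwise provide.
asyncio.run(process_file('file.json', 'file_corrected.json'))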
Then spark.read.option("multiline","true").json("file_corrected.json")
reads the corrected file.
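To verify, a quick sketch (assuming an existing SparkSession named spark, as in the question):
df = spark.read.option("multiline", "true").json("file_corrected.json")
df.show()  # now prints every record (all three in the sample above)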