Search code examples
pythonmongodbcsvpipepymongo

How to convert pipe delimited to CSV or JSON


I have a ~4GB txt file which is pipe delimited. I am trying to import this text to MongoDB but as you know MongoDB supports only JSON and CSV files. Below is the code so far.

import pandas as pd
import csv
from pymongo import MongoClient

url = "mongodb://localhost:27017"
client = MongoClient(url)
# Creating Database Office
db = client.Office
# Creating Collection Customers
customers = db.Customers

filename = "Names.txt"
data_df = pd.read_fwf(filename, sep="|", engine="python", encoding="latin-1")
fileout = "Names.csv"
output = data_df.to_csv(fileout, sep=",")
print("Finished")
fin = open("Names.csv", "r")
file_data = fin.read()
file_csv = csv.reader(file_data)
Customers.insert_many(file_csv)

The input file "Name.txt" looks like this

Reg|Name|DOB|Friend|Nationality|Profession^M
1122|Sam|01/01/2001|John|USA|Lawyer^M
2456|George|05/10/1999|Pit|Canada|Engineer^M
5645|Brad|02/06/2000|Adam|UK|Doctor^M

If the provided text file is CSV then simply import it to MongoDB or if the txt file is pipe delimited or any other delimited then import it to MongoDB after only after processing the text file to a CSV file. The CSV file that I get in fileout, when imported manually to MongoDB the result looks like this.

col1          col2
id    Reg|Name|DOB|Friend|Nationality|Profession
1     1122|Sam|01/01/2001|John|USA|Lawyer
2     2456|George|05/10/1999|Pit|Canada|Engineer
3     5645|Brad|02/06/2000|Adam|UK|Doctor

What I want to achieve is shown below. This was done with the sed command. First I replaced any "," if in the txt file with "-" using the command

sed -i 's/,/-/g' Names.txt

then I replaced the pipe delimiter with ",":

sed -i 's/|/,/g' Names.txt
col1 col2  col3   col4       col5    col6        col7
id   Reg   Name   DOB        Friend  Nationality Profession
1    1122  Sam    01/01/2001 John    USA         Lawyer
2    2456  George 05/10/1999 Pit     Canada      Engineer
3    5645  Brad   02/06/2000 Adam    UK          Doctor

I know that the code is not doing anything. But I can't figure out how to make it work.

I am new to all type of programming and I have searched through various answers regarding this question and various other related questions in the site, but none fits my needs.

UPDATE

import csv
import json
from pymongo import MongoClient

url = "mongodb://localhost:27017"
client = MongoClient(url)
db = client.Office
customer = db.Customer
jsonArray = []

with open("Names.txt", "r") as csv_file:
    csv_reader = csv.DictReader(csv_file, dialect='excel', delimiter='|', quoting=csv.QUOTE_NONE)
    for row in csv_reader:
        jsonArray.append(row)
    jsonString = json.dumps(jsonArray, indent=1, separators=(",", ":"))
    jsonfile = json.loads(jsonString)
    customer.insert_many(jsonfile)

This is the new code I came up with after getting some ideas from comments. But now the only problem is I get this error.

Traceback (most recent call last):
  File "E:\Anaconda Projects\Mongo Projects\Office Tool\csvtojson.py", line 16, in <module>
    jsonString = json.dumps(jsonArray, indent=1, separators=(",", ":"))
  File "C:\Users\Predator\anaconda3\lib\json\__init__.py", line 234, in dumps
    return cls(
  File "C:\Users\Predator\anaconda3\lib\json\encoder.py", line 201, in encode
    chunks = list(chunks)
MemoryError

Solution

  • Finally found the solution.

    I tested it on a 5GB file although slow it still works. It imports all data from a pipe delimited txt file to MongoDB.

    import csv
    import json
    
    from pymongo import MongoClient
    
    url_mongo = "mongodb://localhost:27017"
    client = MongoClient(url_mongo)
    db = client.Office
    customer = db.Customer
    jsonArray = []
    file_txt = "Text.txt"
    rowcount = 0
    with open(file_txt, "r") as txt_file:
        csv_reader = csv.DictReader(txt_file, dialect="excel", delimiter="|", quoting=csv.QUOTE_NONE)
        for row in csv_reader:
            rowcount += 1
            jsonArray.append(row)
        for i in range(rowcount):
            jsonString = json.dumps(jsonArray[i], indent=1, separators=(",", ":"))
            jsonfile = json.loads(jsonString)
            customer.insert_one(jsonfile)
    print("Finished")
    

    Thank You All for your Ideas