Search code examples
pythonmongodb

Struggling to upload data onto MongoDB


I am trying to upload data onto my MongoDB Atlas database. The data are .csv files from the Kaggle Movies Dataset https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset. I have tried different methods, but nothing seems to work as it should.

Method 1 I first tried to upload the data using a Python text editor as follows:

from pymongo.mongo_client import MongoClient
from pymongo.server_api import ServerApi
import pandas as pd

uri = "..."
client = MongoClient(uri, server_api=ServerApi('1'))
db = client['db']

def load_csv_to_mongodb_batch(collection_name, file_path, batch_size=1000):
    df = pd.read_csv(file_path)
    data = df.to_dict(orient='records')
    collection = db[collection_name]
    
    # Insert data in batches
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        collection.insert_many(batch)
        print(f"Inserted batch {i//batch_size + 1} into {collection_name} collection.")

datasets = {
    "Credits": "/path_to/credits.csv",
    "Keywords": "/path_to/keywords.csv",
    "Films": "/path_to/movies_metadata.csv",
}

for collection, file_path in datasets.items():
    load_csv_to_mongodb_batch(collection, file_path)

This uploads my data in the correct collections, however, some of my data in the .csv files is of the format [{'id': 1, 'name': 'Action'}, {'id':2, 'name':'Adenture}], and these data entries get uploaded as strings instead of subdocuments (see the genres field in the screenshot), which makes querying difficult/impossible Strings instead of Subdocuments

Method 2 I also tried to use mongoimport as follows: mongoimport --uri "..." --collection Films --drop --file /path_to/movies_metadata.csv. However, here I get the following message:

connected to: mongodb+srv://...
dropping: test.Films
Failed: error processing document #1: invalid character 'a' looking for beginning of value
0 document(s) imported successfully. 0 document(s) failed to import.

So, I establish a connection, drop the collection, but then it fails to import any of my data.

EDIT: Based on the comments I have added the keywords: --type=csv --headerlineto the prompt, which allows me to upload the data, but in this method, it is still uploading the data as strings instead of subdocuments.

Method 3 Finally, I even tried to convert my csv files to JSON files by putting double quotations around all keys and values. Here I converted the data as follows and then saved it as json files:

def preprocess_columns(field):
    if isinstance(field, str) and field.startswith('[') and field.endswith(']'): # deals with JSON lists
        field = field.replace("'", '"') # Replace single quotes with double quotes
        field = re.sub(r'(\b\w+\b):', r'"\1":', field) # Add double quotes around keys if missing
        return field
    elif isinstance(field, str) and field.startswith('{') and field.endswith('}'): # deals with JSON dictionaries
        field = field.replace("'", '"') 
        field = re.sub(r'(\b\w+\b):', r'"\1":', field) 
        return field
    elif isinstance(field, (int, float, bool)): # if single boolean or number --> does not need ".."
        return field
    elif not isinstance(field, (int, float, bool)): # if single string value --> need ".."
        field = field.replace('"', "'") # in case there are double quotations are part of text
        return f'"{field}"' 
    else: 
        return field

However, in this case, when using mongoimport, I get the message: 'Failed: cannot decode array into a primitive.D'

In the last method, I may have just made a mistake in converting, but I don't understand why the other two methods don't work.

I have been stuck on this for a while now, and have tried different ways, so any help would be really appreciated!

EDIT: Here are the first two films of the movies_metadata.csv file

adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
FALSE,"{'id': 10194, 'name': 'Toy Story Collection', 'poster_path': '/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg', 'backdrop_path': '/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg'}",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences.",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States of America'}]",30/10/1995,373554033,81,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,FALSE,7.7,5415
FALSE,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}, {'id': 10751, 'name': 'Family'}]",,8844,tt0113497,en,Jumanji,"When siblings Judy and Peter discover an enchanted board game that opens the door to a magical world, they unwittingly invite Alan -- an adult who's been trapped inside the game for 26 years -- into their living room. Alan's only hope for freedom is to finish the game, which proves risky as all three find themselves running from giant rhinoceroses, evil monkeys and other terrifying creatures.",17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'name': 'Teitler Film', 'id': 2550}, {'name': 'Interscope Communications', 'id': 10201}]","[{'iso_3166_1': 'US', 'name': 'United States of America'}]",15/12/1995,262797249,104,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso_639_1': 'fr', 'name': 'Français'}]",Released,Roll the dice and unleash the excitement!,Jumanji,FALSE,6.9,2413

Here is the first entry of the credits.csv file (to fit within the text limit, I had to shorten the lists)

cast,crew,id
"[{'cast_id': 14, 'character': 'Woody (voice)', 'credit_id': '52fe4284c3a36847f8024f95', 'gender': 2, 'id': 31, 'name': 'Tom Hanks', 'order': 0, 'profile_path': '/pQFoyx7rp09CJTAb932F2g8Nlho.jpg'}, {'cast_id': 15, 'character': 'Buzz Lightyear (voice)', 'credit_id': '52fe4284c3a36847f8024f99', 'gender': 2, 'id': 12898, 'name': 'Tim Allen', 'order': 1, 'profile_path': '/uX2xVf6pMmPepxnvFWyBtjexzgY.jpg'}]", "[{'credit_id': '52fe4284c3a36847f8024f49', 'department': 'Directing', 'gender': 2, 'id': 7879, 'job': 'Director', 'name': 'John Lasseter', 'profile_path': '/7EdqiNbr4FRjIhKHyPPdFfEEEFG.jpg'}, {'credit_id': '52fe4284c3a36847f8024f4f', 'department': 'Writing', 'gender': 2, 'id': 12891, 'job': 'Screenplay', 'name': 'Joss Whedon', 'profile_path': '/dTiVsuaTVTeGmvkhcyJvKp2A5kr.jpg'}, {'credit_id': '52fe4284c3a36847f8024f55', 'department': 'Writing', 'gender': 2, 'id': 7, 'job': 'Screenplay', 'name': 'Andrew Stanton', 'profile_path': '/pvQWsu0qc8JFQhMVJkTHuexUAa1.jpg'}, {'credit_id': '52fe4284c3a36847f8024f5b', 'department': 'Writing', 'gender': 2, 'id': 12892, 'job': 'Screenplay', 'name': 'Joel Cohen', 'profile_path': '/dAubAiZcvKFbboWlj7oXOkZnTSu.jpg'}]", 862

Solution

  • movies_metadata.csv is in an awkward format, seemingly a mix of typical CSV and strings of JSON.

    This is rather fragile, but here's a way using csvjson (from csvkit) and jq to convert your example CSV to JSON that mongoimport should accept.

    This is for a bash commandline (a different shell will probably have alternate parameter expansion/interpolation syntax):

    csvjson --stream -I ./movies_metadata.csv |jq -c $'with_entries(if IN(.key;"belongs_to_collection", "genres", "production_companies", "production_countries", "spoken_languages") then .value = ((.value|gsub("\'";"\\"")? // .)|fromjson? // .) elif .key=="release_date" then .value={"$date": (.value|strptime("%d/%m/%Y")|todate)} else .value=(.value|(.|tonumber? // .=="TRUE" // if .=="FALSE" then false else . end))  end )' > ./movies_metadata.json
    

    Assuming the format is acceptable to mongoimport, you could skip writing the JSON file and just pipe the output of the above pipe into mongoimport.

    Here's what the output of:

    csvjson --stream -I ./movies_metadata.csv |jq $'with_entries(if IN(.key;"belongs_to_collection", "genres", "production_companies", "production_countries", "spoken_languages") then .value = ((.value|gsub("\'";"\\"")? // .)|fromjson? // .) elif .key=="release_date" then .value={"$date": (.value|strptime("%d/%m/%Y")|todate)} else .value=(.value|(.|tonumber? // .=="TRUE" // if .=="FALSE" then false else . end))  end )'
    

    ... looks like (note the use of "$date" for proper MongoDB date type):

    {
      "adult": false,
      "belongs_to_collection": {
        "id": 10194,
        "name": "Toy Story Collection",
        "poster_path": "/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg",
        "backdrop_path": "/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg"
      },
      "budget": 30000000,
      "genres": [
        {
          "id": 16,
          "name": "Animation"
        },
        {
          "id": 35,
          "name": "Comedy"
        },
        {
          "id": 10751,
          "name": "Family"
        }
      ],
      "homepage": "http://toystory.disney.com/toy-story",
      "id": 862,
      "imdb_id": "tt0114709",
      "original_language": "en",
      "original_title": "Toy Story",
      "overview": "Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences.",
      "popularity": 21.946943,
      "poster_path": "/rhIRbceoE9lR4veEXuwCC2wARtG.jpg",
      "production_companies": [
        {
          "name": "Pixar Animation Studios",
          "id": 3
        }
      ],
      "production_countries": [
        {
          "iso_3166_1": "US",
          "name": "United States of America"
        }
      ],
      "release_date": {
        "$date": "1995-10-30T00:00:00Z"
      },
      "revenue": 373554033,
      "runtime": 81,
      "spoken_languages": [
        {
          "iso_639_1": "en",
          "name": "English"
        }
      ],
      "status": "Released",
      "tagline": null,
      "title": "Toy Story",
      "video": false,
      "vote_average": 7.7,
      "vote_count": 5415
    }
    {
      "adult": false,
      "belongs_to_collection": null,
      "budget": 65000000,
      "genres": [
        {
          "id": 12,
          "name": "Adventure"
        },
        {
          "id": 14,
          "name": "Fantasy"
        },
        {
          "id": 10751,
          "name": "Family"
        }
      ],
      "homepage": null,
      "id": 8844,
      "imdb_id": "tt0113497",
      "original_language": "en",
      "original_title": "Jumanji",
      "overview": "When siblings Judy and Peter discover an enchanted board game that opens the door to a magical world, they unwittingly invite Alan -- an adult who's been trapped inside the game for 26 years -- into their living room. Alan's only hope for freedom is to finish the game, which proves risky as all three find themselves running from giant rhinoceroses, evil monkeys and other terrifying creatures.",
      "popularity": 17.015539,
      "poster_path": "/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg",
      "production_companies": [
        {
          "name": "TriStar Pictures",
          "id": 559
        },
        {
          "name": "Teitler Film",
          "id": 2550
        },
        {
          "name": "Interscope Communications",
          "id": 10201
        }
      ],
      "production_countries": [
        {
          "iso_3166_1": "US",
          "name": "United States of America"
        }
      ],
      "release_date": {
        "$date": "1995-12-15T00:00:00Z"
      },
      "revenue": 262797249,
      "runtime": 104,
      "spoken_languages": [
        {
          "iso_639_1": "en",
          "name": "English"
        },
        {
          "iso_639_1": "fr",
          "name": "Français"
        }
      ],
      "status": "Released",
      "tagline": "Roll the dice and unleash the excitement!",
      "title": "Jumanji",
      "video": false,
      "vote_average": 6.9,
      "vote_count": 2413
    }