mongodb mongodb-query bigdata pymongo mongodb-compass

How to export a smaller collection in MongoDB (big data)? Aggregations time out! (any big data help MUCH appreciated!)


This is my first time making an account on Stack Overflow, so I apologise if what I am asking is really straightforward.

What I want to do: I have a database of 14 million documents of Twitter data that I wish to analyse. I am trying to query only the documents that are in a specific language and export the result to a smaller collection so that I can actually perform my analysis on it.

My issue: I can't seem to run a full query without MongoDB Compass timing out or running indefinitely. I don't know how to make my database smaller, and I can't run my analysis on it without my RAM being overused and my computer crashing.

What I have tried:

  • I have tried using PyMongo, since Python is the only language I know, but I couldn't find enough documentation, so I got desperate and switched to the GUI (Compass).
  • I have tried running my query (a simple query like {language: {$eq: "en"}, "user.location": "USA"}) on a smaller database and exporting the result to reduce the size of the database, and it works! When I try the same thing on my real 32 GB database, it either gives me a timeout error or, when I increase the max time ms, it runs forever and I can't export anything.
  • I have tried aggregating it in MongoDB Compass using $match and $project on my database, but it also times out, and I can't figure out how to export the result of the aggregation.

Please help me! I am genuinely floored: all my analysis skills are useless because I can't get to the data due to its sheer size :(

If you have any other tips (e.g. don't use MongoDB, use R or Hadoop for Windows or something), please let me know. At this point I'm willing to teach myself anything if it helps me get a grip on this dataset!

Thank you!


Solution

  • Add an index on each field that you want to query on and, if you can, increase the memory available to your cluster. To create the indexes on your collection, run the following shell commands once:

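    // Replace "collection" with the name of your collection.
    // Index for the language filter.
    db.collection.createIndex(
      { "language": 1 },
      { unique: false }
    )

    // Index for the nested user.location filter.
    db.collection.createIndex(
      { "user.location": 1 },
      { unique: false }
    )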

    You don't need to change your query to use the indexes; MongoDB will select them automatically.
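
Since the question mentions Python, here is a rough PyMongo sketch of the same idea. It is not part of the original answer. It creates the two indexes and then copies the matching documents into a smaller collection with a `$match` + `$out` aggregation, so the filtering and writing happen on the server rather than in Compass or in local RAM. The connection URI, the database name `twitter`, the collection name `tweets`, the target name `tweets_en_usa` and all the helper function names are assumptions made for illustration. Note that `$out` replaces the target collection if it already exists.

```python
from typing import Any, Dict, List


def build_filter(language: str, location: str) -> Dict[str, Any]:
    """Filter equivalent to the query in the question."""
    return {"language": language, "user.location": location}


def build_export_pipeline(language: str, location: str, target: str) -> List[Dict[str, Any]]:
    """Pipeline that filters first, then writes the result to `target`."""
    return [
        # $match comes first so the indexes can be used for the filter.
        {"$match": build_filter(language, location)},
        # $out writes the matching documents to a new collection on the
        # server and REPLACES that collection if it already exists.
        {"$out": target},
    ]


def export_subset(collection, language: str, location: str, target: str):
    """Create the indexes from the answer, then run the export.

    `collection` is assumed to be a pymongo Collection (or any object with
    compatible create_index / aggregate methods).
    """
    # Creating an index that already exists with the same spec is a no-op.
    collection.create_index([("language", 1)])
    collection.create_index([("user.location", 1)])
    pipeline = build_export_pipeline(language, location, target)
    # allowDiskUse lets pipeline stages spill to disk instead of failing on memory limits.
    collection.aggregate(pipeline, allowDiskUse=True)
    return pipeline


def run_against_local_server() -> None:
    """Hypothetical usage; URI, database and collection names are placeholders."""
    from pymongo import MongoClient  # imported here so the helpers above need no driver

    client = MongoClient("mongodb://localhost:27017")
    tweets = client["twitter"]["tweets"]
    export_subset(tweets, "en", "USA", "tweets_en_usa")
    print(client["twitter"]["tweets_en_usa"].estimated_document_count())
```

Calling `run_against_local_server()` once (after adjusting the names) should leave you with a much smaller `tweets_en_usa` collection that you can export from Compass or analyse directly.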