python · database-design · data-analysis

How To Store Changing Data For 20,000 Items In A Database


I have 20,000 different items that I'd like to track. They all have the same attributes: name, height, weight, assigned ID, and cost (these are just examples). Certain attributes, like name and assigned ID, won't ever change. The other attributes change rapidly (5 - 10 times an hour) for some items and rarely (1 - 2 times a week or month) for others. I would like a way to track all these changes in a database so that I can perform some analysis on them.

Now, I have almost no experience with databases. I have never dealt with this amount of information before and have just gotten by with CSV files. I am using Python to retrieve the data and will be using it for the analysis as well, so I was thinking that SQLite, which I believe is pretty easy to use from Python, might be my best option. My first idea was to create a table for each item, with the attributes as columns and one row per change. However, I get the feeling that there are probably much better ways (in terms of efficiency and storage size) to do this.

If somebody could please outline the best way to store this data so that regular retrieval for analysis wouldn't be too difficult, that would really be appreciated.

Also, if you could point me to some resources for doing data analysis on this type of data in python that would be great too.

Thanks in advance


Solution

  • For simple database stuff, SQLAlchemy is your friend: http://www.sqlalchemy.org/

    The documentation includes a fairly comprehensive tutorial that goes through the high-level concepts involved in working with a database, as well as how to design and work with the tables directly in Python.

    Here's a sample showing how you can define tables in your code. SQLAlchemy will then take care of managing the database, and you can work with the rows as normal Python objects.

    >>> from sqlalchemy import Column, Integer, String
    >>> from sqlalchemy.ext.declarative import declarative_base
    >>> Base = declarative_base()
    >>> class User(Base):
    ...     __tablename__ = 'users'
    ...
    ...     id = Column(Integer, primary_key=True)
    ...     name = Column(String)
    ...     fullname = Column(String)
    ...     password = Column(String)
    ...
    ...     def __repr__(self):
    ...         return "<User(name='%s', fullname='%s', password='%s')>" % (
    ...             self.name, self.fullname, self.password)
    

    Plus, the tutorial uses an SQLite database!
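    To hook the model above up to an actual SQLite file, all you need is an engine and a session. Here's a minimal sketch (the filename widgets.db is just a placeholder, and it assumes the Base and User definitions above):

    >>> from sqlalchemy import create_engine
    >>> from sqlalchemy.orm import sessionmaker
    >>> engine = create_engine('sqlite:///widgets.db')  # points at the file widgets.db (created on first use)
    >>> Base.metadata.create_all(engine)                # creates the 'users' table if it doesn't exist yet
    >>> Session = sessionmaker(bind=engine)
    >>> session = Session()
    >>> session.add(User(name='ed', fullname='Ed Jones', password='edspassword'))
    >>> session.commit()
    >>> session.query(User).filter_by(name='ed').first()
    <User(name='ed', fullname='Ed Jones', password='edspassword')>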

    I'd also recommend taking a look around the database as you're working on it, to help get a better sense of what's going on under the hood. You didn't mention what environment you're working in, but there are a lot of different options for exploring SQLite databases: the sqlite3 command-line shell on Linux, SQLite Database Browser on Windows, etc.
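    If you'd rather stay inside Python, the standard-library sqlite3 module is enough for a quick peek at what SQLAlchemy created (this sketch assumes the same placeholder widgets.db file and the row inserted above):

    >>> import sqlite3
    >>> conn = sqlite3.connect('widgets.db')
    >>> # list the tables SQLAlchemy created
    >>> conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()
    [('users',)]
    >>> # dump the raw rows
    >>> conn.execute("SELECT * FROM users").fetchall()
    [(1, 'ed', 'Ed Jones', 'edspassword')]
    >>> conn.close()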

    Update:

    As to the other part of the question, you might want to read up on database normalization.

    The general idea of (basic) normalization is that different pieces of similar information should be kept together in one place, rather than being repeated across lots of different tables.

    From the brief description you've given, my approach would be a Widgets table that would store an ID and name for each item you're tracking, and then a Dimensions table that would list all the changes:

    +-----------+----------------------+--------+--------+-----+
    | widget_id |    change_time       | height | weight | ... |
    +-----------+----------------------+--------+--------+-----+
    |         0 | 05-12-2014 18:30:58  |      4 |     10 | ... |
    |         0 | 06-22-2014 12:32:46  |      4 |     12 | ... |
    |         0 | 06-23-2014 01:02:00  |      6 |     12 | ... |
    |                           ...                            |
    

    Whenever a change occurs, you add an entry to the Dimensions table. That way, you can look up the dimensions for any time in the past, do analysis of changes over time, etc.
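    In SQLAlchemy terms, that two-table design could look roughly like this (the class and column names are just illustrative, and it reuses the Base, engine and session from the earlier sketch):

    >>> from datetime import datetime
    >>> from sqlalchemy import Column, Integer, String, Float, DateTime, ForeignKey
    >>> class Widget(Base):
    ...     __tablename__ = 'widgets'
    ...     id = Column(Integer, primary_key=True)   # the assigned ID that never changes
    ...     name = Column(String)                    # also fixed
    ...
    >>> class Dimension(Base):
    ...     __tablename__ = 'dimensions'
    ...     id = Column(Integer, primary_key=True)
    ...     widget_id = Column(Integer, ForeignKey('widgets.id'))
    ...     change_time = Column(DateTime, default=datetime.utcnow)  # stamped automatically
    ...     height = Column(Float)
    ...     weight = Column(Float)
    ...     cost = Column(Float)
    ...
    >>> Base.metadata.create_all(engine)
    >>> session.add(Widget(id=0, name='example widget'))
    >>> # whenever a change is observed, append one row:
    >>> session.add(Dimension(widget_id=0, height=6, weight=12, cost=3.5))
    >>> session.commit()

    Pulling the full history for one item back out is then a single query, e.g. session.query(Dimension).filter_by(widget_id=0).order_by(Dimension.change_time).all().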