php database postgresql entity-attribute-value

EAV vs. Column based organization for my data


I'm in the process of rebuilding an application (lone developer here) using PHP and PostgreSQL. For most of the data, I'm storing it in tables with one column per attribute. However, I'm now starting to build some of the tables for the content storage. The content, in this case, is multiple sections that each contain different data sets; some of the data is common and shared (and foreign key'd), while other data is unique to a section. In the current iteration of the application we have a table structure like this:

id | project_name | project_owner | site | customer_name | last_updated
-----------------------------------------------------------------------
1  | test1        | some guy      | 12   | some company  | 1/2/2012
2  | test2        | another guy   | 04   | another co    | 2/22/2012

Now, this works - but it gets hard to maintain for a few reasons. Adding new columns (happens rarely) requires modifying the database table. Audit/history tracking requires a separate table that mirrors the main table with additional information - which also requires modification if the main table is changed. Finally, there are a lot of columns - over 100 in some tables.

I've been brainstorming alternative approaches, including breaking out one large table into a number of smaller tables. That approach introduces problems of its own.

The approach I am currently considering seems to be called the EAV model. I have a table that looks like this:

id | project_name | col_name      | data_varchar | data_int | data_timestamp | update_time
------------------------------------------------------------------------------------------
1  | test1        | site          |              | 12       |                | 1/2/2012
2  | test1        | customer_name | some company |          |                | 1/2/2012
3  | test1        | project_owner | some guy     |          |                | 1/2/2012

...and so on. This has the advantage that I'm never updating, always inserting. Data is never overwritten, only added. Of course, the table will eventually grow to be rather large. I have an 'index' table that lists the projects and is used to reference the 'data' table. However, I feel I am missing something large with this approach. Will it scale? I originally wanted to do a simple key -> value type table, but realized I need to be able to have different data types within the table. This seems manageable because the database abstraction layer I'm using will include a type that selects data from the proper column.
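To make the trade-off concrete: reading a project back out of an insert-only table like this means picking the newest row per attribute and then pivoting those rows into one record. The following is only a rough sketch under assumptions: it uses Python's built-in sqlite3 as a stand-in for PostgreSQL so it runs anywhere, the table name `project_data` and the helper `load_project` are hypothetical, the dates are rewritten in ISO form so they compare correctly as text, and the fourth row is invented purely to show a later value superseding an earlier one.

```python
import sqlite3

# sqlite3 is a stand-in so the sketch runs anywhere; the question targets PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE project_data (
        id             INTEGER PRIMARY KEY,
        project_name   TEXT NOT NULL,
        col_name       TEXT NOT NULL,
        data_varchar   TEXT,
        data_int       INTEGER,
        data_timestamp TEXT,
        update_time    TEXT NOT NULL  -- ISO dates (assumption) so text comparison orders correctly
    )
""")
conn.executemany(
    "INSERT INTO project_data"
    " (project_name, col_name, data_varchar, data_int, data_timestamp, update_time)"
    " VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("test1", "site", None, 12, None, "2012-01-02"),
        ("test1", "customer_name", "some company", None, None, "2012-01-02"),
        ("test1", "project_owner", "some guy", None, None, "2012-01-02"),
        # Hypothetical later change: inserted as a new row, nothing is overwritten.
        ("test1", "project_owner", "another guy", None, None, "2012-02-22"),
    ],
)


def load_project(conn, project_name, as_of="9999-12-31"):
    """Rebuild one project as a dict from the newest row per attribute up to as_of."""
    cur = conn.execute(
        """
        SELECT d.col_name, d.data_varchar, d.data_int, d.data_timestamp
        FROM project_data AS d
        WHERE d.project_name = ?
          AND d.id = (
              SELECT s.id
              FROM project_data AS s
              WHERE s.project_name = d.project_name
                AND s.col_name = d.col_name
                AND s.update_time <= ?
              ORDER BY s.update_time DESC, s.id DESC
              LIMIT 1
          )
        """,
        (project_name, as_of),
    )
    record = {}
    for col_name, v_text, v_int, v_ts in cur:
        # Take whichever typed column was populated for this attribute.
        record[col_name] = next((v for v in (v_text, v_int, v_ts) if v is not None), None)
    return record


current = load_project(conn, "test1")                       # latest state
earlier = load_project(conn, "test1", as_of="2012-01-31")   # state before the owner changed
```

Note that every read pays for the per-attribute subquery and the pivot in application code, which is the kind of cost a plain multi-column table avoids.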

Am I making too much work for myself? Should I stick with a simple table with a ton of columns?


Solution

  • Moving your entire structure to EAV can lead to a lot of problems down the line, but it might be acceptable for the audit-trail portion of your problem, since foreign key relationships and strict data typing often stop holding for historical records anyway. You can probably even generate your audit tables automatically with triggers and stored procedures.

    Note, however, that reconstructing old versions of records is non-trivial with an EAV audit trail and will require a fair amount of application code. The database will not be able to do it by itself.

    An alternative you could consider is to store all your data (new and old records) in the same table. You can either include audit fields in the same table and leave them NULL when unnecessary, or mark some rows in the table as "current" and store the audit-related fields in another table. To simplify your application, you can create a view which only shows current rows and issue queries against the view.

    You can accomplish this with a joined table inheritance pattern. With joined table inheritance, you put common attributes into a base table along with a "type" column, and you join to additional tables (whose primary key is also a foreign key to the base table) based on that type. Many Data-Mapper-pattern ORMs have native support for this pattern, often called "polymorphism".

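    You could also use PostgreSQL's native table inheritance mechanism, but read the caveats in its documentation carefully: for example, unique constraints and foreign keys apply only to a single table and do not extend to its child tables.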
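The joined table inheritance pattern described above could look roughly like the sketch below. Treat it as an illustration under assumptions rather than a recommended schema: it uses Python's built-in sqlite3 as a stand-in for PostgreSQL, and the table names (`section`, `section_site`, `section_customer`), the `section_type` discriminator, and the helpers `add_section` and `load_section` are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific; PostgreSQL enforces foreign keys by default
conn.executescript("""
    -- Base table: attributes every section shares, plus a type discriminator.
    CREATE TABLE section (
        id           INTEGER PRIMARY KEY,
        project_name TEXT NOT NULL,
        section_type TEXT NOT NULL CHECK (section_type IN ('site', 'customer')),
        last_updated TEXT NOT NULL
    );
    -- Subtype tables: the primary key doubles as a foreign key to the base table.
    CREATE TABLE section_site (
        id   INTEGER PRIMARY KEY REFERENCES section(id),
        site INTEGER NOT NULL
    );
    CREATE TABLE section_customer (
        id            INTEGER PRIMARY KEY REFERENCES section(id),
        customer_name TEXT NOT NULL
    );
""")

# Fixed mapping from type to (subtype table, its extra columns); never built from user input.
SUBTYPES = {
    "site": ("section_site", ["site"]),
    "customer": ("section_customer", ["customer_name"]),
}


def add_section(conn, project_name, section_type, values, updated):
    """Insert the shared attributes first, then the type-specific row with the same id."""
    table, cols = SUBTYPES[section_type]
    cur = conn.execute(
        "INSERT INTO section (project_name, section_type, last_updated) VALUES (?, ?, ?)",
        (project_name, section_type, updated),
    )
    section_id = cur.lastrowid
    placeholders = ", ".join("?" for _ in cols)
    conn.execute(
        f"INSERT INTO {table} (id, {', '.join(cols)}) VALUES (?, {placeholders})",
        (section_id, *(values[c] for c in cols)),
    )
    return section_id


def load_section(conn, section_id):
    """Read the type from the base table, then join to the matching subtype table."""
    base = conn.execute(
        "SELECT section_type FROM section WHERE id = ?", (section_id,)
    ).fetchone()
    if base is None:
        return None
    table, cols = SUBTYPES[base[0]]
    extra = ", ".join(f"t.{c} AS {c}" for c in cols)
    cur = conn.execute(
        f"SELECT s.id AS id, s.project_name AS project_name,"
        f" s.section_type AS section_type, s.last_updated AS last_updated, {extra}"
        f" FROM section AS s JOIN {table} AS t ON t.id = s.id WHERE s.id = ?",
        (section_id,),
    )
    row = cur.fetchone()
    return dict(zip((d[0] for d in cur.description), row))


site_id = add_section(conn, "test1", "site", {"site": 12}, "2012-01-02")
customer_id = add_section(conn, "test1", "customer", {"customer_name": "some company"}, "2012-01-02")
site_section = load_section(conn, site_id)
customer_section = load_section(conn, customer_id)
```

Compared with EAV, each subtype keeps real columns with real types and constraints, at the cost of one extra table per kind of section.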