If I have a table of installed equipment with make and model, where the make and model will be duplicated a lot, but with a variety of spellings, etc, how is the best way to avoid wasted space from data duplication?
CREATE TABLE equipment (
id integer NOT NULL,
make character varying(128),
model character varying(128),
lat double precision,
lon double precision,
created timestamp without time zone,
updated timestamp without time zone
);
This table has a lot more fields in reality and will have many millions of rows, and I have other tables in a similar situation totalling about 600 GB of data.
The source data needs to be kept the same (ie. "Panasonic" and "PANASONIC" can't be combined / corrected), and the scale and variety of the data makes that impractical anyway.
I'm envisioning a separate key:value table that stores the values and then the ID is just stored in the equipment table, with a function where I just pass the value and it returns the ID (whether it looks it up and returns the ID or inserts it and returns the new ID).
That would make the tables become:
CREATE TABLE equipment (
id integer NOT NULL,
make integer,
model integer,
lat double precision,
lon double precision,
created timestamp without time zone,
updated timestamp without time zone
);
CREATE TABLE lookup (
id integer NOT NULL,
value character varying(128),
updated timestamp without time zone
);
And interacting with the table would be:
SELECT
id,
lookup_value(make) AS make,
lookup_value(model) AS model,
lat,
lon,
created,
updated
FROM
equipment
INSERT INTO
equipment (id, make, model, created)
VALUES
(nextval('equipment_id_seq'::regclass), lookup_value('Panasonic'), lookup_value('ABC123-G'), NOW())
The lookup table could be reused among a variety of fields and tables, with each string value only appearing once, and the key:value staying the same forever (changing from "Panasonic" and "PANASONIC" wouldn't change the key for "Panasonic", it would return the key for "PANASONIC" instead, inserting if needed).
What are the problems with this approach (aside from code complexity)?
Is there a better approach?
You would never want to have generic lookup table like this. For one it means you cant create a foreign key between the two "value" columns and the IDs, because there is no way of stopping an entry for Make ending up in Model.
As @a_horse_with_no_name said, you would be better to create a model and make table, with FK between them, and then do as you say where you only save a new model or make if it doesn't already exist.
I would also be tempted to have a third column so for all the possible spellings for PANASONIC for example, you have both the lookup row for what they entered, and a reference to what they probably meant. That would assist in cleaning up data going forward. You could suggest in the UI "Did you mean Panasonic" when they enter "Panasoonic" for example.
Coding us up to you, either in a single update, stored proc, or app code.