I am trying to create a new ID for a dataset which will identify and group people together when certain conditions are met, for example a shared address. I'm trying to add an autoincrement column that creates this ID for those people, so I could see person x and y have the same address so they get the same new_id, but person z may not share an address with anyone and so would have the new_id field left blank.
I think my syntax is wrong and am not sure that you can do ALTER TABLE [] ADD [] AS SELECT [] WHERE []
. Can someone tell me where I'm going wrong here or how to do this properly? Sample code below. Apologies if the code formatting is bad, stackoverflow is ignoring my line breaks.
ALTER TABLE people_tab ADD new_id int autoincrement(1,1)
AS SELECT *
FROM people_tab a
INNER JOIN people_tab b
ON LEFT(a.zip,5) = LEFT(b.zip,5) and a.id != b.id
where LEN(a.address) >=10 and and LEN(b.address) >=10
Since addresses are shared, and not people, you should create a new table of addresses and have the auto-increment applied there for each unique address.
The quality of your address sources will play heavily into this - if you are using an address lookup service to improve your data quality, then you're headed in the right direction. Otherwise, you may be heading into a game of whack-a-mole correcting everything from misspellings, case insensitivity, variations on standard names (like RD, Road, Rd.) that mean the same thing, and others.
Assuming your data is "clean", then I would create a table with address and zip columns:
create or replace table address_master
(id int autoincrement(1,1),
zip string,
address string );
insert into address_master (zip, address)
select distinct zip, address from people_tab;
Now you can join your people_tab on address_master and get the id associated with all people matching the same address.