Search code examples
sqlsql-serverquery-performance

SQL Server inefficient query (server full load)


We are currently scraping structured data from a variety of different sources. Before ingesting new data into our table, we check to see if the data_id exists already.

IF NOT EXISTS (SELECT TOP 1 * FROM TABLE_NAME WHERE DATA_ID=@P0)

We have no indexes; however, we have a PK set for our id column which seems unnecessary, should we remove this to improve insert speed?

Our server is currently at full load checking through 3 million or so worth of data to make sure we are not inserting duplicate data. We have tried upgrading our SQL Server for higher DTU but that doesn't seem to help at all.

When we have multiple jobs running at the same time checking for unique data or SQL Server comes to a crawl and insert speed takes forever.

Should we get rid of this unique data check and create new tables for every scraping job, then use a SQL Query to compare the differences, such as new data or data that was removed?

Query used for conditional insertion:

String sql = "IF NOT EXISTS (SELECT TOP 1 * FROM A_PROV_CVV_LDG_1 WHERE DATA_ID=?) " +
            "INSERT INTO A_PROV_CVV_LDG_1 (DATA_ID, SourceID, BASE_ID, BIN, BANK, CARD_TYPE, CARD_CLASS," +
            " CARD_LEVEL, CARD_EXP, COUNTRY, STATE, CITY, ZIP, DOB, SSN, EMAIL, PHONE, GENDER, ADDR_LINE_1, ADDR_LINE_2," +
            " FIRST_NAME, LAST_NAME, DateAddedToMarket, PRICE) " +
            "VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)";

This is the entire table definition, no indexes only PK is 'id' which seems unnecessary.

+-------------------+--------------+-----------+ | (PK)id | int | Unchecked | | DATA_ID | int | Checked | | SourceID | int | Checked | | BASE_ID | varchar(255) | Checked | | BIN | varchar(255) | Checked | | BANK | varchar(255) | Checked | | CARD_TYPE | varchar(255) | Checked | | CARD_CLASS | varchar(255) | Checked | | CARD_LEVEL | varchar(255) | Checked | | CARD_EXP | varchar(255) | Checked | | COUNTRY | varchar(255) | Checked | | STATE | varchar(255) | Checked | | CITY | varchar(255) | Checked | | ZIP | varchar(255) | Checked | | DOB | varchar(255) | Checked | | SSN | varchar(255) | Checked | | EMAIL | varchar(255) | Checked | | PHONE | varchar(255) | Checked | | GENDER | varchar(255) | Checked | | ADDR_LINE_1 | varchar(255) | Checked | | ADDR_LINE_2 | varchar(255) | Checked | | FIRST_NAME | varchar(255) | Checked | | LAST_NAME | varchar(255) | Checked | | PRICE | varchar(255) | Checked | | DateAddedToMarket | varchar(255) | Checked | | DateAdded | datetime | Unchecked | +-------------------+--------------+-----------+


Solution

  • You absolutely need a unique index on DATA_ID for your query--indeed for any deduplication attempt on DATA_ID---to work efficiently. Without it every attempted insert does a full table scan.

    Yes, indexes slow down insertion a little bit. But an index on an integer column isn't very expensive. Certainly not compared to the mess you're in now with a table scan for every insertion. Create that index.