I am trying to do some fuzzy matching between two tables. One is a table I have stored locally (9,000 rows), call it table A. The other is stored in a SQLite database (2+ million rows, loaded from a CSV), call it table B. Basically, I want to match the column "CompanyNames" from table A with the column "CurrentEntityNames" from table B and use this to left join table B to table A.
I am currently able to loop through LIKE statements, passing a parameter for each name like so (myNames is just the column CompanyNames from table A as a list):
import pandas as pd

# engine is the SQLAlchemy engine pointing at the SQLite database (created elsewhere)
myresponses = [None] * len(myNames)
for index, name in enumerate(myNames):
    # prefix match on the company name
    sql = 'SELECT * from "table" WHERE CurrentEntityName LIKE ?;'
    param = (name + '%',)
    df = pd.read_sql_query(sql, engine, params=param)
    myresponses[index] = df
However, I have two questions: 1. I realize that querying once for each row in table A is probably not very efficient, since the goal is to minimize interaction with the database. If it's better to restructure this into fewer queries, how would I do that? 2. Would adding an index on CurrentEntityName make this faster?
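To make question 1 concrete, something like the following is roughly what I was imagining for batching names into fewer statements, though I have no idea whether it's the right approach (rough sketch, untested; the batch size is arbitrary):

# Sketch of what I mean by "fewer queries": hit the database once per batch of
# names instead of once per name.
batch_size = 100
results = []
for start in range(0, len(myNames), batch_size):
    batch = myNames[start:start + batch_size]
    # one LIKE clause per name in the batch, OR'd together
    where = ' OR '.join(['CurrentEntityName LIKE ?'] * len(batch))
    sql = f'SELECT * FROM "table" WHERE {where};'
    params = tuple(name + '%' for name in batch)
    results.append(pd.read_sql_query(sql, engine, params=params))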
For approach 2., I tried to add the index using the following (found in another Stack Overflow answer):
import sqlalchemy

meta = sqlalchemy.MetaData()
meta.reflect(bind=engine)
table = meta.tables['table']
# this only defines the Index object; it still has to be created on the database
my_index = sqlalchemy.Index('nameIds', table.columns.get('CurrentEntityName'))
but I'm not sure how to implement this when querying.
For approach 1., I've seen some examples using conn and cursor, but I'm not sure how to use those with a database created from the engine. I loaded my data using the following,
# chunksize, i, and j are set before the loop; engine is the same SQLAlchemy engine as above
for df in pd.read_csv("C://Users//SEAB//Downloads//Active_Corporations___Beginning_1800.csv", chunksize=chunksize, iterator=True):
    # strip the spaces out of the column names so they are easier to reference in SQL
    df = df.rename(columns={c: c.replace(' ', '') for c in df.columns})
    df.index += j  # keep the index increasing across chunks
    i += 1
    df.to_sql('table', engine, if_exists='append')
    j = df.index[-1] + 1
found in this tutorial: https://plot.ly/python/v3/big-data-analytics-with-pandas-and-sqlite/
Basically, the query is still really slow (taking maybe more than 1 hour for 9000 rows). I really appreciate any advice or help. I'm new to sqlite3 so there's a lot I don't know. Thank you for your understanding.
The rules for when SQLite can use an index with LIKE are many, but it can do it.
Essentially, given the default case-insensitive behavior: you need a column with TEXT affinity on the left-hand side. The right-hand side needs to be a string literal (or, if the statement is compiled with sqlite3_prepare_v2(), a parameter bound to a string) of the format 'XXX%' - that is, a leading constant value before any wildcards. Given that, if there's a case-insensitive index on the left-hand column, it can rewrite the query to use that index instead of looking at every single row.
Some examples from an interactive session:
sqlite> CREATE TABLE ex(col1 TEXT, col2 TEXT COLLATE NOCASE);
sqlite> CREATE INDEX ex_col1_idx ON ex(col1);
sqlite> CREATE INDEX ex_col2_idx ON ex(col2);
sqlite> EXPLAIN QUERY PLAN SELECT * FROM ex WHERE col1 LIKE 'foo%';
QUERY PLAN
`--SCAN TABLE ex
sqlite> EXPLAIN QUERY PLAN SELECT * FROM ex WHERE col2 LIKE 'foo%';
QUERY PLAN
`--SEARCH TABLE ex USING INDEX ex_col2_idx (col2>? AND col2<?)
sqlite> CREATE INDEX ex_col1_idx_nocase ON ex(col1 COLLATE NOCASE);
sqlite> EXPLAIN QUERY PLAN SELECT * FROM ex WHERE col1 LIKE 'foo%';
QUERY PLAN
`--SEARCH TABLE ex USING INDEX ex_col1_idx_nocase (col1>? AND col1<?)
sqlite> .parameter init
sqlite> .parameter set ?1 'foo%'
sqlite> EXPLAIN QUERY PLAN SELECT * FROM ex WHERE col1 LIKE ?;
QUERY PLAN
`--SEARCH TABLE ex USING INDEX ex_col1_idx_nocase (col1>? AND col1<?)
As you can see, the indexed column being searched needs to explicitly be given a case-insensitive collation in the table definition, or have an explicitly case-insensitive index.
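Tying that back to the question's setup: a case-insensitive index could be created through the existing SQLAlchemy engine with raw SQL along these lines (just a sketch; the table name "table" and the column CurrentEntityName are taken from the question, and the index name is made up):

import sqlalchemy

# Sketch: an explicitly case-insensitive (NOCASE) index on the column used in the
# LIKE queries, so the prefix search can use the index instead of a full table scan.
with engine.begin() as conn:
    conn.execute(sqlalchemy.text(
        'CREATE INDEX IF NOT EXISTS currententityname_nocase_idx '
        'ON "table"(CurrentEntityName COLLATE NOCASE);'
    ))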
The big potential for things going bad in your case is how the Python sqlite bindings prepare the statements used with the execute methods - does it use the old sqlite3_prepare() or the newer sqlite3_prepare_v2() API? If I'm looking at the right source file, it uses the latter, so that shouldn't be an issue.
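As a quick sanity check from Python (a sketch mirroring the question's read_sql_query pattern; 'ACME%' is just a made-up example prefix), you can ask SQLite for the plan of the exact parameterized statement:

import pandas as pd

# Sketch: with the NOCASE index in place this should report a line like
# "SEARCH TABLE ... USING INDEX ..." rather than "SCAN TABLE".
plan = pd.read_sql_query(
    'EXPLAIN QUERY PLAN SELECT * FROM "table" WHERE CurrentEntityName LIKE ?;',
    engine,
    params=('ACME%',),
)
print(plan)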