I am trying to use multithreading to insert raw_html into postgres, but I can't get the data to insert and keep getting:
LINE 1: INSERT INTO raw_html VALUES b'<!DOCTYPE html><html lang="en"...
                                    ^
Error: syntax error at or near "b'<!DOCTYPE html
I can't figure out what the b is doing in front of each of these strings (nor can I find anything online explaining it), and I am assuming that's why the insert is failing. I am also trying to convert the raw_html string to a tuple, because I think it needs to be a tuple for SQL to accept it. I can't figure out what to try next to fix these two issues. Any direction appreciated:
UPDATE:
import time
import psycopg2

def execute_values(LST):
    DB_HOST = "localhost"
    DB_NAME = "yellow_pages_scraper"
    DB_USER = "justinbenfit"
    DB_PASS = "postgres"
    DB_URL = "mydburl"
    conn = psycopg2.connect(dbname=DB_NAME, user=DB_USER, password=DB_PASS, host=DB_HOST)
    cur = conn.cursor()
    try:
        LST = tuple(LST)
        for L in LST:
            cur.execute("INSERT INTO raw_html VALUES (%s)", (L.decode('utf-8'),))
        conn.commit()
    except (Exception, psycopg2.DatabaseError) as error:
        print("Error: %s" % error)
        conn.rollback()
        return 1
    print(f"the dataframe is inserted")
def main(LINKS):
    t0 = time.time()
    LST = download_raw(LINKS)
    execute_values(LST)
    t1 = time.time()
    print(f"{t1-t0} seconds to download {len(LST)} raw_html.")
There are several problems. First, you are sending a bytes string instead of a Unicode string. Second, you are not doing any quoting. Third, your INSERT syntax is wrong; the values need to be in parentheses. Fourth, you should let the database driver do the quoting for you, to prevent SQL injection attacks.
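The b prefix simply marks a Python bytes literal rather than a str. If you fetched the pages with something like requests (an assumption, since your download code isn't shown), response.content gives you bytes while response.text gives you str. Decoding turns the bytes back into text:

>>> raw = b'<!DOCTYPE html><html lang="en">'
>>> type(raw)
<class 'bytes'>
>>> raw.decode('utf-8')
'<!DOCTYPE html><html lang="en">'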
This should work (note that psycopg2 uses %s as its parameter placeholder, not ?):

cur.execute("INSERT INTO raw_html VALUES (%s)", (L.decode('utf-8'),))
However, your whole concept is flawed. You have each thread opening a new connection to the database. That's horrible. Database insertion is not your bottleneck here; you should be able to do this serially over a single connection. Just delay the commit until the whole set is done.
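Here is a minimal sketch of that serial pattern, assuming pages is the list of bytes objects your download step produces (insert_raw_html and pages are hypothetical names). It uses psycopg2's execute_values helper from psycopg2.extras to send the whole batch in one statement, with a single commit at the end:

import psycopg2
from psycopg2.extras import execute_values

def insert_raw_html(pages):
    # pages: an iterable of bytes objects holding the downloaded HTML (assumed)
    conn = psycopg2.connect(dbname="yellow_pages_scraper",
                            user="justinbenfit",
                            password="postgres",
                            host="localhost")
    try:
        with conn.cursor() as cur:
            # One round trip for the whole batch; the driver does all quoting.
            execute_values(cur,
                           "INSERT INTO raw_html VALUES %s",
                           [(p.decode('utf-8'),) for p in pages])
        conn.commit()  # single commit once the whole set has been inserted
    except psycopg2.DatabaseError as error:
        conn.rollback()
        print(f"Error: {error}")
    finally:
        conn.close()

Giving the wrapper its own name also avoids shadowing the imported helper, since your code defines its own function called execute_values.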