Search code examples
pythonsqlsqlitesentence-similarity

How to compare text and select similar sentences in sqlite?


I'm using NLP to extract sentences containing certain keywords from SEC filings across different years. I store the output via pandas dataframe in sqlite. So far so good. The problem comes when I want to compare sentences from two different years, say 2022 and 2021.

I've been using the following query:

query = "select Nvidia_2022.Research as Research_2022, Nvidia_2021.Research as Research_2021 from Nvidia_2022 join Nvidia_2021 where '%' || Nvidia_2022.Research || '%' like '%' || Nvidia_2021.Research || '%'"

This works most of the time for sentences which are exactly the same. Here's the output.

['Such license and development arrangements can further enhance the reach of our technology.'

'Such license and development arrangements can further enhance the reach of our technology.']

Sometimes however, sentences differ slightly, like this:

['We have invested over $29 billion in research and development since our inception, yielding inventions that are essential to modern computing.'

'We have invested over $24 billion in research and development since our inception, yielding inventions that are essential to modern computing.']

$29 billion vs $24 billion

or there are other differences at the end of the sentences:

'Our Compute & Networking segment includes Data Center platforms and systems for AI, HPC, and accelerated computing; Mellanox networking and interconnect solutions; automotive AI Cockpit, autonomous driving development agreements, and autonomous vehicle solutions; cryptocurrency mining processors, or CMP; Jetson for robotics and other embedded platforms; and NVIDIA AI Enterprise and other software.'

'Our Compute & Networking segment includes Data Center platforms and systems for AI, HPC, and accelerated computing; Mellanox networking and interconnect solutions; automotive AI Cockpit, autonomous driving development agreements, and autonomous vehicle solutions; and Jetson for robotics and other embedded platforms.'

My questions:

Is there a way in sqlite or other sql databases to do as much text comparison work as possible, and then pass the most complicated sentences to python to do something like levenshtein_distance or transformers sentence comparison?

Or should I just stop bothering with SQL comparison queries, and get down to heavy lifting in python right away?

I'm trying to utilize as much sql as possible as it tends to be much faster than calculating distances in python.


Solution

  • sqlite3 supports full text search with the FTS5 Extension.

    You have to create a virtual table, and then you can use the MATCH keyword.

    -- create a virtual table
    CREATE VIRTUAL TABLE email USING fts5(sender, title, body);
    
    -- populate it ...
    
    -- perform a full text search
    SELECT * FROM email WHERE email MATCH 'fts5' ORDER BY rank;