Search code examples
postgresqltsvector

Extract PostgreSQL text data proximate to keyword


I am working on a system that writes text logs to a text column in PostgreSQL 12. I want to perform a simple search on a keyword (this part's easy!), but then I want to extract only a section of the text content that appears proximate to the located search keyword (to provide context).

The logs contain lines separated by carriage returns. Say I want to search for the keyword "exception" and bring back each line in the text column that contains the keyword. Or, maybe I want to return the line with the keyword as well as the line before and the line afterwards.

Is this possible using PostgreSQL query features alone? Or will I need to write other custom code to extract just the lines I'm interested in. (BTW, I don't currently use the tsvector feature.)

Any help would be greatly appreciated. Thanks!


[Edited to provide additional details.]

Say I have a table called LOG_RECORDS that contains a TEXT column called LOG_CONTENT. One particular row of this table contain the following in LOG_CONTENT:

2021-05-28 13:34:20.6101-04 - Loading Configurations
2021-05-28 13:34:31.6656-04 - Settings Initialization Complete: Test Device Configuration
2021-05-28 13:34:31.6656-04 - RecordWriter:Initialize
2021-05-28 13:34:32.3982-04 - Migrating configuration database data from schema version 21 to 21
2021-05-28 13:34:32.4590-04 - Add User successful: Export
2021-05-28 13:34:44.6605-04 - SerialConnectionManagedReadBufferBase - ReadTask - System.InvalidOperationException: The operation is not allowed on non-connected sockets.
                                 at InTheHand.Net.Bluetooth.Msft.SocketBluetoothClient.GetStream()
                                 at Sensus.CommunicationDevices.SerialConnections.SerialConnectionBT.ReadInternal(Byte[] buffer, Int32 maxBytes)
                                 at Sensus.CommunicationDevices.SerialConnections.SerialConnectionManagedReadBufferBase.ReadTask()
2021-05-28 13:34:47.1807-04 - Communicator connection failed
2021-05-28 13:37:36.5337-04 - Linking
2021-05-28 13:37:37.2534-04 - Operation complete

To search for the keyword "exception" I can simply query the following:

SELECT * FROM LOG_RECORDS WHERE UPPER(LOG_CONTENT) LIKE '%EXCEPTION%';

But I would like to get back just the following (which represents the line containing the keyword "exception" plus the lines before and after), rather than the entire TEXT value:

2021-05-28 13:34:32.4590-04 - Add User successful: Export
2021-05-28 13:34:44.6605-04 - SerialConnectionManagedReadBufferBase - ReadTask - System.InvalidOperationException: The operation is not allowed on non-connected sockets.
                                 at InTheHand.Net.Bluetooth.Msft.SocketBluetoothClient.GetStream()

If that's too challenging, just getting back the line with the keyword would be great:

2021-05-28 13:34:44.6605-04 - SerialConnectionManagedReadBufferBase - ReadTask - System.InvalidOperationException: The operation is not allowed on non-connected sockets.

Hopefully this makes sense. Thank you!


Solution

  • Here is a solution but not very performant:

    select t.id, t2.arr || E'\n' ||t1.arr from log_records t 
    cross join lateral unnest(string_to_array(log_content,E'\n')) with ordinality t1(arr,e)
    join lateral unnest(string_to_array(log_content,E'\n')) with ordinality t2(arr,e) on t1.e=t2.e+1
    where t1.arr ~ 'Exception'
    and t.log_content ~ 'Exception'
    

    DEMO

    Also AFAIK use of Full text search is not possible in this case because your log does not contain Exception as a separate word or prefix of a word. It is present as a suffix or middle of a word and ts_query support only full text search or prefix search.

    Also add GIN index on you column to make the search faster