Search code examples
postgresqlwindow-functionslinguistics

PostgreSQL combine LAG and LEAD to query n previous and following rows


I have a PostgreSQL table, let's call it tokens, containing grammatical annotations per token in lines of text, basically like this:

idx | line | tno | token   | annotation      | lemma
----+------+-----+---------+-----------------+---------
  1 | I.01 | 1   | This    | DEM.PROX        | this
  2 | I.01 | 2   | is      | VB.COP.3SG.PRES | be
  3 | I.01 | 3   | an      | ART.INDEF       | a
  4 | I.01 | 4   | example | NN.INAN         | example

I want to make a query that allows me to search for grammatical contexts, in this case, a query that checks whether a certain annotation is present in a window of size n before and after the current row. From what I read up on this, PostgreSQL's Window Functions LEAD and LAG are suitable to achieve this. As a first shot, I wrote the following query based on documentation I could find about these functions:

SELECT *
FROM (
    SELECT token, annotation, lemma,
        -- LAG(annotation) OVER prev_rows AS prev_anno, -- ?????
        LEAD(annotation) OVER next_rows AS next_anno
    FROM tokens
    WINDOW next_rows AS (
        ORDER BY line, tno ASC
        ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING
    )
    ORDER BY line, tno ASC
) AS "window"
WHERE
    lemma LIKE '...'
    AND "window".next_anno LIKE '...'
;

However, this only searches the 2 following rows. My question is, how can I rephrase the query to make the window include both previous and following rows in the table? Apparently, I can't have 2 WINDOW statements or do something like

ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
AND ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING

Solution

  • I am not quiet sure if I got your use case correctly: You want to check if one given annotation is in one of the 5 rows (2 preceding, current, 2 following). Correct?


    1. It is possible to define a window like BETWEEN 2 PRECEDING AND 2 FOLLOWING
    2. LEAD or LAG only give one value, in this case the one value after or before the current row - if a window supports it; no matter how many rows your window contains. But you want to check in any of these five rows.

    One way to achieve this:


    demo: db<>fiddle

    SELECT *
    FROM (
        SELECT token, annotation, lemma,
            unnest(array_agg(annotation) OVER w) as surrounded_annos      -- 2
        FROM tokens
        WINDOW w AS (                                                     -- 1
            ORDER BY line, tno ASC
            ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
        )
        ORDER BY line, tno ASC
    ) AS "window"
    WHERE
        lemma LIKE '...'
        AND "window".surrounded_annos LIKE '...'
    ;
    
    1. defining the window as explained above
      1. aggregate all annotations in these five rows (if possible) with array_agg which gives an array
      2. unnest expands this array into one row for every element since IMHO there is no way to search array elements with LIKE. This gives you this result (which can be filtered in the next step):

    Result subquery:

    token     annotation        lemma     surrounded_annos
    This      DEM.PROX          this      DEM.PROX
    This      DEM.PROX          this      VB.COP.3SG.PRES
    This      DEM.PROX          this      ART.INDEF
    is        VB.COP.3SG.PRES   be        DEM.PROX
    is        VB.COP.3SG.PRES   be        VB.COP.3SG.PRES
    is        VB.COP.3SG.PRES   be        ART.INDEF
    is        VB.COP.3SG.PRES   be        NN.INAN
    an        ART.INDEF         a         DEM.PROX
    an        ART.INDEF         a         VB.COP.3SG.PRES
    an        ART.INDEF         a         ART.INDEF
    an        ART.INDEF         a         NN.INAN
    example   NN.INAN           example   VB.COP.3SG.PRES
    example   NN.INAN           example   ART.INDEF
    example   NN.INAN           example   NN.