I have a PostgreSQL table, let's call it tokens, containing grammatical annotations per token in lines of text, basically like this:
idx | line | tno | token | annotation | lemma
----+------+-----+---------+-----------------+---------
1 | I.01 | 1 | This | DEM.PROX | this
2 | I.01 | 2 | is | VB.COP.3SG.PRES | be
3 | I.01 | 3 | an | ART.INDEF | a
4 | I.01 | 4 | example | NN.INAN | example
I want to make a query that allows me to search for grammatical contexts, in this case, a query that checks whether a certain annotation is present in a window of size n before and after the current row. From what I read up on this, PostgreSQL's Window Functions LEAD
and LAG
are suitable to achieve this. As a first shot, I wrote the following query based on documentation I could find about these functions:
SELECT *
FROM (
SELECT token, annotation, lemma,
-- LAG(annotation) OVER prev_rows AS prev_anno, -- ?????
LEAD(annotation) OVER next_rows AS next_anno
FROM tokens
WINDOW next_rows AS (
ORDER BY line, tno ASC
ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING
)
ORDER BY line, tno ASC
) AS "window"
WHERE
lemma LIKE '...'
AND "window".next_anno LIKE '...'
;
However, this only searches the 2 following rows. My question is, how can I rephrase the query to make the window include both previous and following rows in the table? Apparently, I can't have 2 WINDOW
statements or do something like
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
AND ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING
I am not quiet sure if I got your use case correctly: You want to check if one given annotation is in one of the 5 rows (2 preceding, current, 2 following). Correct?
BETWEEN 2 PRECEDING AND 2 FOLLOWING
LEAD
or LAG
only give one value, in this case the one value after or before the current row - if a window supports it; no matter how many rows your window contains. But you want to check in any of these five rows.One way to achieve this:
SELECT *
FROM (
SELECT token, annotation, lemma,
unnest(array_agg(annotation) OVER w) as surrounded_annos -- 2
FROM tokens
WINDOW w AS ( -- 1
ORDER BY line, tno ASC
ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
)
ORDER BY line, tno ASC
) AS "window"
WHERE
lemma LIKE '...'
AND "window".surrounded_annos LIKE '...'
;
array_agg
which gives an arrayunnest
expands this array into one row for every element since IMHO there is no way to search array elements with LIKE
. This gives you this result (which can be filtered in the next step):Result subquery:
token annotation lemma surrounded_annos
This DEM.PROX this DEM.PROX
This DEM.PROX this VB.COP.3SG.PRES
This DEM.PROX this ART.INDEF
is VB.COP.3SG.PRES be DEM.PROX
is VB.COP.3SG.PRES be VB.COP.3SG.PRES
is VB.COP.3SG.PRES be ART.INDEF
is VB.COP.3SG.PRES be NN.INAN
an ART.INDEF a DEM.PROX
an ART.INDEF a VB.COP.3SG.PRES
an ART.INDEF a ART.INDEF
an ART.INDEF a NN.INAN
example NN.INAN example VB.COP.3SG.PRES
example NN.INAN example ART.INDEF
example NN.INAN example NN.