xml xpath nlp xquery

Extract the list of nodes between two nodes in XQuery


I am working on an NLP project and need to extract some information from an XML document. Here is a piece of it. Each item node is a token with its part of speech, tag, lemma, and so on.

<basetalismane>
<file type="titre" name="2017/01/01/19-00-00/0,2-3208,1-0,0.xml">
<p type="description">
<item><a>1</a><a>Le</a><a>le</a><a>DET</a><a>DET</a><a>n=s|g=m</a><a>2</a><a>det</a><a>2</a><a>det</a></item>
<item><a>2</a><a>bateau</a><a>bateau</a><a>NC</a><a>NC</a><a>n=s|g=m</a><a>4</a><a>suj</a><a>4</a><a>suj</a></item>
<item><a>3</a><a>se</a><a>se</a><a>CLR</a><a>CLR</a><a>n=p,s|p=3</a><a>4</a><a>aff</a><a>4</a><a>aff</a></item>
<item><a>4</a><a>rendait</a><a>rendre</a><a>V</a><a>V</a><a>n=s|t=I|p=3</a><a>0</a><a>root</a><a>0</a><a>root</a></item>
<item><a>5</a><a>sur</a><a>sur</a><a>P</a><a>P</a><a></a><a>4</a><a>mod</a><a>4</a><a>mod</a></item>
<item><a>6</a><a>l'</a><a>le</a><a>DET</a><a>DET</a><a>n=s</a><a>7</a><a>det</a><a>7</a><a>det</a></item>
<item><a>7</a><a>île</a><a>île</a><a>NC</a><a>NC</a><a>n=s|g=f</a><a>5</a><a>prep</a><a>5</a><a>prep</a></item>
<item><a>8</a><a>de</a><a>de</a><a>P</a><a>P</a><a></a><a>4</a><a>mod</a><a>4</a><a>mod</a></item>
<item><a>9</a><a>Tidung</a><a>_</a><a>NPP</a><a>NPP</a><a></a><a>8</a><a>prep</a><a>8</a><a>prep</a></item>
<item><a>10</a><a>,</a><a>,</a><a>PONCT</a><a>PONCT</a><a></a><a>9</a><a>ponct</a><a>9</a><a>ponct</a></item>
<item><a>11</a><a>destination</a><a>destination</a><a>NC</a><a>NC</a><a>n=s|g=f</a><a>4</a><a>mod</a><a>4</a><a>mod</a></item>
<item><a>12</a><a>touristique</a><a>touristique</a><a>ADJ</a><a>ADJ</a><a>n=s</a><a>11</a><a>mod</a><a>11</a><a>mod</a></item>
<item><a>13</a><a>à</a><a>à</a><a>P</a><a>P</a><a></a><a>11</a><a>dep</a><a>11</a><a>dep</a></item>
<item><a>14</a><a>50</a><a>50</a><a>ADJ</a><a>ADJ</a><a></a><a>15</a><a>mod</a><a>15</a><a>mod</a></item>
<item><a>15</a><a>km</a><a>kilomètre</a><a>NC</a><a>NC</a><a>g=m</a><a>13</a><a>prep</a><a>13</a><a>prep</a></item>
<item><a>16</a><a>de</a><a>de</a><a>P</a><a>P</a><a></a><a>4</a><a>mod</a><a>4</a><a>mod</a></item>
<item><a>17</a><a>Jakarta</a><a>_</a><a>NPP</a><a>NPP</a><a></a><a>16</a><a>prep</a><a>16</a><a>prep</a></item>
<item><a>18</a><a>,</a><a>,</a><a>PONCT</a><a>PONCT</a><a></a><a>17</a><a>ponct</a><a>17</a><a>ponct</a></item>
<item><a>19</a><a>quand</a><a>quand</a><a>CS</a><a>CS</a><a></a><a>4</a><a>mod</a><a>4</a><a>mod</a></item>
<item><a>20</a><a>le</a><a>le</a><a>DET</a><a>DET</a><a>n=s|g=m</a><a>21</a><a>det</a><a>21</a><a>det</a></item>
<item><a>21</a><a>moteur</a><a>moteur</a><a>NC</a><a>NC</a><a>n=s|g=m</a><a>23</a><a>suj</a><a>23</a><a>suj</a></item>
<item><a>22</a><a>a</a><a>avoir</a><a>V</a><a>V</a><a>n=s|t=P|p=3</a><a>23</a><a>aux_tps</a><a>23</a><a>aux_tps</a></item>
<item><a>23</a><a>eu</a><a>avoir</a><a>VPP</a><a>VPP</a><a>n=s|g=m|t=K</a><a>19</a><a>sub</a><a>19</a><a>sub</a></item>
<item><a>24</a><a>des</a><a>des</a><a>DET</a><a>DET</a><a>n=p</a><a>25</a><a>det</a><a>25</a><a>det</a></item>
<item><a>25</a><a>problèmes</a><a>problème</a><a>NC</a><a>NC</a><a>n=p|g=m</a><a>23</a><a>obj</a><a>23</a><a>obj</a></item>
<item><a>26</a><a>,</a><a>,</a><a>PONCT</a><a>PONCT</a><a></a><a>25</a><a>ponct</a><a>25</a><a>ponct</a></item>
<item><a>27</a><a>puis</a><a>puis</a><a>CC</a><a>CC</a><a></a><a>23</a><a>coord</a><a>23</a><a>coord</a></item>
<item><a>28</a><a>a</a><a>avoir</a><a>V</a><a>V</a><a>n=s|t=P|p=3</a><a>29</a><a>aux_tps</a><a>29</a><a>aux_tps</a></item>
<item><a>29</a><a>explosé</a><a>exploser</a><a>VPP</a><a>VPP</a><a>n=s|g=m|t=K</a><a>27</a><a>dep_coord</a><a>27</a><a>dep_coord</a></item>
<item><a>30</a><a>.</a><a>.</a><a>PONCT</a><a>PONCT</a><a></a><a>29</a><a>ponct</a><a>29</a><a>ponct</a></item>
</p>
<p type="description">
<item><a>1</a><a>Il</a><a>il</a><a>CLS</a><a>CLS</a><a>n=s|g=m|p=3</a><a>3</a><a>suj</a><a>3</a><a>suj</a></item>
<item><a>2</a><a>a</a><a>avoir</a><a>V</a><a>V</a><a>n=s|t=P|p=3</a><a>3</a><a>aux_tps</a><a>3</a><a>aux_tps</a></item>
<item><a>3</a><a>annoncé</a><a>annoncer</a><a>VPP</a><a>VPP</a><a>n=s|g=m|t=K</a><a>0</a><a>root</a><a>0</a><a>root</a></item>
<item><a>4</a><a>que</a><a>que</a><a>CS</a><a>CS</a><a></a><a>3</a><a>obj</a><a>3</a><a>obj</a></item>
<item><a>5</a><a>la</a><a>la</a><a>DET</a><a>DET</a><a>n=s|g=f</a><a>6</a><a>det</a><a>6</a><a>det</a></item>
<item><a>6</a><a>reconquête</a><a>reconquête</a><a>NC</a><a>NC</a><a>n=s|g=f</a><a>16</a><a>suj</a><a>16</a><a>suj</a></item>
<item><a>7</a><a>de</a><a>de</a><a>P</a><a>P</a><a></a><a>6</a><a>dep</a><a>6</a><a>dep</a></item>
<item><a>8</a><a>la</a><a>la</a><a>DET</a><a>DET</a><a>n=s|g=f</a><a>10</a><a>det</a><a>10</a><a>det</a></item>
<item><a>9</a><a>"</a><a>"</a><a>PONCT</a><a>PONCT</a><a></a><a>8</a><a>ponct</a><a>8</a><a>ponct</a></item>
<item><a>10</a><a>capitale</a><a>capitale</a><a>NC</a><a>NC</a><a>n=s|g=f</a><a>7</a><a>prep</a><a>7</a><a>prep</a></item>
<item><a>11</a><a>"</a><a>"</a><a>PONCT</a><a>PONCT</a><a></a><a>10</a><a>ponct</a><a>10</a><a>ponct</a></item>
<item><a>12</a><a>autoproclamée</a><a>autoproclamer</a><a>VPP</a><a>VPP</a><a>n=s|g=f|t=K</a><a>10</a><a>mod</a><a>10</a><a>mod</a></item>
<item><a>13</a><a>de</a><a>de</a><a>P</a><a>P</a><a></a><a>12</a><a>mod</a><a>12</a><a>mod</a></item>
<item><a>14</a><a>l'</a><a>le</a><a>DET</a><a>DET</a><a>n=s</a><a>15</a><a>det</a><a>15</a><a>det</a></item>
<item><a>15</a><a>EI</a><a>_</a><a>NPP</a><a>NPP</a><a></a><a>13</a><a>prep</a><a>13</a><a>prep</a></item>
<item><a>16</a><a>était</a><a>être</a><a>V</a><a>V</a><a>n=s|t=I|p=3</a><a>4</a><a>sub</a><a>4</a><a>sub</a></item>
<item><a>17</a><a>"</a><a>"</a><a>PONCT</a><a>PONCT</a><a></a><a>16</a><a>ponct</a><a>16</a><a>ponct</a></item>
<item><a>18</a><a>une</a><a>une</a><a>DET</a><a>DET</a><a>n=s|g=f</a><a>19</a><a>det</a><a>19</a><a>det</a></item>
<item><a>19</a><a>question</a><a>question</a><a>NC</a><a>NC</a><a>n=s|g=f</a><a>16</a><a>obj</a><a>16</a><a>obj</a></item>
<item><a>20</a><a>de</a><a>de</a><a>P</a><a>P</a><a></a><a>19</a><a>dep</a><a>19</a><a>dep</a></item>
<item><a>21</a><a>semaines</a><a>semaine</a><a>NC</a><a>NC</a><a>n=p|g=f</a><a>20</a><a>prep</a><a>20</a><a>prep</a></item>
<item><a>22</a><a>"</a><a>"</a><a>PONCT</a><a>PONCT</a><a></a><a>21</a><a>ponct</a><a>21</a><a>ponct</a></item>
<item><a>23</a><a>.</a><a>.</a><a>PONCT</a><a>PONCT</a><a></a><a>21</a><a>ponct</a><a>21</a><a>ponct</a></item>
<item><a>24</a><a>§</a><a>§</a><a>PONCT</a><a>PONCT</a><a></a><a>21</a><a>ponct</a><a>21</a><a>ponct</a></item>
</p>
</file>
</basetalismane>

I am working on syntactic dependencies. As you can see, the item nodes are tokens (with part-of-speech tag, lemma, etc.). My task is to target each item with a[8]='sub'. I then need to extract the words covered by that relation. The a[9] value is the index of the token where the syntactic dependency begins. In the first sentence (the first p node with type="description"), the sub item is

<item><a>23</a><a>eu</a><a>avoir</a><a>VPP</a><a>VPP</a><a>n=s|g=m|t=K</a><a>19</a><a>sub</a><a>19</a><a>sub</a></item>

I need to extract its a[9] (here, 19). This is the index of the first word of my syntactic dependency, that is, the following item (matched on its index in a[1]):

<item><a>19</a><a>quand</a><a>quand</a><a>CS</a><a>CS</a><a></a><a>4</a><a>mod</a><a>4</a><a>mod</a></item>

What do I have to do? Get all the items (in fact, their a[2] values) from this word up to my 'sub' item, both included. For the first sentence, the expected output is

quand le moteur a eu

So this is an extraction of the nodes between two nodes identified by their index. Here is my current code; I can't grab the item nodes that lie between the two. Note that a sentence may contain more than one sub item, so I needed to add a for loop.

for $p in /basetalismane/file/*//p
let $items := /$p//item[a[8]='sub']
for $item in $items
let $target := /$item/a[9]
let $source := /$item/a[1]
return (
for $i in ($target to $source)
return string-join( $p/item[$i]/a[2]  , ' '))

I only get each word separately, not the sequence: I can't concatenate the strings word by word. I did a return $nodes to see what I grab, and it is only the sub items. I want the items in between: either a list of items or a string built from their a[2] values, so that I get the words. For the second sentence, the expected output is

que la reconquête de la "capitale" autoproclamée de l'EI était

Thanks for your help. I hope this is clear; it is hard to explain (I am French).


Solution

  • I think

    declare namespace output = "http://www.w3.org/2010/xslt-xquery-serialization";
    
    declare option output:method 'text';
    declare option output:item-separator '&#10;';
    
    for $item in //p[@type = 'description']/item[a[8] = 'sub']
    return 
        string-join(
          $item/parent::p/item[a[1] = $item/a[9]]/
            (., 
            let $next := following-sibling::item[a[8] = 'sub'][1] 
            return (following-sibling::item[. << $next], $next))/a[2],
          ' '
        )
    

    gives

    quand le moteur a eu
    que la reconquête de la " capitale " autoproclamée de l' EI était
    

    Perhaps windowing or fold-right can also help express it.
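
A variant that avoids searching for the next sub sibling is to select the items by their index value directly. The sketch below rests on assumptions that the question does not state: that a[1] and a[9] always hold integers for sub items, and that the start index in a[9] is not greater than the sub item's own index in a[1] (otherwise the range is empty and the result is an empty string). It also illustrates why the attempt in the question returned single words: there, string-join was called inside the inner for, once per index, rather than once over the whole sequence.

```xquery
declare namespace output = "http://www.w3.org/2010/xslt-xquery-serialization";

declare option output:method 'text';
declare option output:item-separator '&#10;';

for $item in //p[@type = 'description']/item[a[8] = 'sub']
(: assumption: both values are integers and $from <= $to :)
let $from := xs:integer($item/a[9])   (: index where the dependency starts :)
let $to   := xs:integer($item/a[1])   (: index of the sub item itself :)
return
    (: a single string-join over the whole range, not one call per index :)
    string-join(
      $item/parent::p/item[xs:integer(a[1]) = ($from to $to)]/a[2],
      ' '
    )
```

On the sample document this should return the same two lines as the query above. Because the items are matched on the value of a[1] rather than on their position, the query does not depend on the indices starting at 1 or being gap-free within a p.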