How to interpret values of feats when using udpipe and R


With the udpipe package in R, if we run:

library(udpipe)
x <- udpipe("The economy is weak but the outlook is bright. the property market will be booming next year", "english")

The result is:

  doc_id paragraph_id sentence_id                                      sentence start end term_id token_id   token   lemma  upos
1   doc1            1           1 The economy is weak but the outlook is bright     1   3       1        1     The     the   DET
2   doc1            1           1 The economy is weak but the outlook is bright     5  11       2        2 economy economy  NOUN
3   doc1            1           1 The economy is weak but the outlook is bright    13  14       3        3      is      be   AUX
4   doc1            1           1 The economy is weak but the outlook is bright    16  19       4        4    weak    weak   ADJ
5   doc1            1           1 The economy is weak but the outlook is bright    21  23       5        5     but     but CCONJ
6   doc1            1           1 The economy is weak but the outlook is bright    25  27       6        6     the     the   DET
7   doc1            1           1 The economy is weak but the outlook is bright    29  35       7        7 outlook outlook  NOUN
8   doc1            1           1 The economy is weak but the outlook is bright    37  38       8        8      is      be   AUX
9   doc1            1           1 The economy is weak but the outlook is bright    40  45       9        9  bright  bright   ADJ
  xpos                                                 feats head_token_id dep_rel deps            misc
1   DT                             Definite=Def|PronType=Art             2     det <NA>            <NA>
2   NN                                           Number=Sing             4   nsubj <NA>            <NA>
3  VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin             4     cop <NA>            <NA>
4   JJ                                            Degree=Pos             0    root <NA>            <NA>
5   CC                                                  <NA>             9      cc <NA>            <NA>
6   DT                             Definite=Def|PronType=Art             7     det <NA>            <NA>
7   NN                                           Number=Sing             9   nsubj <NA>            <NA>
8  VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin             9     cop <NA>            <NA>
9   JJ                                            Degree=Pos             4    conj <NA> SpacesAfter=\\n

I have read through https://universaldependencies.org/ext-feat-index.html, but I still cannot understand what the values in the feats column mean.


Solution

  • These are morphological features of the words. Examples are gender, number, and case for nouns, and person, number, and aspect for verbs.

    This part of the Universal Dependencies annotation is not universal at all. The page you referenced lists every morphological feature that can appear in any language covered by UD. Most of them do not apply to most languages, and some phenomena appear multiple times under different names in different treebanks. To make the situation even trickier, some of the treebanks UDPipe is trained on do not contain morphological features at all. UDPipe can, of course, only output what it can learn from the treebanks.

    UD contains six different treebanks for English and therefore there are six different models in UDPipe as well. There is an overview at the UD webpage that explains how the treebanks differ and also explains the morphological features that are used for English. The default for English is UD_English-EWT.
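
To make the feats column easier to read, each value can be split into its Feature=Value pairs. The following is a minimal base R sketch, not part of udpipe: `parse_feats` is a hypothetical helper introduced here for illustration, and the input strings are copied from the output in the question.

```r
# Split a UD-style feats string ("Key=Value|Key=Value") into a named
# character vector. parse_feats is a hypothetical helper, not a udpipe function.
parse_feats <- function(feats) {
  if (is.na(feats)) {
    return(character(0))  # e.g. the token "but" has no features (<NA>)
  }
  pairs <- strsplit(feats, "|", fixed = TRUE)[[1]]  # one "Key=Value" per element
  kv <- strsplit(pairs, "=", fixed = TRUE)          # list of c(key, value)
  values <- vapply(kv, function(p) p[2], character(1))
  names(values) <- vapply(kv, function(p) p[1], character(1))
  values
}

# feats values copied from the output in the question
feats_is  <- parse_feats("Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin")
feats_the <- parse_feats("Definite=Def|PronType=Art")
feats_but <- parse_feats(NA_character_)

print(feats_is)
```

Read this way, the token "is" is tagged as an indicative, singular, third-person, present-tense, finite verb form; "The" is a definite article; and "weak" (Degree=Pos) is an adjective in the positive degree, that is, neither comparative nor superlative. If your installed version of udpipe provides `cbind_morphological`, it may offer a similar expansion into separate columns; check the package documentation before relying on it.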