UPOS Mappings - TensorFlow Datasets (TFDS)


I am using the TensorFlow Datasets (tfds) dataset xtreme_pos, which I load with the code below. It is annotated with universal part-of-speech (UPOS) labels, stored as int values. It's fairly easy to map them back to their part of speech by creating my own mapping (0 = ADJ, 7 = NOUN, etc.), but I was wondering whether there is a way to retrieve these class mappings from the tfds dataset itself?

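import tensorflow_datasets as tfds

# Load the English XTREME POS config together with its metadata (ds_info).
(orig_train, orig_dev, orig_test), ds_info = tfds.load(
    'xtreme_pos/xtreme_pos_en',
    split=['train', 'dev', 'test'],
    shuffle_files=True,
    with_info=True,
)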

Solution

  • One way is to find where the list of POS tags is defined in the TensorFlow Datasets source and import it into your own code. The list is the UPOS constant in the TensorFlow Datasets GitHub repository: https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/core/dataset_builders/conll/conllu_dataset_builder_utils.py#L31

    A tag's position in that list is its integer label, so display(pd.Series(UPOS)) shows each tag next to its index.


    Another way is to extract the values from the upos column of tfds.as_dataframe: take a few rows, concatenate their upos values, split on the separator character, and pass the result to set() to get the unique values.
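The first approach could be sketched roughly as follows. The module path is the one from the GitHub link above; it is an internal TFDS module and may move between releases, so the import is wrapped in a fallback. FALLBACK_UPOS, load_upos_names, build_id_to_tag and decode_labels are hypothetical names chosen for this illustration, and the fallback tuple assumes the 17 Universal Dependencies tags in alphabetical order (consistent with 0 = ADJ and 7 = NOUN from the question), which should be checked against the linked file.

```python
# Fallback copy of the UPOS tag names. Assumption: the TFDS UPOS constant
# lists the 17 Universal Dependencies tags in this alphabetical order;
# verify against the linked source file before relying on it.
FALLBACK_UPOS = (
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
)


def load_upos_names():
    """Return the UPOS tag names, preferring the constant shipped with TFDS."""
    try:
        # Internal module path taken from the GitHub link in the answer.
        from tensorflow_datasets.core.dataset_builders.conll import (
            conllu_dataset_builder_utils as conllu_lib,
        )
        return tuple(conllu_lib.UPOS)
    except (ImportError, AttributeError):
        # TFDS is missing or the constant has moved: use the local copy.
        return FALLBACK_UPOS


def build_id_to_tag(names):
    """Map each integer label to its tag name (list position = label)."""
    return dict(enumerate(names))


def decode_labels(label_ids, id_to_tag):
    """Turn a sequence of integer labels into their tag names."""
    return [id_to_tag[int(label_id)] for label_id in label_ids]
```

With this in place, id_to_tag = build_id_to_tag(load_upos_names()) gives the lookup table, and decode_labels(example['upos'], id_to_tag) would translate the labels of one example, assuming the upos field holds a sequence of integer labels as described in the question.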