Tags: r, text, lemmatization, read-data

Reading a text file with an unusual delimiter


I am using an algorithm to lemmatize a text vector. The output is a .txt file in the format shown below.

The original word is listed in the first column, whilst the various lemmas are listed in the second column, followed by some grammatical classifications. I want to read this into R, but have no idea how to do this. I have tried various forms of separators, but none seem to work.

Ideally, the data frame in R should look like the wanted structure below, where I keep only the first occurrence of each lemma.


Perhaps the best option would be to read the data, keep only the first occurrence (i.e. da da adv), then do something like text-to-columns and keep only the first two columns.

Output from lemmatization algorithm:

"<da>"
    "da" adv
    "da" sbu
    "da" subst fork
"<dette>"
    "dette" det dem nøyt ent
    "dette" pron nøyt ent pers 3
    "dette" verb inf
"<er>"
    "være" verb pres <aux1/perf_part>
"<den>"
    "den" det dem fem ent
    "den" det dem mask ent
    "den" pron mask fem ent pers 3

Wanted structure:

da      da
dette   dette
er      være
den     den

Solution

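  • Here's an interesting result: you can read the file quite nicely with read.table. With sep='' any run of whitespace separates fields, the double quotes are stripped by the default quote setting, fill=TRUE pads short rows with empty strings, and flush=TRUE discards anything after the last column on longer rows: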

    s <- '"<da>"
        "da" adv
        "da" sbu
        "da" subst fork
    "<dette>"
        "dette" det dem nøyt ent
        "dette" pron nøyt ent pers 3
        "dette" verb inf
    "<er>"
        "være" verb pres <aux1/perf_part>
    "<den>"
        "den" det dem fem ent
        "den" det dem mask ent
        "den" pron mask fem ent pers 3
     '
    
    x <- read.table(sep='', text=s, colClasses='character', flush=TRUE, fill=TRUE)
    
    > x
            V1    V2   V3
    1     <da>           
    2       da   adv     
    3       da   sbu     
    4       da subst fork
    5  <dette>           
    6    dette   det  dem
    7    dette  pron nøyt
    8    dette  verb  inf
    9     <er>           
    10    være  verb pres
    11   <den>           
    12     den   det  dem
    13     den   det  dem
    14     den  pron mask
    

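    Using the dplyr and tidyr packages, we can collapse this to one row per word. Note that unique(V1) keeps the header plus every distinct lemma in its group, so this yields two columns only when each word has a single distinct lemma, as in this sample:

    library(dplyr)
    library(tidyr)
    # a flags header rows such as <da>; cumsum(a) gives each word's block an id
    (y <- x %>% mutate(a=grepl('<', V1, fixed=TRUE), b=cumsum(a)) %>%
      group_by(b) %>%
      summarise(verbs=list(t(unique(V1)))) %>%
      unnest(cols=c(verbs)))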
    # A tibble: 4 x 2
          b verbs[,1] [,2] 
      <int> <chr>     <chr>
    1     1 <da>      da   
    2     2 <dette>   dette
    3     3 <er>      være 
    4     4 <den>     den  
    
    result <- y$verbs
    result[,1] <- gsub('(<|>)', '', result[,1])
    
    > result
         [,1]    [,2]   
    [1,] "da"    "da"   
    [2,] "dette" "dette"
    [3,] "er"    "være" 
    [4,] "den"   "den"
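
The approach suggested in the question (keep only the first lemma line under each word, then keep two columns) can also be sketched in base R without any packages. This is a minimal sketch, not a definitive implementation: it assumes every `"<word>"` header line is immediately followed by at least one lemma line, and the names `lines`, `is_header`, `first_lemma` and `lemmas` are illustrative only. For a real file, the vector would come from `readLines()` with an appropriate encoding instead of being typed in.

```r
# Sample input, one element per line, as readLines() would return it.
lines <- c(
  '"<da>"',
  '    "da" adv',
  '    "da" sbu',
  '    "da" subst fork',
  '"<dette>"',
  '    "dette" det dem nøyt ent',
  '    "dette" pron nøyt ent pers 3',
  '    "dette" verb inf',
  '"<er>"',
  '    "være" verb pres <aux1/perf_part>',
  '"<den>"',
  '    "den" det dem fem ent',
  '    "den" det dem mask ent',
  '    "den" pron mask fem ent pers 3'
)

# Header lines consist solely of a quoted <word> token.
is_header <- grepl('^"<.*>"$', trimws(lines))

# Assumption: the line right after each header is that word's first lemma.
first_lemma <- lines[which(is_header) + 1]

lemmas <- data.frame(
  # Strip the leading "< and the trailing >" from the header.
  word  = gsub('^"<|>"$', '', trimws(lines[is_header])),
  # Keep only the text inside the first pair of double quotes.
  lemma = sub('^[[:space:]]*"([^"]*)".*$', '\\1', first_lemma),
  stringsAsFactors = FALSE
)

print(lemmas)
```

Because only the line directly after each header is used, words with several distinct lemmas still produce exactly one row, which matches the "first occurrence" requirement in the question.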