
.docx file chapter extraction


I would like to extract the content of a .docx file chapter by chapter. My .docx document has a table of contents, and every chapter has some content:

 1. Intro
   some text about Intro, these things, those things
 2. Special information
   these information are really special
    2.1 General information about the environment
      environment should be also important
    2.2 Further information 
      and so on and so on

In the end, it would be great to get an N x 3 matrix containing the index number, the index name, and the content.

i_number     i_name                 content
1            Intro                  some text about Intro, these things, those things
2            Special Information    these information are really special
... 

Thanks for your help


Solution

  • You could export or copy-paste your .docx into a .txt file and apply this R script:

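    library(stringr)
    library(readr)
    
    # Read the whole exported text file into a single string
    doc <- read_file("filename.txt")
    
    # A heading is a number and a period, then 4 to 100 characters up to the line break
    # (\r?\n accepts both Windows and Unix line endings)
    pattern_chapter <- regex("(\\d+\\.)(.{4,100}?)(?:\r?\n)", dotall = TRUE)
    
    # Full heading text of every match, without the trailing line break
    i_name <- str_trim(str_match_all(doc, pattern_chapter)[[1]][, 1])
    # Splitting on the headings yields the text before the first heading,
    # followed by exactly one piece per heading, so drop the first piece
    paragraphs <- str_split(doc, pattern_chapter)[[1]]
    content <- str_trim(paragraphs[-1])
    
    result <- data.frame(i_name, content)
    result$i_number <- seq_len(nrow(result))
    
    View(result)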
    

    It doesn't work if your document contains any line, other than a heading, that begins with a number followed by a period (e.g. footnotes or numbered lists).

    (Please, no mindless downvotes: this script works perfectly with the example given.)
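If you also want the index number and the index name in separate columns, as in the table from the question, a line-by-line variant might look like the sketch below. It uses base R only and runs on the example text from the question instead of a file. The names `heading_pattern`, `txt_lines` and `chapter_id` are made up for this illustration, and it assumes that every heading starts with numbering such as `1.` or `2.1` followed by a space. The same limitation applies: any other line that starts with a number would be mistaken for a heading.

```r
# Example text from the question; in practice this would be the exported .txt content
doc <- paste(
  " 1. Intro",
  "   some text about Intro, these things, those things",
  " 2. Special information",
  "   these information are really special",
  "    2.1 General information about the environment",
  "      environment should be also important",
  "    2.2 Further information ",
  "      and so on and so on",
  sep = "\n"
)

# Split into lines, strip surrounding whitespace, drop empty lines
txt_lines <- trimws(strsplit(doc, "\r?\n")[[1]])
txt_lines <- txt_lines[txt_lines != ""]

# Assumption: a heading looks like "1. Name" or "2.1 Name"
heading_pattern <- "^(\\d+(?:\\.\\d+)*)\\.?\\s+(.+)$"
is_heading <- grepl(heading_pattern, txt_lines, perl = TRUE)

# Every line belongs to the most recent heading above it (0 = before the first heading)
chapter_id <- cumsum(is_heading)

# Collect the non-heading lines of each chapter into one string
content <- vapply(seq_len(sum(is_heading)), function(i) {
  paste(txt_lines[chapter_id == i & !is_heading], collapse = " ")
}, character(1))

headings <- txt_lines[is_heading]
result <- data.frame(
  i_number = sub(heading_pattern, "\\1", headings, perl = TRUE),
  i_name   = sub(heading_pattern, "\\2", headings, perl = TRUE),
  content  = content,
  stringsAsFactors = FALSE
)

print(result)
```

Working line by line keeps one row per heading even when a chapter has no text of its own, because an empty chapter simply gets an empty string as content.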