I would like to extract the content of a .docx file, chapter by chapter.
My .docx document has a table of contents, and every chapter has some content:
1. Intro
some text about Intro, these things, those things
2. Special information
these information are really special
2.1 General information about the environment
environment should be also important
2.2 Further information
and so on and so on
In the end, it would be great to get an Nx3 matrix containing the index number, the index name, and finally the content:
i_number | i_name | content
1 | Intro | some text about Intro, these things, those things
2 | Special information | these information are really special
...
Thanks for your help
You could export or copy-paste your .docx into a .txt file and apply this R script:
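library(stringr)
library(readr)
# Read the exported text file into a single string
doc <- read_file("filename.txt")
# A heading is a line made of a chapter number ("1.", "2.1"), spaces and a title
pattern_chapter <- regex("^(\\d+(?:\\.\\d+)*)\\.? +([^\r\n]{1,100})(?:\r?\n|$)", multiline = TRUE)
headings <- str_match_all(doc, pattern_chapter)[[1]]
# Split on the headings and drop whatever precedes the first one
paragraphs <- str_split(doc, pattern_chapter)[[1]][-1]
# One row per chapter: number, name, content
result <- data.frame(i_number = headings[, 2], i_name = headings[, 3], content = str_trim(paragraphs))
View(result)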
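To try the idea without exporting a file first, here is a minimal sketch that runs on the sample text from the question, inlined as a string. The `heading` pattern is an assumption about what a heading looks like (a dotted number at the start of a line, an optional trailing dot, spaces, then a title); it is a variation on the split-on-headings approach above, not the only way to write it.

```r
library(stringr)

# Sample text from the question, inlined so no file is needed
# (assumption: lines are separated by "\n")
doc <- paste(
  "1. Intro",
  "some text about Intro, these things, those things",
  "2. Special information",
  "these information are really special",
  "2.1 General information about the environment",
  "environment should be also important",
  "2.2 Further information",
  "and so on and so on",
  sep = "\n"
)

# Assumed heading shape: line start, dotted number, optional dot, spaces, title
heading <- regex("^(\\d+(?:\\.\\d+)*)\\.? +([^\r\n]+)(?:\r?\n|$)", multiline = TRUE)

# Column 2 holds the chapter number, column 3 the chapter title
heads <- str_match_all(doc, heading)[[1]]

# Text between headings; the first piece (before the first heading) is dropped
pieces <- str_split(doc, heading)[[1]][-1]

result <- data.frame(
  i_number = heads[, 2],
  i_name   = heads[, 3],
  content  = str_trim(pieces),
  stringsAsFactors = FALSE
)
print(result)
```

Taking the number from the capture group rather than from a running counter keeps sub-chapters such as 2.1 and 2.2 labelled with their own numbers.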
This script doesn't work if your document contains any line that begins with a number but is not a heading (e.g., footnotes or numbered lists).
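A small sketch of that limitation, using a made-up text with one real heading followed by a two-item numbered list. Both the text and the `heading` pattern are assumptions for illustration only; the point is that a number-based pattern cannot tell list items from headings.

```r
library(stringr)

# Hypothetical text: one real heading, then a numbered list inside the chapter
doc <- "1. Intro\nsome text\n1. first list item\n2. second list item\n"

# Assumed heading shape: line start, dotted number, optional dot, spaces, title
heading <- regex("^\\d+(?:\\.\\d+)*\\.? +[^\r\n]+", multiline = TRUE)

# Every line that looks like a heading, including the two list items
found <- str_extract_all(doc, heading)[[1]]
print(found)
```

All three numbered lines are reported as headings, so the list items would end up as separate "chapters" in the result.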
(Please, no mindless downvotes: this script works perfectly with the example given.)