
How to extract plain text from a .docx file using R


Does anyone know of a tool they can recommend for extracting just the plain text from an article in .docx format (preferably with R)?

Speed isn't crucial, and we could even use a website with an API for uploading the files and extracting the text, but I've been unable to find one. I need to extract the introduction, the methods, the results, and the conclusion. I want to drop the abstract, the references, and especially the graphics and the tables. Thanks.


Solution

  • You can try the readtext package:

    library(readtext)
    x <- readtext("/path/to/file/myfile.docx")
    # x$text will contain the plain text in the file
    

    x is a data frame, and x$text contains just the text without any formatting, so to pull out specific information you need to search the string. For example, for the document you mentioned in your comment, one approach could be as follows:

    library(readtext)
    doc.text <- readtext("test.docx")$text
    
    # Split text into parts using new line character:
    doc.parts <- strsplit(doc.text, "\n")[[1]]
    
    # First line in the document: the name of the journal
    journal.name <- doc.parts[1]
    journal.name
    # [1] "International Journal of Science and Research (IJSR)"
    
    # Similarly, we can extract some other parts of the header
    issn <- doc.parts[2]
    issue <- doc.parts[3]
    
    # Search for the Abstract:
    abstract.loc <- grep("Abstract:", doc.parts)[1]
    
    # Search for the Keywords line
    Keywords.loc <- grep("Keywords:", doc.parts)[1]
    
    # The text in between these 2 keywords will be abstract text:
    abstract.text <- paste(doc.parts[abstract.loc:(Keywords.loc-1)], collapse=" ")
    
    # In the same way we can get the Keywords text, which ends where section 1 begins:
    Background.loc <- Keywords.loc + grep("1\\.", doc.parts[-(1:Keywords.loc)])[1]
    Keywords.text <- paste(doc.parts[Keywords.loc:(Background.loc-1)], collapse=" ")
    Keywords.text
    # [1] "Keywords: Nephronophtisis, NPHP1 deletion, NPHP4 mutations, Tunisian patients"
    
    # Assuming that Methods is part 2
    Methods.loc <- Background.loc + grep("2\\.", doc.parts[-(1:Background.loc)])[1]
    Background.text <- paste(doc.parts[Background.loc:(Methods.loc-1)], collapse=" ")
    
    
    # Assuming that Results is Part 3
    Results.loc <- Methods.loc + grep("3\\.", doc.parts[-(1:Methods.loc)])[1]
    Methods.text <- paste(doc.parts[Methods.loc:(Results.loc-1)], collapse=" ")
    
    # Similarly with other parts. For example for Acknowledgements section:
    Ack.loc <- grep("Acknowledgements", doc.parts)[1]
    Ref.loc <- grep("References", doc.parts)[1]
    Ack.text <- paste(doc.parts[Ack.loc:(Ref.loc-1)], collapse=" ")
    Ack.text
    # [1] "6. Acknowledgements We are especially grateful to the study participants. 
    # This study was supported by a grant from the Tunisian Ministry of Health and 
    # Ministry of Higher Education ...
    

    The exact approach depends on the structure that all the documents you need to search have in common. For example, if the first section is always named "Background", you can search for that word. However, if it is sometimes "Background" and sometimes "Introduction", you might want to search for the "1." pattern instead.
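
If the documents share numbered headings, the repeated grep/paste steps above could be wrapped in a small helper. This is only a sketch under assumptions: extract_sections, the heading patterns, and the sample lines are all hypothetical and not taken from your document. It assumes each heading starts its own line, so any text before the first matched heading (journal header, abstract, keywords) is dropped automatically. It does nothing special about tables or figure captions.

```r
# Hypothetical helper: split a vector of text lines into named sections.
# `headings` is a named character vector: section name -> regex for its heading line.
extract_sections <- function(lines, headings) {
  # Index of the first line matching each heading pattern (NA if not found)
  starts <- vapply(headings,
                   function(p) grep(p, lines, ignore.case = TRUE)[1],
                   integer(1))
  starts <- sort(starts[!is.na(starts)])
  if (length(starts) == 0) return(list())
  # Each section runs up to the line before the next heading that was found
  ends <- c(starts[-1] - 1L, length(lines))
  sections <- Map(function(s, e) paste(lines[s:e], collapse = " "), starts, ends)
  names(sections) <- names(starts)
  sections
}

# Made-up sample standing in for strsplit(readtext("test.docx")$text, "\n")[[1]]
lines <- c(
  "Some Journal",
  "Abstract: short summary.",
  "Keywords: a, b",
  "1. Introduction",
  "Intro text.",
  "2. Methods",
  "Method text.",
  "3. Results",
  "Result text.",
  "4. Conclusion",
  "Closing text.",
  "References",
  "[1] A reference."
)

# Assumed heading patterns; adjust them to the real documents
headings <- c(
  introduction = "^1\\. ",
  methods      = "^2\\. ",
  results      = "^3\\. ",
  conclusion   = "^4\\. ",
  references   = "^References"
)

sections <- extract_sections(lines, headings)
# "references" is only used as an end marker for the conclusion, then discarded
body <- sections[setdiff(names(sections), "references")]
```

Anchoring the patterns with "^" avoids matching a "1." that appears in the middle of a sentence, such as a decimal number or a year at the end of a citation.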