
Use of wildcards with readtext()


A basic question. I have a set of transcripts (.docx files) that I want to read into a corpus. Reading a single file with readtext() works without problems:

dat <- readtext("~/ownCloud/NLP/interview_1.docx")

As soon as I use the wildcard "*.docx" in the readtext() call, it throws an error:

dat <- readtext("~/ownCloud/NLP/*.docx")

Error: '/var/folders/bl/61g7ngh55vs79cfhfhnstd4c0000gn/T//RtmpWD6KSx/readtext-aa71916b691c0cf3cabc73a2e04a45f7/word/document.xml' does not exist.
In addition: Warning message:
In utils::unzip(file, exdir = path) : error 1 in extracting from zip file

Why the reference to a zip file? I have only .docx files in the directory.


Solution

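  • I was able to reproduce the problem. The cause is hidden or temporary .docx files in that folder, which the "*.docx" wildcard also matches. A .docx file is a zip archive, and as the error message shows, readtext() unzips each matched file to get at word/document.xml; that extraction fails for the stray files. Delete them and the code works.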

    To see the hidden files, open the folder you are reading the .docx files from and use your OS's option for showing hidden files. On my Mac I used

    CMD + SHIFT + .
    

    Once you have deleted them, run the code again and it should work:

    library(readtext)
    dat <- readtext("~/ownCloud/NLP/*.docx")
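If deleting the stray files is not practical (for example, because a document is still open in Word), an alternative is to build the file list yourself and pass it to readtext() instead of the wildcard. The following is only a minimal sketch. It assumes the offending files are either hidden dotfiles or Word lock files whose names start with "~$", which may not cover every case. list_real_docx() and demo_dir are hypothetical names introduced here for illustration and are not part of readtext.

```r
# Hypothetical helper (not part of readtext): list the .docx files in `dir`,
# skipping hidden dotfiles and files whose names start with "~$".
list_real_docx <- function(dir) {
  files <- list.files(path.expand(dir), pattern = "\\.docx$",
                      all.files = TRUE, full.names = TRUE, ignore.case = TRUE)
  # Drop any file whose base name starts with "." or "~$"
  keep <- !grepl("^(\\.|~\\$)", basename(files))
  files[keep]
}

# Demonstration on a throwaway directory containing empty placeholder files
demo_dir <- file.path(tempdir(), "docx_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir,
                      c("interview_1.docx", "~$terview_1.docx", ".hidden.docx")))
basename(list_real_docx(demo_dir))  # only "interview_1.docx" remains

# With real transcripts, pass the filtered paths to readtext():
# library(readtext)
# dat <- readtext(list_real_docx("~/ownCloud/NLP"))
```

Calling list.files() with all.files = TRUE and no filtering is also a quick way to see from within R which hidden files the wildcard is picking up.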