I'm trying to load a large number of JSON files from a news website into a quanteda corpus using readtext. To simplify the process, the JSON files are all in the working directory, but I have also tried keeping them in their own directory.
Using c() to create a variable that explicitly defines a small subset of files, readtext works as hoped and a corpus is properly created with corpus().
Using list.files() to list all of the 1,500+ JSON files, readtext does not work as hoped: errors are returned and a corpus is not created.
I tried to inspect the results of the two methods of defining the set of texts (i.e. c() and list.files()) as well as paste0().
# Load libraries
library(readtext)
library(quanteda)
# Define a set of texts explicitly
a <- c("border_2020_05_10__1589150513.json","border_2020_05_10__1589143358.json","border_2020_05_07__1589170960.json")
# This produces a corpus
extracted_texts <- readtext(a, text_field = "maintext")
my_corpus <- corpus(extracted_texts)
# Define a set of all texts in working directory
b <- list.files(pattern = "*.json", full.names = F)
# This, which I hope to use, produces an error
extracted_texts <- readtext(b, text_field = "maintext")
my_corpus <- corpus(extracted_texts)
The error produced by extracted_texts <- readtext(b, text_field = "maintext") is as follows:
File doesn't contain a single valid JSON object.
Error: This JSON file format is not supported.
This is perplexing because the same files called with a do not produce an error. I validated several of the JSON files, and in every case the validator returned VALID (RFC 8259), the IETF standard for JSON.
Inspecting the differences between a and b:
typeof() returns "character" for both a and b.
is.vector() and is.atomic() return TRUE for both.
is.list() returns FALSE for both.
I'm really confused why a works and b does not.
Lastly, attempting to mimic exactly the procedure shown in the readtext documentation, I also tried the following:
# XXXX = my username
data_dir <- file.path("C:/Users/XXXX/Documents/R/")
d <- readtext(paste0(data_dir, "/corpus_linguistics/*.json"), text_field = "maintext")
This also returned the error
File doesn't contain a single valid JSON object.
Error: This JSON file format is not supported.
At this point I'm stumped. Thanks in advance for any insight on how to move forward.
Some of the files have an empty or null maintext field. These are not useful for analysis and should be removed. All of the files contain a JSON field called "title_rss" that is null. This can be eliminated through a directory-level find and replace with Notepad++, or probably with R or Python, though I still lack the skills for this. Additionally, the files were not in UTF-8 encoding; that was resolved with Codepage Converter.
The list.files() method is employed in the readtext How to Use documentation and several third-party tutorials. This method works with *.txt files, but for some reason it does not seem to work with these particular JSON files. Once the JSON files are properly cleaned and encoded, the method below works without errors. If the data_dir is wrapped in a list.files() function, it produces the following error:
Error in list_files(file, ignore_missing, TRUE, verbosity) : File '' does not exist.
I'm not sure why that is, but leaving it out works for these JSON files.
# Load libraries
library(readtext)
library(quanteda)
# Define the data directory and read all JSON files in it via a glob pattern
data_dir <- "C:/Users/Nathan/Documents/R/corpus_linguistics/"
extracted_texts <- readtext(paste0(data_dir, "texts_unmodified/*.json"), text_field = "maintext", verbosity = 3)
my_corpus <- corpus(extracted_texts)
Input: 5 files, 4 without an empty or null text_field and 1 with a null text_field. In addition, all of the files use Western European (Windows-1252) encoding.
Errors:
Reading texts from C:/Users/Nathan/Documents/R/corpus_linguistics/texts_unmodified/*.json, using glob pattern
... reading (json) file: C:/Users/Nathan/Documents/R/corpus_linguistics/texts_unmodified/border_2014_02_17__1589147645.json
File doesn't contain a single valid JSON object.
... reading (json) file: C:/Users/Nathan/Documents/R/corpus_linguistics/texts_unmodified/border_2014_03_13__1589150325.json
File doesn't contain a single valid JSON object.
Column 14 ['maintext'] of item 1 is length 0. This (and 0 others like it) has been filled with NA (NULL for list columns) to make each item uniform.
... read 5 documents.
Result: a properly formed corpus consisting of 5 documents. One document lacks either tokens or types. The corpus seems to build properly despite the errors. Perhaps some special characters don't display properly because of the encoding issue. I was not able to check this.
Input files: 4 files that have no empty or null JSON fields. In all cases, text_field contains text and the title_rss field was removed. Each of the files was converted from Western European (Windows-1252) to Unicode UTF-8 (code page 65001).
Errors: NONE!
Result: A properly formed corpus.
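The cleaning steps described above (dropping the null title_rss field and converting from Windows-1252 to UTF-8) could probably also be scripted in R rather than done with Notepad++ and Codepage Converter. The following is only a sketch: clean_json_file() is a hypothetical helper, it assumes each file holds a single JSON object, and it assumes the source encoding really is Windows-1252 (spelled "CP1252" for iconv()).

```r
library(jsonlite)

# Hypothetical helper: re-encode one JSON file to UTF-8 and drop unwanted fields.
# Assumes the file contains a single JSON object (not an array of objects).
clean_json_file <- function(infile, outfile,
                            drop_fields = "title_rss",
                            from_encoding = "CP1252") {
  # Read the raw bytes and convert them to a UTF-8 string
  txt <- rawToChar(readBin(infile, what = "raw", n = file.size(infile)))
  txt <- iconv(txt, from = from_encoding, to = "UTF-8")
  if (is.na(txt)) stop("Could not convert ", infile, " from ", from_encoding)

  # Parse to a plain list and remove the unwanted top-level fields
  obj <- fromJSON(txt, simplifyVector = FALSE)
  obj[names(obj) %in% drop_fields] <- NULL

  # Write the cleaned object back out as UTF-8 JSON
  json <- toJSON(obj, auto_unbox = TRUE, null = "null", digits = NA)
  writeLines(enc2utf8(as.character(json)), outfile, useBytes = TRUE)
  invisible(outfile)
}

# Demonstration on a made-up file containing "café" encoded as Windows-1252
infile <- tempfile(fileext = ".json")
outfile <- tempfile(fileext = ".json")
writeBin(c(charToRaw('{"maintext": "caf'), as.raw(0xE9),
  charToRaw('", "title_rss": null}')), infile)

clean_json_file(infile, outfile)
cleaned <- fromJSON(outfile)
print(names(cleaned))
```

To process a whole directory, the helper could be called in a loop over list.files(data_dir, pattern = "\\.json$", full.names = TRUE), writing to a separate output directory so that the original files are left untouched.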
Many thanks to the two developers for detailed feedback and useful leads. The assistance is deeply appreciated.
There are a few possibilities here, but the most likely are:
One of your files has a malformed JSON structure, from the point of view of readtext(). Even though this might be OK from a strictly JSON standpoint, if one of your text fields is empty, for instance, then this will cause the error. (See below for a demonstration and a solution.)
While readtext() can take a "glob" pattern match, list.files() takes a regular expression. It's possible (but unlikely) that list.files(pattern = "*.json", ...) is picking up something you don't want. But this should not be necessary with readtext(); see below.
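The glob-versus-regex difference in the second point can be checked in base R. This is a minimal sketch with made-up file names (hypothetical, not from the question); utils::glob2rx() translates a glob into the equivalent anchored regular expression.

```r
# Hypothetical file names, for illustration only
files <- c("border_1.json", "notes_json.txt", "border_2.json")

# As a regular expression, "." matches any character and nothing is anchored,
# so ".json" also matches "_json" inside "notes_json.txt"
loose <- grepl(".json", files)

# utils::glob2rx() converts the glob "*.json" into an anchored regex
strict_pattern <- utils::glob2rx("*.json")
strict <- grepl(strict_pattern, files)

print(strict_pattern)
print(files[loose])
print(files[strict])
```

With list.files(), a pattern such as "\\.json$" (or the output of glob2rx("*.json")) restricts the listing to names that actually end in .json.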
To demonstrate, let's write out each document in data_corpus_inaugural as a separate JSON file, and then read them in using readtext().
library("quanteda", warn.conflicts = FALSE)
## Package version: 2.0.1
## Parallel computing: 2 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
tmpdir <- tempdir()
corpdf <- convert(data_corpus_inaugural, to = "data.frame")
for (d in corpdf$doc_id) {
  cat(jsonlite::toJSON(dplyr::filter(corpdf, doc_id == d)),
    file = paste0(tmpdir, "/", d, ".json")
  )
}
head(list.files(tmpdir))
## [1] "1789-Washington.json" "1793-Washington.json" "1797-Adams.json"
## [4] "1801-Jefferson.json" "1805-Jefferson.json" "1809-Madison.json"
To read them in, you can use the "glob" pattern match here and just read the JSON files.
rt <- readtext::readtext(paste0(tmpdir, "/*.json"),
text_field = "text", docid_field = "doc_id"
)
summary(corpus(rt), n = 5)
## Corpus consisting of 58 documents, showing 5 documents:
##
##                  Text Types Tokens Sentences Year  President FirstName
##  1789-Washington.json   625   1537        23 1789 Washington    George
##  1793-Washington.json    96    147         4 1793 Washington    George
##       1797-Adams.json   826   2577        37 1797      Adams      John
##   1801-Jefferson.json   717   1923        41 1801  Jefferson    Thomas
##   1805-Jefferson.json   804   2380        45 1805  Jefferson    Thomas
##                  Party
##                   none
##                   none
##             Federalist
##  Democratic-Republican
##  Democratic-Republican
So that all worked fine.
But if we add to this one file whose text field is empty, then this produces the error in question:
cat('[ { "doc_id" : "d1", "text" : "this is a file" },
{ "doc_id" : "d2", "text" : } ]',
file = paste0(tmpdir, "/badfile.json")
)
rt <- readtext::readtext(paste0(tmpdir, "/*.json"),
text_field = "text", docid_field = "doc_id"
)
## File doesn't contain a single valid JSON object.
## Error: This JSON file format is not supported.
True, that was not a valid JSON file, since it contained a tag with no value. But I suspect you have something like that in one of your files.
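The difference between a key with no value and a key with an explicit null or empty string can be checked with jsonlite::validate(), which tests whether a string is syntactically valid JSON. This is a small sketch with made-up strings; it only checks JSON syntax and says nothing about how readtext() then treats a null text field.

```r
library(jsonlite)

# A key with no value at all: not valid JSON
missing_value <- '{ "doc_id" : "d2", "text" : }'
# An explicit null or an empty string: both are valid JSON
null_value <- '{ "doc_id" : "d2", "text" : null }'
empty_string <- '{ "doc_id" : "d2", "text" : "" }'

results <- c(
  missing_value = isTRUE(validate(missing_value)),
  null_value = isTRUE(validate(null_value)),
  empty_string = isTRUE(validate(empty_string))
)
print(results)
```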
Here's how you can identify the problem: loop through your b (from the question, not as I've specified it below).
b <- tail(list.files(tmpdir, pattern = ".*\\.json", full.names = TRUE))
for (f in b) {
  cat("Reading:", f, "\n")
  rt <- readtext::readtext(f, text_field = "text", docid_field = "doc_id")
}
## Reading: /var/folders/92/64fddl_57nddq_wwqpjnglwn48rjsn/T//RtmpuhmGRK/2001-Bush.json
## Reading: /var/folders/92/64fddl_57nddq_wwqpjnglwn48rjsn/T//RtmpuhmGRK/2005-Bush.json
## Reading: /var/folders/92/64fddl_57nddq_wwqpjnglwn48rjsn/T//RtmpuhmGRK/2009-Obama.json
## Reading: /var/folders/92/64fddl_57nddq_wwqpjnglwn48rjsn/T//RtmpuhmGRK/2013-Obama.json
## Reading: /var/folders/92/64fddl_57nddq_wwqpjnglwn48rjsn/T//RtmpuhmGRK/2017-Trump.json
## Reading: /var/folders/92/64fddl_57nddq_wwqpjnglwn48rjsn/T//RtmpuhmGRK/badfile.json
## File doesn't contain a single valid JSON object.
## Error: This JSON file format is not supported.
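Because an error inside a for loop stops execution, the loop above halts at the first bad file. A variation, sketched here under some assumptions, wraps each read in tryCatch() so that every failing file is collected in one pass. The helper name find_unreadable() is hypothetical, and the default reader is jsonlite::fromJSON() only to keep the sketch self-contained; a readtext() call can be passed in instead.

```r
library(jsonlite)

# Hypothetical helper: returns the files for which `reader` throws an error
find_unreadable <- function(files, reader = jsonlite::fromJSON) {
  failed <- vapply(files, function(f) {
    tryCatch({
      reader(f)
      FALSE
    }, error = function(e) TRUE)
  }, logical(1))
  files[failed]
}

# Small self-contained demonstration with one good and one bad file
demo_dir <- file.path(tempdir(), "json_check_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines('{ "doc_id" : "d1", "text" : "this is a file" }',
  file.path(demo_dir, "good.json"))
writeLines('{ "doc_id" : "d2", "text" : }',
  file.path(demo_dir, "bad.json"))

demo_files <- list.files(demo_dir, pattern = "\\.json$", full.names = TRUE)
bad_files <- find_unreadable(demo_files)
print(basename(bad_files))
```

To test with readtext() itself, a call along the lines of find_unreadable(b, reader = function(f) readtext::readtext(f, text_field = "maintext")) should work, where b is the file vector from the question.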