I use the example from [here](https://tutorials.quanteda.io/machine-learning/topicmodel/):
```r
require(quanteda)
require(quanteda.corpora)
require(lubridate)
require(topicmodels)

corp_news <- download('data_corpus_guardian')
```
```r
corp_news_subset <- corpus_subset(corp_news, year(date) >= 2016)

dfmat_news <- dfm(corp_news, remove_punct = TRUE, remove = stopwords('en')) %>%
  dfm_remove(c('*-time', '*-timeUpdated', 'GMT', 'BST')) %>%
  dfm_trim(min_termfreq = 0.95, termfreq_type = "quantile",
           max_docfreq = 0.1, docfreq_type = "prop")

dfmat_news
## Document-feature matrix of: 6,000 documents, 4,534 features (97.2% sparse).

str(corp_news)
## List of 4
##  $ documents:'data.frame': 6000 obs. of 10 variables:
##   ..$ texts : chr [1:6000] "London masterclass on climate change | Do you want to understand more about climate change? On 14 March the Gua"| __truncated__ "As colourful fish were swimming past him off the Greek coast, Cathal Redmond was convinced he had taken some gr"| __truncated__ "FTSE 100 | -101.35 | 6708.35 | FTSE All Share | -58.11 | 3608.55 | Early Dow Indl | -201.40 | 16120.31 | Early "| __truncated__ "Australia's education minister, Christopher Pyne, has vowed to find another university to host the Bjorn Lombor"| __truncated__ …
```
As we can see, the matrix is 97.2% sparse, and `str(corp_news)` shows that `corp_news$documents$texts` keeps every document as a separate element.
In my case I have a data frame (every row is a document):
```r
df <- data.frame(
  text = c(
    "Lorem Ipsum is simply dummy text of the printing and typesetting industry. <code> ste </code> Lorem Ipsum is simply dummy text of the printing and typesetting industry.",
    "<code> teft </code> Lorem Ipsum has been the industry's standard dummy text ever since the 1500s",
    "when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electro <code> type sth but you can't see it </code>"
  ),
  stringsAsFactors = FALSE
)
```
I use this to remove some noise:
```r
mytext  <- paste(unlist(df$text), collapse = " ")  # collapses all rows into one string
mytext2 <- gsub("<code>.+?</code>", "", mytext)

cleanFun <- function(htmlString) {
  return(gsub("<.*?>", "", htmlString))
}

mytext3 <- cleanFun(mytext2)
df2 <- gsub("\n", "", mytext3)
```
The problem, however, is that the rows are collapsed into a single string, so the resulting dfm has only one document and is 0.0% sparse:
```r
myDfm <- dfm(df2, remove_punct = TRUE, remove = stopwords('en'))
myDfm
## Document-feature matrix of: 1 document, 28 features (0.0% sparse).
```
How can I clean the text while keeping the one-row-per-document structure of `df`, so that the cleaned object yields one dfm document per row?
---

Not entirely sure what the question is, but if you want to clean the text in `df` row by row and then convert it to a corpus, this would be the way to go:
```r
df$text <- gsub("<.*?>", "", df$text)  # clean each row, keeping one element per document
corp <- corpus(df, text_field = "text")
dfmat <- dfm(corp, remove_punct = TRUE, remove = stopwords('en'))

dfmat
## Document-feature matrix of: 3 documents, 32 features (62.5% sparse).
```