
Removing rows with a specific word in Corpus


I have a Corpus with multiple texts (news articles) scraped from the internet.

Some of the texts contain the description of the photo that is used in the article. I want to remove that.

I found an existing thread on this topic, but it did not help me. See link: Removing rows from Corpus with multiple documents

I want to remove every row that contains the words "PHOTO FILE" (in caps). This solution was posted:

require(tm)
corp <- VCorpus(VectorSource(txt))
textVector <- sapply(corp, as.character)
for(j in seq(textVector)) {
newCorp<-textVector
newCorp[[j]] <- textVector[[j]][-grep("PHOTO",    textVector[[j]], ignore.case = FALSE)]
}

This does not seem to work for me though. The code runs but nothing is removed.

What does work is this:

require(tm)
corp <- VCorpus(VectorSource(txt))
textVector <- sapply(corp, as.character)
newCorp <- VCorpus(VectorSource(textVector[-grep("PHOTO", textVector, 
                                              ignore.case = FALSE)]))

But that removes every file that contains the word and I do not want that.

I would greatly appreciate it if someone could help me with this.

Addition:

Here is an example of one of the texts:

[1] "Top News | Wed Apr 19, 2017 | 3:53pm BST\nFILE PHOTO: People walk accross a plaza in the Canary Wharf financial district, London, Britain, January 9, 2017. REUTERS/Dylan Martinez/File Photo\nLONDON Britain's current account deficit, one of the weak points of its economy, was bigger than previously thought in the years up to 2012, according to new estimates from the Office for National Statistics on Wednesday.\nThe figures showed British companies had paid out more interest to foreign holders of corporate bonds than initially estimated, resulting in a larger current account deficit.\nThe deficit, one of the biggest among advanced economies, has been in the spotlight since June's Brexit vote.\nBank of England Governor Mark Carney said in the run-up to the referendum that Britain was reliant on the \"kindness of strangers\", highlighting how the country needed tens of billions of pounds of foreign finance a year to balance its books.\nThe ONS said the current account deficit for 2012 now stood at 4.4 percent of gross domestic product, compared with 3.7 percent in its previous estimate.\nThe ONS revised up the deficit for every year dating back to 1998 by an average of 0.6 percentage points. The biggest revisions occurred from 2005 onwards.\nLast month the ONS said Britain's current account deficit tumbled to 2.4 percent of GDP in the final three months of 2016, less than half its reading of 5.3 percent in the third quarter.\nRevised data for 2012 onward is due on Sept. 29, and it is unclear if Wednesday's changes point to significant further upward revisions, as British corporate bond yields have declined markedly since 2012 and touched a new low in mid-2016. .MERUR00\nThe ONS also revised up its earlier estimates of how much Britons saved. 
The household savings ratio for 2012 rose to 9.8 percent from 8.3 percent previously, with a similar upward revision for 2011.\nThe ratio for Q4 2016, which has not yet been revised, stood at its lowest since 1963 at 3.3 percent.\nThe ONS said the changes reflected changes to the treatment of self-employed people paying themselves dividends from their own companies, as well as separating out the accounts of charities, which had previously been included with households.\nMore recent years may produce similarly large revisions to the savings ratio. Around 40 percent of the roughly 2.2 million new jobs generated since the beginning of 2008 fell into the self-employed category.\n"

So I want to delete the sentence (row) that starts with FILE PHOTO.


Solution

  • Let's say that initially the text is contained in the file input.txt. The raw file is as follows:

    THis is a text that contains a lot
    of information
    and PHOTO FILE.
    Great!
    
    
    my_text <- readLines("input.txt")  # one vector element per line of the file
    
    [1] "THis is a text that contains a lot" "of information"                     "and PHOTO FILE."                    "Great!"                            
    

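    If you drop the element that contains the phrase

    # grepl() returns a logical vector, so nothing is lost when there is no match
    # (x[-grep(...)] would return an empty vector in that case)
    my_text[!grepl("PHOTO FILE", my_text, fixed = TRUE)]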
    

    you end up with

    [1] "THis is a text that contains a lot" "of information"                     "Great!"
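
    The same idea can be applied to the texts in the question, where each document is a single string with the lines separated by "\n". The following is only a sketch under assumptions: `drop_lines` is a hypothetical helper (not part of tm), the sample `txt` is a shortened stand-in for the scraped articles, and the phrase is matched case-sensitively as the literal "FILE PHOTO". It splits every document into lines, drops the lines that contain the phrase, and joins the rest back together, so the document itself is kept.

```r
# Hypothetical helper (not part of tm): for each document, remove every
# line that contains `pattern` and glue the remaining lines back together.
drop_lines <- function(texts, pattern) {
  vapply(texts, function(doc) {
    # One element per line of the document
    lines <- strsplit(doc, "\n", fixed = TRUE)[[1]]
    # Literal, case-sensitive match; TRUE for the lines we want to keep
    keep <- !grepl(pattern, lines, fixed = TRUE)
    paste(lines[keep], collapse = "\n")
  }, character(1), USE.NAMES = FALSE)
}

# Shortened stand-in for the scraped articles (assumption, not real data)
txt <- c(
  "Top News\nFILE PHOTO: People walk across a plaza.\nLONDON Britain's deficit grew.",
  "No caption here.\nSecond line."
)

cleaned <- drop_lines(txt, "FILE PHOTO")
print(cleaned)
```

    The cleaned character vector can then be turned back into a corpus the same way as in the question, for example with VCorpus(VectorSource(cleaned)).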