Adding extra lines between paragraphs in txt files


I have over 5,000 txt files of news articles that look just like the one below. I am trying to build a corpus in which every paragraph of every txt file in a folder becomes its own document. The corpus_reshape command in the quanteda R package can create such a corpus, with paragraphs instead of full articles as the documents. However, it does not recognize the paragraphs in the body of the article, which are separated by a single "enter"; it looks for larger gaps between blocks of text to decide where one paragraph ends and the next begins. In other words, from the text file below, the command creates only four documents: the first starting with "Paying for the Papal Visit", the second with "Copyright 1979 The Washington Post", the third with "NO ONE KNOWS", and the last with "End of Document". But the body of the article (between "Body" and "End of Document") actually consists of four paragraphs that corpus_reshape could not identify.

So, I need to somehow go back through all 5,000+ txt files and increase the number of empty lines between paragraphs in the body of the text, so that, when I create the corpus, it can accurately parse out all paragraphs. I would greatly appreciate any help. Thank you!

Update: Below I have added both a link to a downloadable copy of the txt file in the example and the text as copied from the txt file.

Downloadable link:

https://drive.google.com/file/d/1SO5S5XgNlc4H-Ms8IEvK_ogZ57qSoTkw/view?usp=sharing

Pasted Text:

Paying for the Papal Visit
The Washington Post
September 16, 1979, Sunday, Final Edition

Copyright 1979 The Washington Post
Section: Outlook; Editorial; B6
Length: 403 words
Body

NO ONE KNOWS how much Pope John Paul II's week-long U.S. visit will end up costing -- or even how to calculate the cost. But already who picks up the tab has become a subject of considerable unnecessary controversy in three cities. Some religious and civil-liberties groups in Philadelphia and Boston are challenging -- or nit-picking -- proposals by governments in these cities to spend public money on facilities connected with outdoor papal masses; and in New York, local and Roman Catholic officials have been locked in negotiations over who will pay for what.
But by and large, here in Washington and in Chicago and Des Moines, these details are being handled as they should be: without making a separation-of-church-and-state issue out of the logistics. Spending by a city for the smoothest and safest handling of a major event is a legitimate secular municipal function. Even a spokesman for Americans United for Separation of Church and State has agreed that there is nothing wrong with using public money for cleanup, police overtime, police protection and traffic control.
Playing host to world figures and huge turnouts is indeed expensive, as District officials regularly remind us when they are haggling with Congress for more federal help with the bills. Still, here in the capital, whether it is the pope, angry American farmers, anti-war demonstrators or civil-rights marchers, public spending for special services is considered normal and essential. Much of the hair-splitting in other cities over the papal-visit expenses has to do with whether public money should pay for platforms from which the pope will celebrate outdoor masses. That's a long reach for a constitutional controversy, and not worth it.
Far better is the kind of cooperation that separate church and state groups here are demonstrating in their planning. For example, there will be a chainlink fence surrounding the stage and altar from which the pope will say the mass on the Mall and extending to other nearby areas. The police recommended the fence, estimated to cost about $25,000, the church has agreed to pay for the portion around the stage and altar. To help clean up, the church plans to produce hundreds of Scouts on the Monday holiday for volunteer duty. This approach to the visit is a lot more sensible -- and helpful to all taxpayers -- than a drawn-out argument and threats of legal action.

End of Document


Solution

  • This will add 3 empty lines at the end of those paragraphs. The logic is to add the extra lines whenever a line is at least 50 characters long. You may want to modify that threshold. It was chosen because the longest line in the "paragraphs" you were happy with was well under 50 characters (41 in the nchar output below).

    txt <- readLines("/home/david/Documents/R_code/misc files/WP_1979.9.16.txt")
    spread <- ifelse( nchar(txt) < 50, 
                      paste0(txt, "\n") , # these lines are left alone
                      paste0( txt, "\n\n\n\n") ) # longer lines are padded
    cat(spread, file="/home/david/Documents/R_code/misc files/spread.txt" )
    

    readLines strips the line terminator from every line it reads, and cat does not add one back, which is why the code above pastes "\n" onto each line before writing. Some of the "lines" in the input text were just empty:

    nchar(txt)
     [1]   0   0  26  19  41   0   0  34  31  17   4   0   0 566 519 643 672   0   0  15
    

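    Now the same operation on spread.txt yields a different "picture". Every line after the first is one character longer, and that is not caused by the "\n" padding: cat separates its arguments with a single space by default, so each element after the first is written with a leading space. Passing sep="" to cat avoids this. The runs of three zeros after each long line are the added empty lines. I think the corpus processing machinery will not mind the extra space: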

    nchar( readLines("/home/david/Documents/R_code/misc files/spread.txt" ))
    #------------
     [1]   0   1  27  20  42   1   1  35  32  18   5   1   1 567   0   0   0 520   0   0   0 644   0   0   0 673   0
    [28]   0   0   1   1  16
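
To apply the same idea to a whole folder, the approach above could be wrapped in a function, roughly as sketched below. This is only a sketch under assumptions: the names pad_long_lines and pad_folder, the threshold and extra arguments, and the demo folders under tempdir() are all hypothetical and do not come from the question. It uses writeLines instead of cat, so no "\n" has to be pasted on by hand and no leading spaces appear, and it writes to a separate output folder so the original files stay untouched.

```r
# Hypothetical helper: copy `infile` to `outfile`, adding `extra` empty lines
# after every line that is at least `threshold` characters long.
pad_long_lines <- function(infile, outfile, threshold = 50, extra = 3) {
  txt <- readLines(infile, warn = FALSE)
  is_long <- nchar(txt) >= threshold
  padded <- as.character(unlist(lapply(seq_along(txt), function(i) {
    if (is_long[i]) c(txt[i], rep("", extra)) else txt[i]
  })))
  writeLines(padded, outfile)  # writeLines adds the line terminators itself
  invisible(padded)
}

# Hypothetical helper: run pad_long_lines on every .txt file in `in_dir`,
# writing the padded copies to `out_dir` under the same file names.
pad_folder <- function(in_dir, out_dir, ...) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  files <- list.files(in_dir, pattern = "\\.txt$", full.names = TRUE)
  for (f in files) {
    pad_long_lines(f, file.path(out_dir, basename(f)), ...)
  }
  invisible(files)
}

# Small demo on a made-up file in a temporary folder (assumed paths).
in_dir  <- file.path(tempdir(), "articles_in")
out_dir <- file.path(tempdir(), "articles_out")
dir.create(in_dir, showWarnings = FALSE)
writeLines(c("Title", "Body", strrep("a", 60), strrep("b", 70), "",
             "End of Document"),
           file.path(in_dir, "demo.txt"))

pad_folder(in_dir, out_dir)
result <- nchar(readLines(file.path(out_dir, "demo.txt")))
print(result)
```

In the demo, the two long lines each end up followed by three empty lines while the short lines are copied unchanged. As in the original answer, the length threshold is a heuristic: a short body paragraph (under 50 characters) would not be padded, so it may be worth checking nchar on a few real files first.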