
Removing whitespaces/blankspaces/newlines from scraped data


I scraped data from a URL using Beautiful Soup, but the cleaned text still contains many blank spaces, whitespace characters, and newlines. I tried the .strip() function to remove them, but they are still present.

Code

from bs4 import BeautifulSoup
import requests
import re
URL="https://www.flagstaffsymphony.org/event/a-flag-on-fourth/"
html_content = requests.get(URL).text
cleantext = BeautifulSoup(html_content, "lxml").text
cleanr = re.compile('<.*?>')
clean_data = re.sub(cleanr, ' ', cleantext)
text = re.sub('([^\x00-\x7F]+)|(\n)|(\t)',' ', clean_data)
with open('read.txt', 'w') as file:
    file.writelines(text)

Output

   America the Beautiful: A Virtual Patriotic Salute   Flagstaff Symphony Orchestra                                                                                           Contact             Hit enter to search or ESC to close                                     About  Our Team Our Conductor Orchestra Members   Concerts & Events  Season 72 Concerts Subscribe Venue, Parking & Concerts FAQs   Support The FSO  Donate to FSO Sponsor a Chair Funding and Impact   Videos Donate Subscription Tickets                  All Events   This event has passed. America the Beautiful: A Virtual Patriotic Salute  July 4, 2020         Violin Virtuoso Beethoven Virtual 5k             In place of our traditional 4th of July concert at the Pepsi Amphitheater, the Flagstaff Symphony Orchestra will present a virtual patriotic salute to be released HERE and our Facebook page at on July 4, 2020 at 11am. The FSO is proud to offer a special rendition of  America the Beautiful  performed by 60 of their professional musicians, coming together virtually, to celebrate our nation s independence. CLICK HERE FOR DETAILS   + Google Calendar+ iCal Export     Details    Date:    July 4, 2020   Event Category: Concerts and Events             Violin Virtuoso Beethoven Virtual 5k                   Concert InfoConcerts Concerts and Events FAQs     FSO InfoAbout FSO Mission and History Our Team Our Conductor Orchestra Members     Support FSOMake a Donation Underwriting a Concert Sponsor a Chair Advertise with FSO Volunteer Leave a Legacy Donor Bill of Rights Code of Ethical Standards  (Used by permission of the Association of Fundraising Professionals)     ResourcesCommunity & Education For Musicians For Board Members             2021 Flagstaff Symphony Orchestra. 
           Copyright 2019 Flagstaff Symphony Association                             About  Our Team Our Conductor Orchestra Members   Concerts & Events  Season 72 Concerts Subscribe Venue, Parking & Concerts FAQs   Support The FSO  Donate to FSO Sponsor a Chair Funding and Impact   Videos Donate Subscription Tickets   Contact  

In the above code I replaced non-ASCII characters with ' ' (a blank space). If I didn't replace them with a blank space, several words would be joined together. What I am trying to obtain is a single string with no unnecessary spaces or newlines.

Added Question

I tried every method I could find, such as strip() and re.sub(), to remove the space at the beginning of some lines in a text, but nothing works for the following data:

Subscription Tickets
 All Events
This event has passed.
America the Beautiful: A Virtual Patriotic Salute
July 4, 2020
 Violin Virtuoso
Beethoven Virtual 5k 

How can we remove those spaces?


Solution

  • It's not clear whether you want to retain some whitespace for readability. In case you do, you can try this approach:

    Update: Added code that retains only alphanumeric characters, plus the characters listed in exclude_list.

    Code:

    from bs4 import BeautifulSoup
    import requests
    
    
    def clean_scraped_text(raw_text):
    
        # strip whitespace from the start and end of the raw text
        stripped_text = raw_text.strip()
    
        processed_text = ''
        for i, char in enumerate(stripped_text):
            # add a single '\n' to processed_text for every sequence of '\n'
            if char == '\n':
                # i is never 0 here: after strip() the first character is not '\n'
                if stripped_text[i - 1] != '\n':
                    processed_text += '\n'
            else:
                # if character is not '\n' add it to processed_text
                processed_text += char
    
        # despite its name, exclude_list holds the non-alphanumeric characters to keep
        exclude_list = [' ', '\xa0', '-']
    
        # clean whitespace from each line in processed_text
        cleaned_text = ''
        for line in processed_text.splitlines():
            # only retain alphanumeric characters and listed characters
            line = ''.join(x for x in line if x.isalnum() or (x in exclude_list))
            cleaned_text += line.strip() + '\n'
    
        return cleaned_text
    
    URL="https://www.flagstaffsymphony.org/event/a-flag-on-fourth/"
    html_content = requests.get(URL).text
    text = BeautifulSoup(html_content, "lxml").text
    print(clean_scraped_text(text))
    

    Output:

    America the Beautiful A Virtual Patriotic Salute  Flagstaff Symphony Orchestra
    
    Contact
    Hit enter to search or ESC to close
    
    
    About
    Our Team
    Our Conductor
    Orchestra Members
    Concerts  Events
    Season 72 Concerts
    Subscribe
    Venue Parking  Concerts FAQs
    Support The FSO
    Donate to FSO
    Sponsor a Chair
    Funding and Impact
    Videos
    Donate
    Subscription Tickets
    All Events
    This event has passed
    America the Beautiful A Virtual Patriotic Salute
    July 4 2020
    Violin Virtuoso
    Beethoven Virtual 5k
    In place of our traditional 4th of July concert at the Pepsi Amphitheater the Flagstaff Symphony Orchestra will present a virtual patriotic salute to be released HERE and our Facebook page at on July 4 2020 at 11am The FSO is proud to offer a special rendition of America the Beautiful performed by 60 of their professional musicians coming together virtually to celebrate our nations independence
    CLICK HERE FOR DETAILS
    Google Calendar iCal Export
    Details
    Date
    July 4 2020
    Event Category Concerts and Events
    
    Violin Virtuoso
    Beethoven Virtual 5k
    
    Concert InfoConcerts
    Concerts and Events FAQs
    
    FSO InfoAbout FSO Mission and History
    Our Team
    Our Conductor
    Orchestra Members
    Support FSOMake a Donation
    Underwriting a Concert
    Sponsor a Chair
    Advertise with FSO
    Volunteer
    Leave a Legacy
    Donor Bill of Rights
    Code of Ethical Standards  Used by permission of the Association of Fundraising Professionals
    ResourcesCommunity  Education
    For Musicians
    For Board Members
    2021 Flagstaff Symphony Orchestra
    Copyright 2019 Flagstaff Symphony Association
    
    
    About
    Our Team
    Our Conductor
    Orchestra Members
    Concerts  Events
    Season 72 Concerts
    Subscribe
    Venue Parking  Concerts FAQs
    Support The FSO
    Donate to FSO
    Sponsor a Chair
    Funding and Impact
    Videos
    Donate
    Subscription Tickets
    Contact
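
If punctuation should be kept and only the whitespace cleaned up, a shorter regex-based variant may be enough. Note that str.strip() only trims the two ends of the string it is called on, which is likely why calling it once on the whole text left the spaces at the start of individual lines. The sketch below is an assumption about what the question is after, not part of the original answer: the helper names collapse_whitespace and strip_each_line and the sample string are made up for illustration.

```python
import re


def collapse_whitespace(text):
    """Return text as a single line with one space between words."""
    # For str patterns, \s matches spaces, tabs, newlines and '\xa0'
    return re.sub(r'\s+', ' ', text).strip()


def strip_each_line(text):
    """Strip every line separately and drop lines that end up empty."""
    stripped = (line.strip() for line in text.splitlines())
    return '\n'.join(line for line in stripped if line)


# Hypothetical sample that mimics the data in the added question
sample = "Subscription Tickets\n All Events\n\n\n\xa0Violin Virtuoso\t\nBeethoven Virtual 5k \n"

print(collapse_whitespace(sample))
print(strip_each_line(sample))
```

collapse_whitespace gives the single string asked for in the first part of the question, while strip_each_line keeps one item per line and removes the leading spaces. Beautiful Soup can also do part of this while extracting: get_text() accepts a separator and strip=True, which strips each text fragment before joining.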