
Read CSV file with BeautifulSoup


After scraping some data from a website, I saved the raw HTML to a file, because I couldn't work out how to use find_all to get the text out of a list of lists.

Now I have the data, but I can't extract the text because bs4 doesn't accept a list as input. Here's the code I use to open the file:

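from csv import reader

# newline='' is what the csv module recommends when opening CSV files
with open('/my_file.csv', 'r', newline='') as read_obj:
    csv_reader = reader(read_obj)
    list_of_rows = list(csv_reader)
    print(list_of_rows)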

This is the list format:

[['', '0', '1', '2', '3'], ['0','<span class="item">Red <small>col.</small></span>',
  '<span class="item">120 <small>cc.</small></span>',
  '<span class="item">Available <small>in four days</small></span>',
  '<span class="item"><small class="txt-highlight-red">15 min</small></span>'],
 ['1', '<span class="item">Blue <small>col.</small></span>',
  '<span class="item">200 <small>cc.</small></span>',
  '<span class="item">Available <small>in a week</small></span>',
  '<span class="item">04 mar <small></small></span>'],
 ['0', '<span class="item">Green <small>col.</small></span>',
  '<span class="item">Available <small>immediately</small></span>',
  '<span class="item"><small class="txt-highlight-red">2 hours</small></span>']]

Is there a way to read a CSV file with BeautifulSoup and then parse it?

The aim is to keep only the text, removing everything between '<' and '>' (the angle brackets included).


Solution

  • You can write a function that passes each cell through BeautifulSoup and returns its text. If a cell has no tags or content to parse, it is left as is.

    Also, I'd rather just use pandas to read in that CSV.

    import pandas as pd
    from bs4 import BeautifulSoup
    
    df = pd.read_csv('/my_file.csv')
    
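    def foo_bar(x):
        # Only strings can be parsed; leave None/NaN and numbers untouched.
        if not isinstance(x, str):
            return x
        return BeautifulSoup(x, 'lxml').text
    
    print('Parsing html in table...')
    # On pandas >= 2.1, DataFrame.map is the non-deprecated name for applymap.
    df = df.applymap(foo_bar)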
    

    Example input:

    df = pd.DataFrame([['0','<span class="item">Red <small>col.</small></span>',
      '<span class="item">120 <small>cc.</small></span>',
      '<span class="item">Available <small>in four days</small></span>',
      '<span class="item"><small class="txt-highlight-red">15 min</small></span>'],
     ['1', '<span class="item">Blue <small>col.</small></span>',
      '<span class="item">200 <small>cc.</small></span>',
      '<span class="item">Available <small>in a week</small></span>',
      '<span class="item">04 mar <small></small></span>'],
     ['0', '<span class="item">Green <small>col.</small></span>',
      '<span class="item">Available <small>immediately</small></span>',
      '<span class="item"><small class="txt-highlight-red">2 hours</small></span>']], columns = ['', '0', '1', '2', '3'])
    

    Original table:

    print (df.to_string())
                                                          0                                                  1                                                  2                                                  3
    0  0  <span class="item">Red <small>col.</small></span>   <span class="item">120 <small>cc.</small></span>  <span class="item">Available <small>in four da...  <span class="item"><small class="txt-highlight...
    1  1  <span class="item">Blue <small>col.</small></s...   <span class="item">200 <small>cc.</small></span>  <span class="item">Available <small>in a week<...   <span class="item">04 mar <small></small></span>
    2  0  <span class="item">Green <small>col.</small></...  <span class="item">Available <small>immediatel...  <span class="item"><small class="txt-highlight...                                               None
    

    Output:

    print (df.to_string())
                   0                      1                       2        3
    0  0    Red col.                120 cc.  Available in four days   15 min
    1  1   Blue col.                200 cc.     Available in a week  04 mar 
    2  0  Green col.  Available immediately                 2 hours     None
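
If you would rather stay with the csv module from the question instead of pandas, the same idea can be applied to the list of lists directly. The following is only a minimal sketch: the helper names strip_tags and clean_rows are made up for illustration, the in-memory sample stands in for '/my_file.csv', and it assumes the built-in 'html.parser' backend is acceptable in place of 'lxml'.

```python
import csv
import io

from bs4 import BeautifulSoup


def strip_tags(cell):
    """Return the visible text of an HTML fragment; pass non-strings through."""
    if not isinstance(cell, str):
        return cell
    # 'html.parser' ships with Python, so no extra parser has to be installed.
    return BeautifulSoup(cell, "html.parser").get_text()


def clean_rows(rows):
    """Apply strip_tags to every cell of an iterable of rows."""
    return [[strip_tags(cell) for cell in row] for row in rows]


# Hypothetical stand-in for open('/my_file.csv', newline=''):
# a small in-memory CSV shaped like the data in the question.
sample = io.StringIO(
    ',0,1\n'
    '0,"<span class=""item"">Red <small>col.</small></span>",'
    '"<span class=""item"">120 <small>cc.</small></span>"\n'
)

rows = clean_rows(csv.reader(sample))
print(rows)
```

BeautifulSoup parses one string at a time, so the list comprehension hands it each cell individually rather than the whole list. Rows of different lengths, like the last one in the question, need no special handling here.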