Tags: python, csv, web-scraping, beautifulsoup, data-cleaning

Iterating through multiple html files and converting to csv


I have 32 separate HTML files, each containing data in a table-like format with 8 columns. Each file covers one species of fungus.

I need to convert the 32 HTML files into 32 CSV files. I have a script that works for a single file, but I can't figure out how to do this systematically for all 32 files with a few commands, instead of running the same command 32 times.

Here is the script I am using in an attempt to make it loop through all 32 files:

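import os

from bs4 import BeautifulSoup

directory = r'../html/species'
data = []  # shared by every file, so rows from all species end up in one flat list
for filename in os.listdir(directory):
    if filename.endswith('.html'):
        fname = os.path.join(directory, filename)
        with open(fname, 'r') as f:
            soup = BeautifulSoup(f.read(), 'html.parser')
        # Rows of the first table, skipping the header row
        HTML_data = soup.find_all("table")[0].find_all("tr")[1:]
        for element in HTML_data:
            sub_data = []
            for sub_element in element:
                try:
                    sub_data.append(sub_element.get_text())
                except AttributeError:
                    # Skip children that have no get_text()
                    continue
            data.append(sub_data)
data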

Here is some output data from the script above simplified for replication purposes:

[['\n\t\t\t\t\t\tAfrica\n\t\t\t\t\t'],
 ['Kenya',
  'Present',
  '',
  'Introduced',
  '',
  '',
  'Shomari (1996); Ohler (1979); Mniu (1998); Nayar (1998)',
  ''],
 ['Malawi',
  'Present',
  '',
  '',
  '',
  '',
  'Malawi, Ministry of Agriculture (1990)',
  ''],
 ['Mozambique',
  'Present',
  '',
  'Introduced',
  '',
  '',
  'Ohler (1979); Shomari (1996); Mniu (1998); CABI (Undated)',
  ''],
 ['Nigeria',
  'Present',
  '',
  'Introduced',
  '',
  '',
  'Ohler (1979); Shomari (1996); Nayar (1998); CABI (Undated)',
  ''],
 ['South Africa', 'Present', '', '', '', '', 'Swart (2004)', ''],
 ['Tanzania',
  'Present',
  '',
  '',
  '',
  '',
  'Casulli (1979); Martin et al. (1997)',
  ''],
 ['Zambia',
  'Present',
  '',
  'Introduced',
  '',
  '',
  'Ohler (1979); Shomari (1996); Mniu (1998); Nayar (1998)',
  ''],
 ['\n\t\t\t\t\t\tAsia\n\t\t\t\t\t'],
 ['India', 'Present', '', 'Introduced', '', '', 'Intini (1987)', ''],
 ['\n\t\t\t\t\t\tSouth America\n\t\t\t\t\t'],
 ['Brazil', 'Present', '', '', '', '', 'Ponte (1986)', ''],
 ['-Sao Paulo',
  'Present',
  '',
  'Native',
  '',
  '',
  'Waller et al. (1992); Shomari (1996)',
  ''],
 ['\n\t\t\t\t\t\tAfrica\n\t\t\t\t\t'],
 ['Egypt',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['Ethiopia',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['Libya',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['Malawi',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['Morocco',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['Mozambique',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['South Africa',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['Sudan',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['Tanzania',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['Tunisia', 'Present', '', '', '', '', 'Djébali et al. (2009)', ''],
 ['Uganda',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['\n\t\t\t\t\t\tAsia\n\t\t\t\t\t'],
 ['Afghanistan',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['Armenia',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['Azerbaijan',
  'Present',
  '',
  '',
  '',
  '',
  'Amano (1986); Braun (1995); Shin HyeonDong (2000); CABI and EPPO (2010)',
  ''],
 ['Bhutan', 'Present', '', '', '', '', 'CABI and EPPO (2010)', '']]

What I think I need is for each species' rows to be grouped in their own list, like this: [[info_species1],[info_species1],[info_species1]], [[info_species2],[info_species2],[info_species2]]. In other words, in my output I need:

['-Sao Paulo',
  'Present',
  '',
  'Native',
  '',
  '',
  'Waller et al. (1992); Shomari (1996)',
  '']], # AN EXTRA SQUARE BRACKET RIGHT HERE
 ['\n\t\t\t\t\t\tAfrica\n\t\t\t\t\t'],
 ['Egypt',
  'Present',

Solution

  • Have you considered just reading in the table tags with pandas?

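    import os
    from io import StringIO
    
    import pandas as pd
    
    directory = r'../html/species'
    
    for filename in os.listdir(directory):
        if filename.endswith('.html'):
            # Name each CSV after its source file; it is written to the current working directory
            csv_filename = filename.replace('.html', '.csv')
            fname = os.path.join(directory, filename)
            with open(fname, 'r') as f:
                # read_html returns a list of DataFrames, one per <table>; take the first
                table = pd.read_html(StringIO(f.read()))[0]
            table.to_csv(csv_filename, index=False)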
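
If you would rather keep your BeautifulSoup parsing, the same per-file pattern applies: build the row list inside the loop so each file gets its own, and write it out before moving on to the next file. Below is a rough sketch using only the standard library. `extract_rows` is a hypothetical callable standing in for your parsing code, and `convert_all` and `tag_rows_with_region` are names made up for illustration. It also assumes, going by your sample output, that a row with a single cell is a region header (Africa, Asia, and so on) and folds it into a leading column rather than leaving it as a stray row.

```python
import csv
import os


def tag_rows_with_region(rows):
    """Prefix each data row with the most recent region header.

    Assumption (from the sample output): a region header is a row with a
    single whitespace-padded cell; data rows have several cells.
    """
    tagged = []
    region = ""
    for row in rows:
        if len(row) == 1:
            region = row[0].strip()
            continue
        tagged.append([region] + [cell.strip() for cell in row])
    return tagged


def convert_all(directory, extract_rows, out_dir="."):
    """Write one CSV per .html file found in `directory`.

    `extract_rows` is a hypothetical callable (path -> list of row lists)
    standing in for the BeautifulSoup parsing shown in the question.
    Returns a dict mapping each source filename to its cleaned rows.
    """
    os.makedirs(out_dir, exist_ok=True)
    per_file = {}
    for filename in sorted(os.listdir(directory)):
        if not filename.endswith(".html"):
            continue
        # A fresh list per file keeps each species' rows separate.
        rows = tag_rows_with_region(extract_rows(os.path.join(directory, filename)))
        per_file[filename] = rows
        csv_name = os.path.splitext(filename)[0] + ".csv"
        out_path = os.path.join(out_dir, csv_name)
        with open(out_path, "w", newline="", encoding="utf-8") as out:
            csv.writer(out).writerows(rows)
    return per_file
```

You would call it as something like `convert_all('../html/species', my_parser, out_dir='csv')`, where `my_parser` is whatever function wraps the `find_all("tr")` loop from the question. The returned dict gives you the nested structure you asked about, one list of rows per species.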