Search code examples
pythonmysqlcsvbeautifulsoupexport-to-csv

Python splitting strings and convert them to a list that notices empty fields


it took me the whole day trying to fix this problem but I didn't found a solution so I hope you can help me. I already tried to extract the data from the website. But the problem is that I don't know how to split the list so that 500g converts to 500,g. The problem is that on the website sometimes the quantity is 1 and sometimes 1 1/2 kg or sth. And now I need to convert it into a CSV file and then into a MySQL database. What I want at the end is a CSV file with the columns: ingredients ID, ingredients, quantity, and the unit of the quantity from the ingredient. So for example: 0, meat, 500, g. This is the code I already have to extract the data from this website:

import re
from bs4 import BeautifulSoup
import requests
import csv

urls_recipes = ['https://www.chefkoch.de/rezepte/3039711456645264/Ossobuco-a-la-Milanese.html']
mainurl = "https://www.chefkoch.de/rs/s0e1n1z1b0i1d1,2,3/Rezepte.html"
urls_urls = []
urls_recipes = ['https://www.chefkoch.de/rezepte/3039711456645264/Ossobuco-a-la-Milanese.html']
ingredients = []
menge = []

def read_recipes():
    for url, id2 in zip(urls_recipes, range(len(urls_recipes))):
        soup2 = BeautifulSoup(requests.get(url).content, "lxml")
        for ingredient in soup2.select('.td-left'):
            menge.append([*[re.sub(r'\s{2,}', ' ', ingredient.get_text(strip=True))]])
        for ingredient in soup2.select('.recipe-ingredients h3, .td-right'):
            if ingredient.name == 'h3':
                ingredients.append([id2, *[ingredient.get_text(strip=True)]])
            else:
                ingredients.append([id2, *[re.sub(r'\s{2,}', ' ', ingredient.get_text(strip=True))]])

        read_recipes()

I hope you can help me Thank You!


Solution

  • It appears that the strings containing fractions use the unicode symbols for 1/2 etc., so I think a good way of starting is replacing those by looking up the specific code and passing it to str.replace(). Splitting up the units and the amount for this example was easy, since they are separated by a space. But it might be necessary to generalize this more if you encounter other combinations. The following code works for this specific example:

    import re
    from bs4 import BeautifulSoup
    import requests
    import csv
    import pandas as pd
    
    urls_recipes = ['https://www.chefkoch.de/rezepte/3039711456645264/Ossobuco-a-la-Milanese.html']
    mainurl = "https://www.chefkoch.de/rs/s0e1n1z1b0i1d1,2,3/Rezepte.html"
    urls_urls = []
    urls_recipes = ['https://www.chefkoch.de/rezepte/3039711456645264/Ossobuco-a-la-Milanese.html']
    ingredients = []
    menge = []
    einheit = []
    
    
    for url, id2 in zip(urls_recipes, range(len(urls_recipes))):
        soup2 = BeautifulSoup(requests.get(url).content)
        for ingredient in soup2.select('.td-left'):
            # get rid of multiple spaces and replace 1/2 unicode character
            raw_string = re.sub(r'\s{2,}', ' ', ingredient.get_text(strip=True)).replace(u'\u00BD', "0.5")
            # split into unit and number
            splitlist = raw_string.split(" ")
            menge.append(splitlist[0])
            if len(splitlist) == 2:
                einheit.append(splitlist[1])
            else:
                einheit.append('')
        for ingredient in soup2.select('.recipe-ingredients h3, .td-right'):
            if ingredient.name == 'h3':
                continue
            else:
                ingredients.append([id2, re.sub(r'\s{2,}', ' ', ingredient.get_text(strip=True))])
    
    result = pd.DataFrame(ingredients, columns=["ID", "Ingredients"])
    result.loc[:, "unit"] = einheit
    result.loc[:, "amount"] = menge
    

    Output:

     >>> result
         ID                                        Ingredients   unit amount
     0    0  Beinscheibe(n), vom Rind, ca. 4 cm dick geschn...             4
     1    0                                               Mehl         etwas
     2    0                                         Zwiebel(n)             1
     3    0                                   Knoblauchzehe(n)             2
     4    0                                         Karotte(n)             1
     5    0                                     Lauchstange(n)             1
     6    0                                    Staudensellerie           0.5
     7    0                                Tomate(n), geschält   Dose      1
     8    0                                        Tomatenmark     EL      1
     9    0                              Rotwein zum Ablöschen
     10   0                       Rinderfond oder Fleischbrühe  Liter    0.5
     11   0                                Olivenöl zum Braten
     12   0                                     Gewürznelke(n)             2
     13   0                                       Pimentkörner            10
     14   0                                  Wacholderbeere(n)             5
     15   0                                      Pfefferkörner
     16   0                                               Salz
     17   0                    Pfeffer, schwarz, aus der Mühle
     18   0                                            Thymian
     19   0                                           Rosmarin
     20   0                            Zitrone(n), unbehandelt             1
     21   0                                   Knoblauchzehe(n)             2
     22   0                                    Blattpetersilie   Bund      1