python mysql csv beautifulsoup export-to-csv

Python splitting strings and convert them to a list that notices empty fields

it took me the whole day trying to fix this problem but I didn't found a solution so I hope you can help me. I already tried to extract the data from the website. But the problem is that I don't know how to split the list so that 500g converts to 500,g. The problem is that on the website sometimes the quantity is 1 and sometimes 1 1/2 kg or sth. And now I need to convert it into a CSV file and then into a MySQL database. What I want at the end is a CSV file with the columns: ingredients ID, ingredients, quantity, and the unit of the quantity from the ingredient. So for example: 0, meat, 500, g. This is the code I already have to extract the data from this website:

import re
from bs4 import BeautifulSoup
import requests
import csv

urls_recipes = ['https://www.chefkoch.de/rezepte/3039711456645264/Ossobuco-a-la-Milanese.html']
mainurl = "https://www.chefkoch.de/rs/s0e1n1z1b0i1d1,2,3/Rezepte.html"
urls_urls = []
urls_recipes = ['https://www.chefkoch.de/rezepte/3039711456645264/Ossobuco-a-la-Milanese.html']
ingredients = []
menge = []

def read_recipes():
    for url, id2 in zip(urls_recipes, range(len(urls_recipes))):
        soup2 = BeautifulSoup(requests.get(url).content, "lxml")
        for ingredient in soup2.select('.td-left'):
            menge.append([*[re.sub(r'\s{2,}', ' ', ingredient.get_text(strip=True))]])
        for ingredient in soup2.select('.recipe-ingredients h3, .td-right'):
            if ingredient.name == 'h3':
                ingredients.append([id2, *[ingredient.get_text(strip=True)]])
            else:
                ingredients.append([id2, *[re.sub(r'\s{2,}', ' ', ingredient.get_text(strip=True))]])

        read_recipes()

I hope you can help me Thank You!

Solution

It appears that the strings containing fractions use the unicode symbols for 1/2 etc., so I think a good way of starting is replacing those by looking up the specific code and passing it to str.replace(). Splitting up the units and the amount for this example was easy, since they are separated by a space. But it might be necessary to generalize this more if you encounter other combinations. The following code works for this specific example:

import re
from bs4 import BeautifulSoup
import requests
import csv
import pandas as pd

urls_recipes = ['https://www.chefkoch.de/rezepte/3039711456645264/Ossobuco-a-la-Milanese.html']
mainurl = "https://www.chefkoch.de/rs/s0e1n1z1b0i1d1,2,3/Rezepte.html"
urls_urls = []
urls_recipes = ['https://www.chefkoch.de/rezepte/3039711456645264/Ossobuco-a-la-Milanese.html']
ingredients = []
menge = []
einheit = []


for url, id2 in zip(urls_recipes, range(len(urls_recipes))):
    soup2 = BeautifulSoup(requests.get(url).content)
    for ingredient in soup2.select('.td-left'):
        # get rid of multiple spaces and replace 1/2 unicode character
        raw_string = re.sub(r'\s{2,}', ' ', ingredient.get_text(strip=True)).replace(u'\u00BD', "0.5")
        # split into unit and number
        splitlist = raw_string.split(" ")
        menge.append(splitlist[0])
        if len(splitlist) == 2:
            einheit.append(splitlist[1])
        else:
            einheit.append('')
    for ingredient in soup2.select('.recipe-ingredients h3, .td-right'):
        if ingredient.name == 'h3':
            continue
        else:
            ingredients.append([id2, re.sub(r'\s{2,}', ' ', ingredient.get_text(strip=True))])

result = pd.DataFrame(ingredients, columns=["ID", "Ingredients"])
result.loc[:, "unit"] = einheit
result.loc[:, "amount"] = menge

Output:

 >>> result
     ID                                        Ingredients   unit amount
 0    0  Beinscheibe(n), vom Rind, ca. 4 cm dick geschn...             4
 1    0                                               Mehl         etwas
 2    0                                         Zwiebel(n)             1
 3    0                                   Knoblauchzehe(n)             2
 4    0                                         Karotte(n)             1
 5    0                                     Lauchstange(n)             1
 6    0                                    Staudensellerie           0.5
 7    0                                Tomate(n), geschält   Dose      1
 8    0                                        Tomatenmark     EL      1
 9    0                              Rotwein zum Ablöschen
 10   0                       Rinderfond oder Fleischbrühe  Liter    0.5
 11   0                                Olivenöl zum Braten
 12   0                                     Gewürznelke(n)             2
 13   0                                       Pimentkörner            10
 14   0                                  Wacholderbeere(n)             5
 15   0                                      Pfefferkörner
 16   0                                               Salz
 17   0                    Pfeffer, schwarz, aus der Mühle
 18   0                                            Thymian
 19   0                                           Rosmarin
 20   0                            Zitrone(n), unbehandelt             1
 21   0                                   Knoblauchzehe(n)             2
 22   0                                    Blattpetersilie   Bund      1