web-scraping beautifulsoup jupyter-notebook urllib

How to fix Python writing multiple lines to a .csv file instead of one?


I am trying to scrape data from a public forum for a school project, but every time I run the code, the resulting .csv file shows multiple rows for the text variable instead of just one.


from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq

my_url = 'https://www.emimino.cz/diskuse/1ivf-repromeda-56566/'

uClient = uReq(my_url)
page_soup = soup(uClient.read(),"html.parser")
uClient.close()

containers = page_soup.findAll("div",{"class":"discussion_post"})


out_filename = "Repromeda.csv"
headers = "text,user_name,date \n"

f = open(out_filename, "w")
f.write(headers)

for container in containers:
    text1 = container.div.p
    text = text1.text

    user_container = container.findAll("span",{"class":"user_category"})
    user_id = user_container[0].text

    date_container = container.findAll("span",{"class":"date"})
    date = date_container[1].text

    print("text: " + text + "\n" )
    print("user_id: " + user_id + "\n")
    print("date: " + date + "\n")
    # writes the dataset to file
    f.write(text.replace(",", "|") + ", " + user_id + ", " + date + "\n")

f.close()

Ideally, I want one row per data entry (i.e. text, user_id, and date together in one row), but instead I get multiple rows for a single text entry and only one row for its user_id and date.

[Screenshot: actual output]

[Screenshot: expected output]


Solution

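  • The post text contains newline characters, and each one starts a new line in the .csv file. Replace each newline with a space before writing the text.

    for container in containers:
        text1 = container.div.p
        text = text1.text.replace('\n', ' ')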
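
    As a possible alternative to replacing characters by hand, here is a minimal sketch using Python's standard csv module, which quotes fields that contain commas so the "|" substitution is no longer needed. The helper names clean and write_rows and the sample rows are made up for illustration; they are not from the question. In the real script, the rows would come from the scraping loop.

```python
import csv
import io


def clean(text):
    # Collapse every run of whitespace (newlines, tabs, repeated spaces)
    # into a single space, so one post always stays on one line.
    return " ".join(text.split())


def write_rows(rows, out_file):
    # csv.writer quotes any field that contains a comma or a quote character,
    # so commas in the post text do not shift the columns.
    writer = csv.writer(out_file)
    writer.writerow(["text", "user_name", "date"])
    for text, user_id, date in rows:
        writer.writerow([clean(text), user_id.strip(), date.strip()])


# Hypothetical sample data standing in for the scraped (text, user_id, date) values.
sample_rows = [
    ("First line,\nsecond line\n\nthird line", " user_a ", "1. 1. 2020"),
    ("Single line post", "user_b", "2. 1. 2020"),
]

# An in-memory buffer keeps the sketch self-contained; for a real file use
# open("Repromeda.csv", "w", newline="", encoding="utf-8") instead.
buffer = io.StringIO()
write_rows(sample_rows, buffer)
lines = buffer.getvalue().splitlines()

for line in lines:
    print(line)
```

    Opening the real output file with newline="" is what the csv module documentation recommends, since the writer emits its own line endings.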