So, i've been looking all over and i can't seem to figure out why i can't get the results from my scrape to write to a xlsx file.
I'm running a list of urls from a .csv file. I throw 10 urls in there, beautifulsoup scrapes them. If i just print the dataframe, it comes our right.
If i try and save the results as a xlsx(which is preferred) or csv, it will only give me the results from the last url.
If i run this, it prints out perfect
with open('G-Sauce_Urls.csv' , 'r') as csv_file:
csv_reader = csv.reader(csv_file)
for line in csv_reader:
r = requests.get(line[0]).text
soup = BeautifulSoup(r,'lxml')
business = soup.find('title')
companys = business.get_text()
phones = soup.find_all(text=re.compile("Call (.*)"))
Website = soup.select('head > link:nth-child(4)')
profile = (Website[0].attrs['href'])
data = {'Required':[companys], 'Required_no_Email':[phones], 'Business_Fax':[profile] }
df = pd.DataFrame(data, columns = ['Required','First', 'Last', 'Required_no_Email', 'Business_Fax'])
But i can't seem to get it to append to an xlsx file. I'm only getting the last result, which i figure is because it is just "writing" and not appending.
I've tried:
writer = pd.ExcelWriter("ProspectUploadSheetRob.xlsx", engine='xlsxwriter', mode='a')
df.to_excel(writer, sheet_name='Sheet1', index=False, startrow=4, header=3)
workbook = writer.book
worksheet = writer.sheets['Sheet1']
writer.save()
AND
with ExcelWriter('path_to_file.xlsx', mode='a') as writer:
df.to_excel(writer, sheet_name='Sheet1', index=False, startrow=4, header=3)
writer.save()
AND
df = pd.DataFrame(data, columns = ['Required','First', 'Last', 'Required_no_Email', 'Business_Fax'])
writer = pd.ExcelWriter("ProspectUploadSheetRob.xlsx", engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1', index=False, startrow=4, header=3)
writer.save()
AND
I started reading into openpyxl, but at this point I am so confused, i don't understand it.
Any and all help is appreciated
You are iterating over your csv data line-by-line, but you are recreating your dataframe at every iteration, so you are losing the value of the previous one each time. You will need to create the df first outside of the loop, and add data in your for loop.
df = pd.DataFrame(columns = ['Required','First', 'Last', 'Required_no_Email', 'Business_Fax'])
>>> df
Empty DataFrame
Columns: [Required, First, Last, Required_no_Email, Business_Fax]
Index: []
Your assumption of writing and not appending is correct, but you need to append the dataframe and then write it to excel, and not append data to the excel(if I understood correctly).
data = {'Required':[companys], 'Required_no_Email':[phones], 'Business_Fax':[profile] }
df = df.append(data, ignore_index=True) # use this instead of this part of your original code below:
# df = pd.DataFrame(data, columns = ['Required','First', 'Last', 'Required_no_Email', 'Business_Fax'])
# this will not be required as you have already defined the df outside the loop
The pd.ExcelWriter will only produce the output when you run:
writer.save()
I have a similar code that opens the file with the following parameters and it works:
writer = pd.ExcelWriter(r'path_to_file.xlsx', engine='xlsxwriter')
... all my modifications ...
writer.save()
Note that according to the documentation 'w' or Write is the default mode, also when modifying object, and although not explained greatly, append is referenced only when adding entirely new excel objects(Sheets, etc.), or "extending" the document with another dataframe with the exact same format to the document structure. For it to be reproducable, you could add a template xlsx, but I hope it helps. Please let me know.