I'm developing the following code to scrape financial data from a specific website:
import requests
import pandas as pd

urls = ['https://www.marketwatch.com/investing/stock/aapl/financials/cash-flow',
        'https://www.marketwatch.com/investing/stock/aapl/financials/cash-flow/quarter',
        'https://www.marketwatch.com/investing/stock/MSFT/financials/cash-flow',
        'https://www.marketwatch.com/investing/stock/MSFT/financials/cash-flow/quarter']


def main(urls):
    with requests.Session() as req:
        goal = []
        for url in urls:
            r = req.get(url)
            df = pd.read_html(
                r.content, match="Cash Dividends Paid - Total")[0].iloc[[0], 3:6]
            goal.append(df)
        new = pd.concat(goal)
        print(new)


main(urls)
I'm getting the information that I need:
       2017      2018      2019  30-Sep-2019  31-Dec-2019  31-Mar-2020
0  (12.77B)  (13.71B)  (14.12B)          NaN          NaN          NaN
0       NaN       NaN       NaN      (3.48B)      (3.54B)      (3.38B)
0  (11.85B)   (12.7B)  (13.81B)          NaN          NaN          NaN
0       NaN       NaN       NaN      (3.51B)      (3.89B)      (3.88B)
I need to scrape at least 20 companies from the same source. The URL is basically the same for each company except for one element, which I will call index:
'https://www.marketwatch.com/investing/stock/' + index + '/financials/cash-flow'
Is there a way to add a variable called Index and iterate over its values to build the URLs? Something like:
import requests
import pandas as pd
Index = 'MSFT, AAPL'
and
urls = ['https://www.marketwatch.com/investing/stock/' + Index + '/financials/cash-flow',
        'https://www.marketwatch.com/investing/stock/' + Index + '/financials/cash-flow/quarter']
A straightforward solution is to use a nested loop (tickers in the outer loop, URL templates in the inner loop) and string formatting with str.format to construct each URL.
For example:
import requests
import pandas as pd

indexes = 'aapl', 'MSFT', 'F'


def main(indexes):
    urls = ['https://www.marketwatch.com/investing/stock/{index}/financials/cash-flow',
            'https://www.marketwatch.com/investing/stock/{index}/financials/cash-flow/quarter']
    goal = []
    with requests.Session() as req:
        for index in indexes:
            for url in urls:
                url = url.format(index=index)
                print('Processing url', url)
                r = req.get(url)
                df = pd.read_html(
                    r.content, match="Cash Dividends Paid - Total")[0].iloc[[0], 3:6]
                goal.append(df)
    new = pd.concat(goal)
    print(new)


main(indexes)
Prints:
Processing url https://www.marketwatch.com/investing/stock/aapl/financials/cash-flow
Processing url https://www.marketwatch.com/investing/stock/aapl/financials/cash-flow/quarter
Processing url https://www.marketwatch.com/investing/stock/MSFT/financials/cash-flow
Processing url https://www.marketwatch.com/investing/stock/MSFT/financials/cash-flow/quarter
Processing url https://www.marketwatch.com/investing/stock/F/financials/cash-flow
Processing url https://www.marketwatch.com/investing/stock/F/financials/cash-flow/quarter
       2017      2018      2019  30-Sep-2019  31-Dec-2019  31-Mar-2020
0  (12.77B)  (13.71B)  (14.12B)          NaN          NaN          NaN
0       NaN       NaN       NaN      (3.48B)      (3.54B)      (3.38B)
0  (11.85B)   (12.7B)  (13.81B)          NaN          NaN          NaN
0       NaN       NaN       NaN      (3.51B)      (3.89B)      (3.88B)
0   (2.58B)   (2.91B)   (2.39B)          NaN          NaN          NaN
0       NaN       NaN       NaN       (598M)       (595M)       (596M)
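Every row in that output has index 0, so with 20 companies it becomes hard to tell which row belongs to which ticker. As a possible refinement (only a sketch, not part of the code above), the URL construction can be pulled into its own function that also returns a label for each URL, and those labels can then replace the repeated 0 index. The names build_urls, label_rows, PERIODS and the "ticker"/"period" column names are hypothetical choices made for this illustration; the URL pattern is the one from the question.

```python
from itertools import product

import pandas as pd

# URL pattern from the question; {suffix} is empty for the annual page.
BASE = "https://www.marketwatch.com/investing/stock/{index}/financials/cash-flow{suffix}"

# Hypothetical labels for the two URL variants used above.
PERIODS = {"annual": "", "quarter": "/quarter"}


def build_urls(indexes):
    """Return one (index, period, url) tuple per ticker and period."""
    return [
        (index, period, BASE.format(index=index, suffix=suffix))
        for index, (period, suffix) in product(indexes, PERIODS.items())
    ]


def label_rows(frames, labels):
    """Concatenate single-row frames and index them by (ticker, period).

    `frames` and `labels` must be in the same order; each label is an
    (index, period) pair such as ("aapl", "annual").
    """
    labelled = [
        df.assign(ticker=index, period=period)
        for df, (index, period) in zip(frames, labels)
    ]
    return pd.concat(labelled).set_index(["ticker", "period"])
```

Inside the loop you would iterate over build_urls(indexes), append (index, period) to a list of labels each time you append df to goal, and finish with label_rows(goal, labels) instead of pd.concat(goal).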