I am trying to mimic the following browser actions via python's requests
:
isHistorical: true
)This is the code I have to simulate this:
import requests
import re
s = requests.Session()
r = s.get("https://www.bundesanzeiger.de/pub/en/to_nlp_start", verify=False, allow_redirects=True)
matches = re.search(
r'form class="search-form" id=".*" method="post" action="\.(?P<appendtxt>.*)"',
r.text
)
request_url = f"https://www.bundesanzeiger.de/pub/en{matches.group('appendtxt')}"
sr = session.post(request_url, data={'isHistorical': 'true', 'nlp-search-button': 'Search net short positions'}, allow_redirects=True)
However, even though sr
gives me a status_code 200, it's really an error when I check sr.url
, which shows https://www.bundesanzeiger.de/pub/en/error-404?9
Digging a bit deeper, I noticed that request_url
above resolves to something like
https://www.bundesanzeiger.de/pub/en/nlp;wwwsid=EFEB15CD4ADC8932A91BA88B561A50E9.web07-pub?0-1.-nlp~filter~form~panel-form
but when I check the request url in Chrome, it's actually
https://www.bundesanzeiger.de/pub/en/nlp?87-1.-nlp~filter~form~panel-form`
The 87
here seems to change, suggesting it's some session ID, but when I'm doing this using requests
it doesn't appear to resolve properly.
Any idea what I'm missing here?
You can try this script to download the CSV file:
import requests
from bs4 import BeautifulSoup
url = 'https://www.bundesanzeiger.de/pub/en/to_nlp_start'
data = {
'fulltext': '',
'positionsinhaber': '',
'ermittent': '',
'isin': '',
'positionVon': '',
'positionBis': '',
'datumVon': '',
'datumBis': '',
'isHistorical': 'true',
'nlp-search-button': 'Search+net+short+positions'
}
headers = {
'Referer': 'https://www.bundesanzeiger.de/'
}
with requests.session() as s:
soup = BeautifulSoup(s.get(url).content, 'html.parser')
action = soup.find('form', action=lambda t: 'nlp~filter~form~panel-for' in t)['action']
u = 'https://www.bundesanzeiger.de/pub/en' + action.strip('.')
soup = BeautifulSoup( s.post(u, data=data, headers=headers).content, 'html.parser' )
a = soup.select_one('a[title="Download as CSV"]')['href']
a = 'https://www.bundesanzeiger.de/pub/en' + a.strip('.')
print( s.get(a, headers=headers).content.decode('utf-8-sig') )
Prints:
"Positionsinhaber","Emittent","ISIN","Position","Datum"
"Citadel Advisors LLC","LEONI AG","DE0005408884","0,62","2020-08-21"
"AQR Capital Management, LLC","Evotec SE","DE0005664809","1,10","2020-08-21"
"BlackRock Investment Management (UK) Limited","thyssenkrupp AG","DE0007500001","1,50","2020-08-21"
"BlackRock Investment Management (UK) Limited","Deutsche Lufthansa Aktiengesellschaft","DE0008232125","0,75","2020-08-21"
"Citadel Europe LLP","TAG Immobilien AG","DE0008303504","0,70","2020-08-21"
"Davidson Kempner European Partners, LLP","TAG Immobilien AG","DE0008303504","0,36","2020-08-21"
"Maplelane Capital, LLC","VARTA AKTIENGESELLSCHAFT","DE000A0TGJ55","1,15","2020-08-21"
...and so on.