I have a pandas data frame df
that looks like this.
import pandas as pd
url = "https://www-genesis.destatis.de/genesisWS/rest/2020/data/tablefile?username=DEB924AL95&password=P@ssword1234&name=42153-0002&area=all&compress=false&transpose=false&startyear=1900&endyear=×lices=®ionalvariable=®ionalkey=&classifyingvariable1=WERT03&classifyingkey1=BV4TB&classifyingvariable2=WZ08V2&classifyingkey2=&classifyingvariable3=&classifyingkey3=&format=xlsx&job=false&stand=01.01.1970&language=en"
df = pd.read_excel(url, engine='openpyxl')
df = df.iloc[5:-3]
df.columns = ['Variable', 'Date', 'Value']
m = df['Date'].isna()
df['Date'] += '-' + df['Variable'].ffill()
df['Variable'] = df['Variable'].where(m).ffill()
df
import numpy as np
# Reshape your dataframe
out = (df[~m].replace('...', np.nan)
.pivot_table(index='Date', columns='Variable',
values='Value', sort=False)
.reset_index().rename_axis(columns=None))
out
This gives me only a Date
column and no values.
Date
0 January-1991
1 February-1991
2 March-1991
3 April-1991
What changes do I need to do to make it work?
The issue is that you pivot_table
with numerical aggregation (mean
by default), but your data is not numeric. First convert with to_numeric
:
out = (df[~m].assign(Value=lambda d: pd.to_numeric(d['Value'], errors='coerce'))
.pivot_table(index='Date', columns='Variable',
values='Value', sort=False)
.reset_index().rename_axis(columns=None)
)
Excerpt of the output:
Date WZ08-05 Mining of coal and lignite WZ08-06 Extraction of crude petroleum and natural gas WZ08-08 Other mining and quarrying WZ08-09 Mining support service activities \
0 January-1991 415.7 186.6 113.8 358.6
1 February-1991 409.6 187.7 114.6 360.3
2 March-1991 402.8 188.6 115.5 361.0
3 April-1991 397.0 189.6 116.6 361.3
4 May-1991 391.1 190.7 117.7 361.8
.. ... ... ... ... ...
388 May-2023 26.2 61.5 88.2 60.7
389 June-2023 25.2 60.8 87.4 66.6
390 July-2023 24.3 60.1 86.7 72.5
391 August-2023 23.4 59.5 85.9 78.1
392 September-2023 22.4 58.7 85.2 83.8