I have a script to load and combine time-series data from two .csv files that have the same base filename (specified as a path using Pathlib
), but different suffixes. Given that the importing of the data from the .csv
files is working correctly, I include here a minimal working example with data provided using StringIO
:
import pandas as pd
from io import StringIO
def load_data():
headers_0 = ['a', 'b', 'c'] # Headers for first file. May have more entries than columns in file
headers_1 = ['d', 'e'] # Headers for second file.
data_0 = pd.read_csv(
StringIO(
"""
0 1 2
3 4 5
6 7 8
"""
)
, header=None, delim_whitespace=True)
data_0.columns = headers_0[0:data_0.shape[1]]
data_1 = pd.read_csv(
StringIO(
"""
A B
C D
"""
)
, header=None, delim_whitespace=True)
data_1.columns = headers_1[0:data_1.shape[1]]
data = data_0.join(data_1)
data.fillna(0, inplace=True)
print(data)
Thus far, I have only been using load_data
for datasets where both data_0
and data_1
have the same length of columns (same length of time-series). However, I am now encountering a situation where data_1
has a shorter column length than data_0
; this is because the data in data_1
only starts getting recorded at some later time than data_0
.
How do I use pandas to fill the columns of data_1
with leading zeros, such that the column length in both data_0
and data_1
is the same? I believe that the line data.fillna(0, inplace=True)
is filling the length mismatch with trailing zeros; is there an obvious way to change this to leading zeros? Note that I do not know the length of either dataset a priori, so I would appreciate help towards a solution that works based on the length of the data loaded using pandas. Running the example above gives an output
a b c d e
0 0 1 2 A B
1 3 4 5 C D
2 6 7 8 0 0
which is evidently not the desired effect (the zeros in columns d
and e
would be in row 0, rather than 2).
I have tried different options for DataFrame.fillna
such as method=backfill
, but none of these attempts have yielded the expected result.
Here's one approach:
dfs
with [::-1]
and apply df.reset_index
with drop=True
pd.concat
on axis=1
, and reverse + reset_index
againdf.fillna
import pandas as pd
from io import StringIO
def load_data(filename):
headers_0 = ['a', 'b', 'c']
headers_1 = ['d', 'e']
# using StringIO for mre
# + `sep="\s+"` for `delim_whitespace=True` (deprecated since 2.2.0)
data_0 = pd.read_csv(StringIO(c_0), header=None, sep="\s+")
data_0.columns = headers_0[0:data_0.shape[1]]
data_1 = pd.read_csv(StringIO(c_1), header=None, sep="\s+")
data_1.columns = headers_1[0:data_1.shape[1]]
data = (
pd.concat(
[df[::-1].reset_index(drop=True) for df in [data_0, data_1]],
axis=1
)[::-1]
.reset_index(drop=True)
.fillna(0)
)
return data
c_0 = """0 1 2
3 4 5
6 7 8"""
c_1 = """A B
C D
"""
load_data('c')
a b c d e
0 0 1 2 0 0
1 3 4 5 A B
2 6 7 8 C D
Will work regardless of which one is longer.
Another option could be to use df.shift
with fill_value
:
def load_data2(filename):
# `data_0` and `data_1` same as above
diff = len(data_0) - len(data_1)
data = pd.concat([data_0, data_1], axis=1)
if diff > 0:
data[data_1.columns] = data[data_1.columns].shift(diff, fill_value=0)
elif diff < 0:
data[data_0.columns] = data[data_0.columns].shift(abs(diff), fill_value=0)
return data
The nice thing about this option is that it will only fill NaN
values in the shifted area. E.g.:
c_0 = """0 1 2
3 4 5
6 7 8"""
c_1 = """A B
C
"""
load_data2('c')
a b c d e
0 0 1 2 0 0
1 3 4 5 A B
2 6 7 8 C NaN # with my `load_data` above, this `NaN` gets filled too!
Of course, if you want all NaN
values filled anyway, just use fillna
here as well.