Read in multiple sheets (6) from an xlsx file and created individual dataframes. Want to write each one out to a pipe delimited csv.
ind_dim.to_csv(r'/mypath/ind_dim_out.csv', index=None, header=True, sep='|')
Currently outputs like this: 1|value1 |value2 |word1 word2 word3 etc.
Want to strip trailing blanks
Suggestion
Add .apply(lambda x: x.str.rstrip()) to your output expression (before the .to_csv() call) to strip trailing blanks from each field across the DataFrame. It would look like:
Change:
ind_dim.to_csv(r'/mypath/ind_dim_out.csv', index = None, header=True, sep='|')
To:
ind_dim.apply(lambda x: x.str.rstrip()).to_csv(r'/mypath/ind_dim_out.csv', index = None, header=True, sep='|')
It can be chained directly onto the output expression using '.' notation. The .str accessor only works on string data, so to handle multiple data types we can read every column in as strings (stored with the 'object' dtype) by including the argument dtype='str' on import:
ind_dim = pd.read_excel('testing_xlsx_nums.xlsx', header=0, index_col=0, sheet_name=None, dtype='str')
Or on the DataFrame itself by:
df = pd.DataFrame(df, dtype='str')
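To make the change easy to try without an Excel file, here is a minimal sketch using a small in-memory DataFrame. The column names and values are made up for illustration, and to_csv() is called without a path so that it returns the CSV text instead of writing a file:

```python
import pandas as pd

# Hypothetical data standing in for one sheet; every column holds strings
df = pd.DataFrame({
    "id": ["1", "2"],
    "name": ["value1 ", "value2   "],       # trailing blanks to remove
    "note": [" word1 word2 ", "word3"],     # leading blank should survive
})

# Strip trailing whitespace column by column, then render as pipe-delimited text
stripped = df.apply(lambda col: col.str.rstrip())
csv_text = stripped.to_csv(index=False, header=True, sep="|")
print(csv_text)
```

Only the right-hand side of each value is trimmed, so the leading blank in the note column is kept.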
Proof
I did a mock-up where the .xlsx document has 5 sheets, with each sheet having three columns: the first column with all numbers except an empty cell in row 2; the second column with both a leading blank and a trailing blank on strings, an empty cell in row 3, and a number in row 4; and the third column with all strings having a leading blank, and an empty value in row 4. Integer indexes and integer columns have been included. The text in each sheet is:
       0        1        2
0  11111  valueB1  valueC1
1         valueB2  valueC2
2  33333           valueC3
3  44444    44444
4  55555  valueB5  valueC5
This code reads our workbook testing_xlsx_nums.xlsx into ind_dim, a dictionary of DataFrames keyed by sheet name (that is what sheet_name=None returns).
Next, a for loop iterates over the sheet names and uses each one as the key to look up that sheet's DataFrame. It strips every column of the sheet by passing lambda x: x.str.rstrip() to the .apply() method called on the DataFrame.
Finally, it writes each DataFrame out as a .csv with the pipe delimiter using .to_csv(), as in the question.
import pandas as pd

# read every sheet of the workbook into a dict of DataFrames, all values as strings
ind_dim = pd.read_excel('testing_xlsx_nums.xlsx', header=0, index_col=0, sheet_name=None, dtype='str')
# loop through the sheets, apply rstrip(), write each one as a '|'-delimited csv
for sheet in ind_dim:
    ind_dim[sheet].apply(lambda x: x.str.rstrip()).to_csv(sheet + '_ind_dim_out.csv', sep='|')
Returns:
|0|1|2
0|11111| valueB1| valueC1
1|| valueB2| valueC2
2|33333|| valueC3
3|44444|44444|
4|55555| valueB5| valueC5
(Note that the strings in the second column, labelled 1, no longer have the trailing space; the leading spaces are left in place.)
We can also reference each sheet using a loop that cycles through the dictionary items; the syntax is for k, v in dict.items(), where k and v are the key and value:
# read every sheet of the workbook into a dict of DataFrames, all values as strings
ind_dim = pd.read_excel('testing_xlsx_nums.xlsx', header=0, index_col=0, sheet_name=None, dtype='str')
# loop through (sheet name, DataFrame) pairs, apply rstrip(), write each one as a '|'-delimited csv
for k, v in ind_dim.items():
    v.apply(lambda x: x.str.rstrip()).to_csv(k + '_ind_dim_out.csv', sep='|')
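If you want to check the loop without a workbook on disk, the sketch below swaps in a hand-built dictionary of DataFrames for the result of pd.read_excel(..., sheet_name=None). The sheet names, column names, and the temporary output directory are all assumptions made for this example:

```python
import os
import tempfile

import pandas as pd

# Stand-in for the dict returned by pd.read_excel(..., sheet_name=None)
sheets = {
    "Sheet1": pd.DataFrame({"a": ["x ", "y  "], "b": [" p", "q "]}),
    "Sheet2": pd.DataFrame({"a": ["z "], "b": ["r"]}),
}

out_dir = tempfile.mkdtemp()  # scratch directory so nothing lands in the working dir
written = {}                  # sheet name -> path of the csv written for it

for name, frame in sheets.items():
    path = os.path.join(out_dir, name + "_ind_dim_out.csv")
    # Strip trailing blanks in every column, then write pipe-delimited
    frame.apply(lambda col: col.str.rstrip()).to_csv(path, index=False, sep="|")
    written[name] = path
```

Each sheet ends up in its own file, named after the dictionary key.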
Notes:
We'll still need to pass the right arguments for selecting or ignoring index and header rows, using parameters such as header=, index_col= and names= as needed. For these examples I kept them minimal for simplicity (header=0 and index_col=0 on read, index=None in the first to_csv call).
The other methods, which strip leading spaces and both leading & trailing spaces, are .str.lstrip() and .str.strip() respectively. They can be applied to an entire DataFrame the same way, for example df.apply(lambda x: x.str.strip()).
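A quick side-by-side of the three methods on a throwaway Series (the values are invented for this example):

```python
import pandas as pd

s = pd.Series([" valueB1 ", " valueB2 "])

right = s.str.rstrip().tolist()  # trailing blanks removed
left = s.str.lstrip().tolist()   # leading blanks removed
both = s.str.strip().tolist()    # both sides removed

print(right, left, both)
```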
Only 1 Column: If we only want to strip one column, we can call the .str methods directly on the column itself. For example, to strip leading & trailing spaces from a column named column2 in DataFrame df we would write: df.column2.str.strip(). This returns a new Series rather than changing df, so assign it back to keep the result: df['column2'] = df['column2'].str.strip().
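As a small sketch of the single-column case, assuming a DataFrame with made-up columns column1 and column2:

```python
import pandas as pd

df = pd.DataFrame({"column1": [" a ", " b "], "column2": [" c ", " d "]})

# Strip only column2 and assign the result back; column1 is left untouched
df["column2"] = df["column2"].str.strip()
```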
Data types not string: When importing our data, pandas infers a data type for each column from its values, so numeric columns are not read as strings and the .str methods will raise an error on them. We can override this by passing dtype='str' to the pd.read_excel() call when importing.
pandas 1.0.1 documentation (04/30/2020) on pandas.read_excel:
"dtype: Type name or dict of column -> type, default None
Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32} Use object to preserve data as stored in Excel and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion."
We can pass the argument dtype='str' when importing with pd.read_excel() (as seen above). If we want to enforce a single data type on a DataFrame we are already working with, we can pass it back through pd.DataFrame() with the argument dtype='str', like: df = pd.DataFrame(df, dtype='str')
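If you would rather keep numeric columns numeric, one alternative (not what the examples above do) is to strip only the values that are actually strings. The sketch below first shows the plain .str call failing on a mixed DataFrame, then uses a hypothetical helper, rstrip_strings, that leaves non-string values alone. The data and the helper name are assumptions for illustration:

```python
import pandas as pd

mixed = pd.DataFrame({
    "num": [11111, 33333],             # integer column, no .str accessor
    "text": ["valueB1 ", "valueB2  "], # string column with trailing blanks
})

# The plain approach fails because .str is only available on string data
try:
    mixed.apply(lambda col: col.str.rstrip())
    failed = False
except AttributeError:
    failed = True

def rstrip_strings(col):
    """Right-strip string values in a column; pass other values through."""
    return col.map(lambda v: v.rstrip() if isinstance(v, str) else v)

cleaned = mixed.apply(rstrip_strings)
```

This avoids converting everything to strings, at the cost of a slower element-by-element pass.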
Hope it helps!