How to strip extra spaces when writing from dataframe to csv


I read in multiple sheets (6) from an xlsx file and created individual dataframes. I want to write each one out to a pipe-delimited csv.

ind_dim.to_csv (r'/mypath/ind_dim_out.csv', index = None, header=True, sep='|')

It currently outputs like this: 1|value1 |value2 |word1 word2 word3 etc.

I want to strip the trailing blanks from each field.


Solution

  • Suggestion

    Chain .apply(lambda x: x.str.rstrip()) onto the DataFrame, before the .to_csv() call, to strip trailing whitespace from every field. It would look like:

    Change:

    ind_dim.to_csv(r'/mypath/ind_dim_out.csv', index = None, header=True, sep='|')
    

    To:

    ind_dim.apply(lambda x: x.str.rstrip()).to_csv(r'/mypath/ind_dim_out.csv', index = None, header=True, sep='|')
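
    To see the effect without touching the file system, here is a minimal sketch that writes to an in-memory buffer instead of a path. The sample DataFrame and its column names are made up for illustration, and every column is assumed to hold strings already.

```python
import io

import pandas as pd

# Hypothetical sample data: all columns are strings, some with trailing blanks.
ind_dim = pd.DataFrame({
    "id": ["1", "2"],
    "col_a": ["value1 ", "value2   "],
    "col_b": ["word1 word2 ", "word3"],
})

# Strip trailing whitespace column by column, then write pipe-delimited text.
buffer = io.StringIO()
ind_dim.apply(lambda x: x.str.rstrip()).to_csv(buffer, index=False, header=True, sep="|")

csv_text = buffer.getvalue()
print(csv_text)
```

    Spaces inside a value (as in "word1 word2") are kept; only the blanks at the right-hand end of each field are removed.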
    

    Because it is just a chained method call, it slots in directly ahead of .to_csv(). The .str accessor only works on string columns, so to handle multiple data types we can read every column as a string on import by including the argument dtype='str':

    ind_dim = pd.read_excel('testing_xlsx_nums.xlsx', header=0, index_col=0, sheet_name=None, dtype='str')
    

    Or on the DataFrame itself by:

    df = pd.DataFrame(df, dtype='str')
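
    If you would rather keep numeric columns numeric, one alternative is to strip only the cells that are actually strings. This is a sketch of a different technique from the one above, not part of the original suggestion; the helper name rstrip_strings and the sample data are hypothetical.

```python
import pandas as pd


def rstrip_strings(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a new DataFrame with trailing whitespace removed from string cells only."""
    return frame.apply(
        # Series.map visits each cell; non-string cells are passed through unchanged.
        lambda col: col.map(lambda v: v.rstrip() if isinstance(v, str) else v)
    )


# Hypothetical mixed-type data: one integer column, one string column.
mixed = pd.DataFrame({"num": [1, 2], "txt": ["a ", " b "]})
cleaned = rstrip_strings(mixed)
print(cleaned)
```

    The trade-off is a Python-level check on every cell, which is slower than the vectorised .str methods on large frames.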
    

    Proof

    I did a mock-up where the .xlsx document has 5 sheets, with each sheet having three columns: the first column with all numbers except an empty cell in row 2; the second column with both a leading blank and a trailing blank on strings, an empty cell in row 3, and a number in row 4; and the third column with all strings having a leading blank, and an empty value in row 4. An integer index and integer column labels have been included. The text in each sheet is:

        0   1   2
    0   11111    valueB1     valueC1
    1        valueB2     valueC2
    2   33333        valueC3
    3   44444   44444   
    4   55555    valueB5     valueC5
    

    The code below reads our workbook testing_xlsx_nums.xlsx into ind_dim, a dictionary of DataFrames keyed by sheet name (this is what sheet_name=None returns).

    Next, a for loop iterates over the sheet names and uses each one as the key to look up that sheet's DataFrame. Passing lambda x: x.str.rstrip() to .apply() runs .str.rstrip() on every column of the sheet.

    Finally, it writes each sheet's DataFrame to a pipe-delimited .csv using .to_csv(), as in the original question.

    import pandas as pd

    # read every sheet into a dict of DataFrames keyed by sheet name, all columns as strings
    ind_dim = pd.read_excel('testing_xlsx_nums.xlsx', header=0, index_col=0, sheet_name=None, dtype='str')
    
    # for each sheet: strip trailing whitespace, then write a pipe-delimited csv
    for sheet in ind_dim:
        ind_dim[sheet].apply(lambda x: x.str.rstrip()).to_csv(sheet + '_ind_dim_out.csv', sep='|')
    

    Returns:

    |0|1|2
    0|11111| valueB1| valueC1
    1|| valueB2| valueC2
    2|33333|| valueC3
    3|44444|44444|
    4|55555| valueB5| valueC5
    

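    (Note that the strings in the second column, labelled 1, no longer have the trailing space. The leading spaces are still there because rstrip() only strips from the right.)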

    We can also loop over the dictionary's items directly; the syntax is for k, v in ind_dim.items(), where k is the sheet name and v is that sheet's DataFrame:

    # reads xlsx in 
    ind_dim = pd.read_excel('testing_xlsx_nums.xlsx', header=0, index_col=0, sheet_name=None, dtype='str')
    
    # loops through sheets, applies rstrip(), output as csv '|' delimit
    for k, v in ind_dim.items():
        v.apply(lambda x: x.str.rstrip()).to_csv(k + '_ind_dim_out.csv', sep='|')
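
    The same loop can be tried without an Excel file. The sketch below assumes a hand-built dictionary of DataFrames standing in for what pd.read_excel(..., sheet_name=None) returns; the sheet names, column names, and temporary output directory are all hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the dict of DataFrames returned by read_excel.
sheets = {
    "sheet_one": pd.DataFrame({"a": ["x ", "y"], "b": [" p ", "q  "]}),
    "sheet_two": pd.DataFrame({"a": ["m  "], "b": ["n"]}),
}

out_dir = Path(tempfile.mkdtemp())  # throwaway directory for the output files
written = {}

for name, frame in sheets.items():
    out_path = out_dir / f"{name}_ind_dim_out.csv"
    # Strip trailing whitespace from every column, then write pipe-delimited.
    frame.apply(lambda x: x.str.rstrip()).to_csv(out_path, index=False, sep="|")
    written[name] = out_path.read_text().splitlines()

print(written)
```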
    

    Notes:

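    We'll still need to set the arguments that control how indexes and column labels are read and written (for example header=, names= and index_col= on import, and index= and header= in .to_csv()) as needed. The examples above use header=0 and index_col=0 on import and leave the .to_csv() defaults in place for simplicity.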

    The related methods are .str.lstrip(), which strips leading spaces, and .str.strip(), which strips both leading and trailing spaces. They can be applied to an entire DataFrame in the same way, for example .apply(lambda x: x.str.strip()).

    Only 1 Column: If we only wanted to strip from one column, we can call the .str methods directly on the column itself. For example, to strip leading & trailing spaces from a column named column2 in DataFrame df we would write: df.column2.str.strip().
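
    As a quick comparison of the three methods on a single column, here is a small sketch; the DataFrame and its values are made up, with column2 borrowed from the example above.

```python
import pandas as pd

# Hypothetical single-column data with blanks on the left, right, or both sides.
df = pd.DataFrame({"column2": ["  both  ", "left  ", "  right"]})

right_stripped = df["column2"].str.rstrip().tolist()  # trailing blanks removed
left_stripped = df["column2"].str.lstrip().tolist()   # leading blanks removed
both_stripped = df["column2"].str.strip().tolist()    # both ends removed

# The .str methods return a new Series, so assign it back to keep the change.
df["column2"] = df["column2"].str.strip()
print(df)
```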

    Data types not string: When importing our data, pandas infers a data type for each column from its contents, so a column of numbers will not be read as strings and the .str methods will fail on it. We can override this by passing dtype='str' to the pd.read_excel() call when importing.

    pandas 1.0.1 documentation (04/30/2020) on pandas.read_excel:

    "dtype : Type name or dict of column -> type, default None

    Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32} Use object to preserve data as stored in Excel and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion."

    We can pass the argument dtype='str' when importing with pd.read_excel() (as seen above). If we want to enforce a single data type on a DataFrame we are already working with, we can pass it to pd.DataFrame() with the argument dtype='str' and assign the result back, like: df = pd.DataFrame(df, dtype='str')
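
    To illustrate why the dtype matters, here is a minimal sketch. It uses pd.read_csv on an in-memory string as a stand-in for pd.read_excel so that no workbook is needed (that substitution is an assumption for the demo); the data and column names are hypothetical.

```python
import io

import pandas as pd

raw = "id|label\n1|alpha \n2|beta  \n"

# Default behaviour: pandas infers an integer dtype for the "id" column.
inferred = pd.read_csv(io.StringIO(raw), sep="|")

# With dtype=str every column is read as text, so .str methods work everywhere.
as_text = pd.read_csv(io.StringIO(raw), sep="|", dtype=str)

# The .str accessor raises AttributeError on the non-string "id" column.
try:
    inferred.apply(lambda x: x.str.rstrip())
    failed_on_inferred = False
except AttributeError:
    failed_on_inferred = True

stripped = as_text.apply(lambda x: x.str.rstrip())
print(failed_on_inferred)
print(stripped)
```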

    Hope it helps!