Search code examples
regexcsvnotepad++removing-whitespace

Trim values of CSV using Notepad++ only if the value is not blank


I have the following CSV:

COLUMN_A;COLUMN_B;COLUMN_C;COLUMN_D;COLUMN_E;COLUMN_F;COLUMN_G;COLUMN_H;COLUMN_I;COLUMN_L;
01234;AB ;00001; ;100000001;   ;ABC;+0000000000099998,080;XYZ        ;

I would like to remove the white spaces only if the value is not blank. So the result will be like this:

COLUMN_A;COLUMN_B;COLUMN_C;COLUMN_D;COLUMN_E;COLUMN_F;COLUMN_G;COLUMN_H;COLUMN_I;COLUMN_L;
01234;AB;00001; ;100000001;   ;ABC;+0000000000099998,080;XYZ;

I know that I can use find ' ' replace '' so the space will be replaced by nothing. But in this way, I will remove all the spaces, and I want to maintain the string with spaces where there aren't any other character.

If I have to use regular expression (my first time), I think I need to concatenate the following expression:

[a-zA-Z]
[\s]
[;]

so I can use [a-zA-Z][\s][;] in the find box,
but I don't know how to replace what it finds with [a-zA-Z][;]

UPDATE

The answer of @Trincot is good, but if one of the columns contains for example a description with spaces, these will be deleted

This
COLUMN_A;COLUMN_B;COLUMN_C;
01234;AB;My name is Paul;

will became:
COLUMN_A;COLUMN_B;COLUMN_C;
01234;AB;MynameisPaul;

for that reason, I prefer to capture the last char, all the white spaces and the semicolon

([a-zA-Z0-9])([ ]{1,})([;])

and replace it with the first and last group

$1$3

Solution

  • You can distinguish the cases where the string of spaces is not prefixed with a semi-colon or white-space, and the cases where the string of spaces is not suffixed with a semi-colon or white-space.

    For this you can use look-around:

    Find what: \h+(?![;\s])|(?<![;\s])\h+
    Replace with: (empty)
    ⦿ Regular expression

    Replace all

    Explanation

    • \h matches a horizontal white-space character (so not newline)
    • (?![;\s]) is a negative look ahead: what follows at the current position should not be one of those characters, but they are not included in the overall match.
    • (?<![;\s]) is a negative look behind: what precedes at the current position should not be one of those characters, but they are not included in the overall match.