Search code examples
regexawkfloating-pointis-emptyscientific-notation

AWK filter rows with valid numbers or empty


I try to use awk to filter some rows based on whether a given column contains an integer, floating point or empty string.

I have problems with the integers or the way to add them to the rule but avoiding adding weird expressions like 2.e.

INPUT:
220802,2249,3
220802,2250,37.5
220802,2251,2.e
220802,2252,28
220802,2253,
220802,2254,19.5
220802,2255,3.a
220802,2256,123.25
220802,2257,1.32e-8
220802,2258,N/A
EXPECTED OUTPUT:
220802,2249,3
220802,2250,37.5
220802,2252,28
220802,2253,
220802,2254,19.5
220802,2256,123.25
220802,2257,1.32e-8

I try this, but fails:

awk -F , '$3 ~ /(^$|[0-9]+\.?[0-9]+e?[-+]?[0-9]?+)/' INPUT
220802,2250,37.5
220802,2252,28
220802,2253,
220802,2254,19.5
220802,2256,123.25
220802,2257,1.32e-8

Also try this:

awk -F , '$3 ~ /(^$|[0-9]+|[0-9]+\.?[0-9]+e?[-+]?[0-9]?+)/'
220802,2249,3
220802,2250,37.5
220802,2251,2.e
220802,2252,28
220802,2253,
220802,2254,19.5
220802,2255,3.a
220802,2256,123.25
220802,2257,1.32e-8

Any aproach to consider? Thanks in advance!


Solution

  • Checking for empty is easy. You could use $3~/^$/ (as in your regex) but in this case, as you will also test if the field is a number, it's even shorter to logically "not" it, making use of the fact that only empty string and numbers equal to zero (0, 0.0, -0, etc) are false.

    To test if awk considers a field (or variable) to be a number, you can compare it to itself plus zero.

    So:

    awk -F, '$3~/^$/ || ($3+0 == $3)' INPUT
    

    or:

    awk -F, '!$3 || ($3+0 == $3)' INPUT
    

    The reason your regex didn't work is that you didn't anchor it. For example, [0-9]+ matches if there is a digit, even when it is embedded in something that is not a number.

    Matching all valid representations of number is quite error-prone.