Search code examples
dataframejuliamissing-data

Initialize a column with missing values and copy+transform another column of a dataframe into the initialized column


I have a messy column in a csv file (column A of the dataframe).

using CSV, DataFrames
df = DataFrame(A = ["1", "3", "-", "4", missing, "9"], B = ["M", "F", "R", "G", "Z", "D"])

Want I want to do is:

  1. Transform the integer from string to numeric (e.g. Float64)
  2. Transform the string "-" in missing

The strategy would be to first define a new column vector filled with missing

df[:C] = fill(missing, size(df)[1])

and then perform the 2 transformations with for loops

for i in 1:size(df)[1]
    if df[:A][i] == "-"
        continue
    else
        df[:C][i] = parse(Float64,df[:A][i])
    end
end

However, when looking at df[:C] I have a column filled only with missing. What am I doing wrong?


Solution

  • There are several issues with your code, but first let me show how I would write this transformation:

    df.C = passmissing(parse).(Float64, replace(df.A, "-"=>missing))
    

    It is not the most efficient way to do it but is simple to reason about.

    An implementation using a loop could look like:

    df.C = similar(df.A, Union{Float64, Missing});
    
    for (i, a) in enumerate(df.A)
        if !ismissing(a) && a != "-"
            df.C[i] = parse(Float64, a)
        else
            df.C[i] = missing
        end
    end
    

    Note that similar by default will fill the df.C with missing so the else part could be dropped, but this behavior is not documented so it is safer to write it.

    You could also use a comprehension:

    df. C = [ismissing(a) || a == "-" ? missing : parse(Float64, a) for a in df.A]
    

    Now, to fix your code you could write:

    # note a different initialization
    # in your code df.C allowed only values of Missing type and disallows of Float64 type
    df.C = Vector{Union{Float64, Missing}}(missing, size(df, 1))
    
    for i in 1:size(df)[1]
        # note that we need to handle missing value and "=" separately
        if ismissing(df.A[i]) || df.A[i] == "-"
            continue
        else
            df.C[i] = parse(Float64,df.A[i])
        end
    end
    

    Finally note that it is preferred to write df.C than df[:C] to access a column in a data frame (currently both are equivalent but this might change in the future).