
Why does using a Symbol directly give a different result from using a variable that holds the Symbol?


Here's the code:

using DataFrames, DataFramesMeta

# Creating a sample DataFrame
df = DataFrame(ID = 1:5, Med1 = [0, 1, 0, 1, 0])

# Using @rsubset directly
result1 = @rsubset df :Med1 == 0

# Using a variable that holds a Symbol
st = Symbol("Med1")
result2 = @rsubset df st == 0

# Checking if the results are the same
isequal(result1, result2)

The result is false - why?

I've tried many variations of this, and it never works unless I write the symbol literally in the expression. I'd appreciate some advice on best practices for working with DataFrames column names. (I have a bunch of datasets with numbered columns such as "Med1", "Med2", etc., and I want to iterate over those numbers, which is how I ended up constructing Symbols.)


Solution

  • Near the end of the Introduction section, the docs say:

    To reference columns inside DataFramesMeta macros, use Symbols. For example, use :x to refer to the column df.x. To use a variable varname representing a Symbol to refer to a column, use the syntax $varname.

    So (as the comment mentions), you need $st to have st's value used as the column name.
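
    As a minimal sketch of that fix, here is the question's example with the only change being $st in place of st (and the second result assigned to result2, which the original snippet never defined):

    ```julia
    using DataFrames, DataFramesMeta

    df = DataFrame(ID = 1:5, Med1 = [0, 1, 0, 1, 0])
    st = Symbol("Med1")

    # Literal Symbol: :Med1 is read as the column named Med1
    result1 = @rsubset df :Med1 == 0

    # Variable holding a Symbol: $st tells the macro to use the
    # value of st (here :Med1) as the column name
    result2 = @rsubset df $st == 0

    isequal(result1, result2)  # true
    ```

    Both calls keep the rows where Med1 is 0, so the two results compare equal.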

    The reason for this (to my understanding) doesn't have to do with any limitations or inner workings of Julia's metaprogramming, but rather with convention. st == 0 looks like it's comparing st's value to 0, so to have it silently compare the column whose name is contained within st would be unexpected and "magical". When building large codebases, this kind of magic tends to make the code less readable and maintainable. Explicitly marking the column accesses with : or $ makes it easier to see where we're referring to a column, vs. where we're accessing a variable for its own value.

    (There do exist packages like Tidier.jl which trade off being somewhat more magical for the sake of convenience. For example, @rsubset df :Med1 == 0 would be written as @filter df Med1 == 0 in Tidier, with the name "Med1" automatically referring to the column. This is an exception that's explicitly intended to follow R's conventions rather than Julia's.)

    Giving column access its own syntax also makes it easier to use normal variables in your code, for example:

    x, y = some_calculation() 
    @rsubset df $st == x + y
    

    Here, because column access has special syntax ($), there's no confusion about x or y - they refer to the normal variables x and y as expected.

    (In contrast, since Tidier doesn't require special syntax for column names, it goes the other way and has special syntax for referring to normal variables, for example @filter df Med1 == !!x + !!y.)

    So ultimately, it's a design decision by DataFramesMeta developers, not something intrinsic to Julia metaprogramming.
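
    For the use case in the question (iterating over numbered columns such as "Med1", "Med2"), one possible approach is to build each Symbol in a loop and interpolate it with $. This is only a sketch: the Med2 column, its values, and the zero_counts dictionary are made up for illustration.

    ```julia
    using DataFrames, DataFramesMeta

    # Hypothetical data: Med2 and its values are invented for this example
    df = DataFrame(ID = 1:5,
                   Med1 = [0, 1, 0, 1, 0],
                   Med2 = [1, 1, 0, 1, 1])

    # Number of rows where each MedN column equals 0
    zero_counts = Dict{Symbol,Int}()

    for i in 1:2
        col = Symbol("Med", i)        # builds :Med1, then :Med2
        sub = @rsubset df $col == 0   # $col is replaced by the column named in col
        zero_counts[col] = nrow(sub)
    end

    zero_counts
    ```

    The same loop body works for any number of MedN columns, as long as each constructed name actually exists in the data frame.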