Selecting Columns Based on Multiple Criteria in a Julia DataFrame


I need to select values from a single column in a Julia DataFrame based on multiple criteria sourced from an array. Context: I'm attempting to format the data from a large Julia DataFrame to support a PCA (principal component analysis), so I first split the original data into an analytical matrix and a label array. This is my code so far (it doesn't work):

### Initialize source dataframe for PCA
dfSource=DataFrame(
    colDataX=[0,5,10,15,5,20,0,5,10,30],
    colDataY=[1,2,3,4,5,6,7,8,9,0],
    colRowLabels=[0.2,0.3,0.5,0.6,0.0,0.1,0.2,0.1,0.8,0.0])
### Extract 1/2 of rows into analytical matrix
matSource=convert(Matrix,DataFrame(dfSource[1:2:end,1:2]))'
###  Extract last column as labels
arLabels=dfSource[1:2:end,3]
###  Select filtered rows
datGet=matSource[:,arLabels>=0.2 & arLabels<0.7][1,:]
print(datGet)

output> MethodError: no method matching...

At the last line before the print(datGet) statement, I get a MethodError indicating a method mismatch related to use of the & logic. What have I done wrong?


Solution

  • This code works:

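    using DataFrames

    dfSource=DataFrame(
        colDataX=[0,5,10,15,5,20,0,5,10,30],
        colDataY=[1,2,3,4,5,6,7,8,9,0],
        colRowLabels=[0.2,0.3,0.5,0.6,0.0,0.1,0.2,0.1,0.8,0.0])
    
    # Every other row of the two data columns, transposed to 2x5
    matSource=Matrix(dfSource[1:2:end,1:2])'
    
    # Labels for the same rows
    arLabels=dfSource[1:2:end,3]
    
    # Elementwise comparisons combined with .& give a Boolean column mask
    datGet=matSource[:,(arLabels.>=0.2) .& (arLabels.<0.7)][1,:]
    print(datGet)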
    

    output> [0,10,0]

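    Note the parentheses around (arLabels.>=0.2) and (arLabels.<0.7), as well as the dotted .>= and .< operators, which broadcast each comparison elementwise over the array and produce a Boolean mask. The parentheses matter because & binds more tightly than the comparison operators in Julia: without them, the original expression parses as arLabels >= (0.2 & arLabels) < 0.7, and there is no & method for a Float64 and an array, hence the MethodError. Finally, and most crucially (since it's the part most people miss), note the use of .& in place of plain &, so that the two masks are also combined elementwise. The dot makes all the difference!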
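
    For reference, here is a minimal sketch of the same masking step using only plain arrays (no DataFrames), so the broadcasting can be seen in isolation. The arrays are hard-coded copies of the rows that 1:2:end selects in the example above, and mask, mask2 and mask3 are names made up for this illustration. It shows three spellings that should all produce the same mask: explicit dots, the @. macro, and a scalar predicate mapped over the labels.

```julia
# Hard-coded copies of the rows kept by 1:2:end in the example above
arLabels = [0.2, 0.5, 0.0, 0.2, 0.8]
matSource = [0 10 5 0 10;
             1  3 5 7  9]

# Explicit dots: each comparison and the & are applied elementwise
mask = (arLabels .>= 0.2) .& (arLabels .< 0.7)

# @. adds the dots to every operator in the expression
mask2 = @. (arLabels >= 0.2) & (arLabels < 0.7)

# A scalar predicate (chained comparison) mapped over the labels
mask3 = map(x -> 0.2 <= x < 0.7, arLabels)

# First row of the matrix, restricted to the masked columns
datGet = matSource[1, mask]
println(datGet)   # prints [0, 10, 0]
```

    Indexing with matSource[1, mask] picks the first row and the masked columns in one step, which avoids building the intermediate 2x3 matrix that matSource[:, mask][1, :] creates.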