
Checking whether the value of a variable belongs to a set (bootstrap)


I have an array of integers, say

theIndex = [ 1 2 6 7 17 2]

I have a DataFrame with one column, dataset[:id], containing integers, say

dataset = DataFrame(id=[ 1, 1, 2, 2, 3, 3, 3, 4, 4, 4])

I want to select all observations in dataset whose id belongs to the index. If a value appears twice (or more) in the index, I want to select the matching observations twice (or more).

At the moment, I am doing it the dumb way.

theIndex = [ 1 2 6 7 17 2]
dataset = DataFrame(id=[ 1, 1, 2, 2, 3, 3, 3, 4, 4, 4])
dataset2 = DataFrame(id=Int64[])
for ii1=1:size(theIndex,2)
    for ii2=1:size(dataset[:id],1)
        any(i->i.==dataset[ii2,:id],theIndex[ii1]) ? 
        push!(dataset2,dataset[ii2,:id]) : nothing
    end
end

Is there a more elegant solution?


Solution

  • Essentially, the question asks for a SQL JOIN between theIndex and dataset. Unfortunately, DataFrames does not fully implement this functionality internally, so here is a quick (and efficient) simulation of a JOIN for this purpose:

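    using DataFrames, DataStructures
    using StatsBase   # provides countmap
    
    # The merge below relies on the ids being in ascending order.
    sort!(dataset, cols=:id)
    j = 1                    # current row in the sorted dataset
    newvec = Vector{Int}()   # row numbers to keep, with repeats
    # Visit each distinct index value in ascending order, with its count.
    for (val,cnt) in SortedDict(countmap(theIndex))
        while j<=nrow(dataset)
            # Past val: move on to the next index value.
            dataset[j,:id] > val && break
            # Match: keep row j once per occurrence of val in theIndex.
            dataset[j,:id] == val && append!(newvec,fill(j,cnt))
            j += 1
        end
    end
    dataset2 = dataset[newvec,:]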
    

    The DataStructures package is used for SortedDict. This implementation should be more efficient than other multi-loop approaches, because after sorting it makes a single pass over the dataset.
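
For comparison, here is a minimal sketch of the same join using only Base Julia, with no sorting and no extra packages. It is not the approach from the answer above: it swaps the sorted merge for a dictionary lookup. The helper name `rows_for_index` is hypothetical, and the ids are assumed to be available as a plain integer vector rather than a DataFrame column.

```julia
# Hypothetical helper (not part of DataFrames): returns the row numbers of
# `ids` selected by `index`, repeating a row once for every occurrence of
# its id in `index`.
function rows_for_index(ids::AbstractVector{<:Integer}, index)
    # Map each id to the list of rows where it occurs.
    rows_by_id = Dict{Int,Vector{Int}}()
    for (row, id) in enumerate(ids)
        push!(get!(() -> Int[], rows_by_id, id), row)
    end
    # Walk the index in order; values absent from `ids` contribute nothing.
    selected = Int[]
    for val in index
        append!(selected, get(rows_by_id, val, Int[]))
    end
    return selected
end

ids = [1, 1, 2, 2, 3, 3, 3, 4, 4, 4]
theIndex = [1 2 6 7 17 2]

rows = rows_for_index(ids, theIndex)   # [1, 2, 3, 4, 3, 4]
selected_ids = ids[rows]               # [1, 1, 2, 2, 2, 2]
```

With a DataFrame, the returned row numbers could presumably be used the same way as newvec above, that is, dataset[rows, :]. Note that the rows come out in the order of theIndex, so the two copies of the id 2 rows are not adjacent as they are in the sorted version.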