I have an array of integer say
theIndex = [ 1 2 6 7 17 2]
I have a dataframe with one column dataset[:id] containing integers say
dataset = DataFrame(id=[ 1, 1, 2, 2, 3, 3, 3, 4, 4, 4])
I want to select all observations in dataset that belongs to the index. and if they appear twice (or more) in the index, I want to select them twice (or more)
At the moment, I am doing it the dumb way.
theIndex = [ 1 2 6 7 17 2]
dataset = DataFrame(id=[ 1, 1, 2, 2, 3, 3, 3, 4, 4, 4])
dataset2 = DataFrame(id=Int64[])
for ii1=1:size(theIndex,2)
for ii2=1:size(dataset[:id],1)
any(i->i.==dataset[ii2,:id],theIndex[ii1]) ?
push!(dataset2,dataset[ii2,:id]) : nothing
end
end
any more elegant solution?
Essentially, the question wants to calculate a SQL JOIN between theIndex
and dataset
. Unfortunately, this functionality is not fully implemented internally by DataFrames. So here is a quick (and efficient) simulation of a JOIN for this purpose:
using DataStructures
sort!(dataset, cols=:id]
j = 1
newvec = Vector{Int}()
for (val,cnt) in SortedDict(countmap(theIndex))
while j<=nrow(dataset)
dataset[j,:id] > val && break
dataset[j,:id] == val && append!(newvec,fill(j,cnt))
j += 1
end
end
dataset2 = dataset[newvec,:]
DataStructures package is used for the SortedDict. This implementation is should be more efficient than other multi-loop approaches.