
Get the unique rows with the number of repetitions, if any


I want to get the index of each unique row together with its number of duplicates, as a tuple:

data=[1 2;2 3; 1 3;1 3]

unique_data=[findall(==(r),eachrow(data)) for r in unique(eachrow(data))]

unique_number=collect(zip(first.(unique_data), length.(unique_data).-1))

This gives the right answer:

3-element Vector{Tuple{Int64, Int64}}:
 (1, 0)
 (2, 0)
 (3, 1)

I want to modify the code so that rows containing the same elements in a different order also count as duplicates. For example, with data=[1 2;2 3; 1 3;3 1] I should get the same result:

3-element Vector{Tuple{Int64, Int64}}:
 (1, 0)
 (2, 0)
 (3, 1)

Solution

  • What you refer to as the "index of each row" is fragile. I would recommend using the contents of the row as the identifier instead. The easiest way is to sort each row before matching, so that [1, 3] and [3, 1] compare equal:

    julia> using StatsBase
    
    julia> countmap(sort.(eachrow(data)))
    Dict{Vector{Int64}, Int64} with 3 entries:
      [2, 3] => 1
      [1, 3] => 2
      [1, 2] => 1
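
If adding StatsBase is not desirable, the same counts can be sketched with a plain Dict from Base. This is only an illustration of the idea of counting sorted rows, not how countmap is implemented; the names counts and key are chosen here for the example.

```julia
data = [1 2; 2 3; 1 3; 3 1]

# Maps each sorted row to the number of times it occurs.
counts = Dict{Vector{Int}, Int}()

for row in eachrow(data)
    key = sort(row)                        # [3, 1] and [1, 3] both become [1, 3]
    counts[key] = get(counts, key, 0) + 1  # start at 0 for rows not seen yet
end
```

As with countmap, the Dict has no guaranteed iteration order, so it answers "how many times does this row occur" but not "where does it first occur".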
    

    A fancier way would be:

    julia> using DataFrames
    
    julia> df = DataFrame(original=collect(eachrow(data)))
    4×1 DataFrame
     Row │ original
         │ SubArray…
    ─────┼───────────
       1 │ [1, 2]
       2 │ [2, 3]
       3 │ [1, 3]
       4 │ [3, 1]
    
    julia> df.sorted = sort.(df.original)
    4-element Vector{Vector{Int64}}:
     [1, 2]
     [2, 3]
     [1, 3]
     [1, 3]
    
    julia> gdf = groupby(df, :sorted)
    GroupedDataFrame with 3 groups based on key: sorted
    First Group (1 row): sorted = [1, 2]
     Row │ original   sorted
         │ SubArray…  Array…
    ─────┼───────────────────
       1 │ [1, 2]     [1, 2]
    ⋮
    Last Group (2 rows): sorted = [1, 3]
     Row │ original   sorted
         │ SubArray…  Array…
    ─────┼───────────────────
       1 │ [1, 3]     [1, 3]
       2 │ [3, 1]     [1, 3]
    
    julia> [(rowid=first(sdf.original), rowlocs=parentindices(sdf)[1], entries=length(parentindices(sdf)[1])) for sdf in gdf]
    3-element Vector{NamedTuple{(:rowid, :rowlocs, :entries), Tuple{SubArray{Int64, 1, Matrix{Int64}, Tuple{Int64, Base.Slice{Base.OneTo{Int64}}}, true}, Vector{Int64}, Int64}}}:
     (rowid = [1, 2], rowlocs = [1], entries = 1)
     (rowid = [2, 3], rowlocs = [2], entries = 1)
     (rowid = [1, 3], rowlocs = [3, 4], entries = 2)
    

    where you get the reference row data, all row numbers where a given row is found, and the number of times it occurs (so the number of duplicates is entries minus one).
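
To get exactly the tuple format asked for in the question, the sorting idea can also be applied directly to the original one-liner. The following is a minimal sketch using only Base; the variable names sorted_rows, groups and result are chosen here for illustration.

```julia
data = [1 2; 2 3; 1 3; 3 1]

# Sort each row so that [1, 3] and [3, 1] compare equal.
sorted_rows = sort.(eachrow(data))

# For each distinct sorted row, find the indices of all rows that match it.
groups = [findall(==(r), sorted_rows) for r in unique(sorted_rows)]

# Pair the first index of each group with the number of additional occurrences.
result = collect(zip(first.(groups), length.(groups) .- 1))
# result == [(1, 0), (2, 0), (3, 1)]
```

Because unique keeps the order of first appearance, the tuples come out in row order. The findall call rescans all rows for every distinct row, which is fine for small matrices; for large inputs a single pass with a Dict would likely scale better.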