Search code examples
c#algorithmlossless-compression

Algorithm to break down a list of item into the minimum number of groups that can be used to represent it


I'm sorry for the broad question, but I'm struggling on how to start with this one. I have a vague idea what I'm actually trying to do is something a compression algorithm does, but really I'm looking for someone to point me in the direction of a writeup I can use to base this off of.

Basically I have file. This file is made up of multiple lines. Each line is a comma separated list of items. I want to generate a new list of the minimum number of subsets of these list that can be used to build up the original file.

An example will probably illustrate this better, at its most basic if I have an input file of

A,B,C,D
E,B,C,D
F,B,C,D
G,B,C,H
A,I,J,D
1,2,3,4,5,6,7,8
9,5,6,7,2,3,4,1

my output should look like

A
B,C
D
E
F
G
I,J
1
2,3,4
5,6,7
8
9

as this is the minimum number of unique groups I need to build up the original list.

I'm pretty certain I'm describing the job of a compression algorithm (though what I'm doing this for isn't for compression) but I can't find any documentation for implementing something like this on a string array in C# as I am trying to - the best I can find is pre-built libraries for compressing/decompressing a byte array which isn't want I want. The closest I feel I've got is googling "Huffman coding", which seems along the right tracks for what I am attempting to do but to be honest I'm struggling to understand how to implement anything based off of what I am reading.


Solution

  • Seems straight-forward enough:

    First try A, the first character seen. See if it's always followed by the same character. Answer AB, and AI, so just [A]. Add [A] to the list. Remove all [A] from the input.

    Try B (next character found). See if it's always followed by the same character. Answer BC always, so yes, now also see if C appears where it's not preceded by B, it doesn't, so add [BC] to the list. Remove all [BC] from the input.

    And so on.