Is there an algorithm for choosing a few strings from a list so that the number of strings equal the number of different letters in them?

Edit2: I think the solution of David Eisenstat works but I will check it before I call the question solved.

Example list of strings:

1.) "a"

2.) "ab"

3.) "bc"

4.) "dc"

5.) "efa"

6.) "ef"

7.) "gh"

8.) "hi"

You can choose number 1.) there's 1 string and 1 letter in it: "a" You can also choose 1.) and 2.) these are 2 strings with only two different letters in them "a" and "b"

other valid string combinations:

1.) 2.) 3.)

1.) 5.) 6.)

there's no valid combination with "h" (it would be ideal if cases like this could be proven however you can assume the program only needs to work when there's a valid answer)

There could be an extra condition that the strings you choose must include one specified letter, however simply finding all the possible combinations would solve the problem just as well. eg. specified letter "c" the only solution in this case would be: 1.) 2.) 3.)

[optional information] The purpose of this: I want to make a program which can choose from a big list of equations (probably around 100) which ones can be used to solve for a variable. Each equation gets one string, each letter in the string representing one unknown. The list of equations are all different eg. cannot be derived from each other, so you need as many equations as many unknowns there are in them. Solving for the unknowns will be done in a CAS, so you don't need to worry about it. However I believe the CAS (Maxima) might have a limit on how many equations it can solve simultaneously and it might be too slow if you give it too many unnecessary equations at a time.

As a start I would use an algorithm to reduce the number of strings just to make it faster. First all strings containing specified letter are in the reduced list, then all strings containing the letters from the strings in the reduced list are part of the reduced list until none is added. eg reduced list of "g" would be 7.) "gh" and 8.) "hi" This would only remove some unnecessary strings, but the task would remain the same with the rest.

I think this can be solved by taking away unnecessary strings from the reduced list until all the remaining are needed, however I don't know how to explicitly define which strings would be unnecessary (except for those mentioned in the previous paragraph).

If you work with the extra condition: This is an optimization task. I don't need a perfect solution, only an optimal solution. The program doesn't need to find the absolute minimum number of strings that give a solution. Having a few extra strings in the solution would probably only slow the computer down, but it would be acceptable.

Edit: Optional clarification about the meaning of the strings: Each letter in a string represent an unknown in an equation so the equation a=2 would be represented by "a" because that's the only unknown. The equation a+b=0 would be represented by "ab" and b^2-c=0 by "bc"

Solution

I'm not sure what to call this problem. It seems NP-hard, so I'm going to suggest an integer programming formulation, which can be attacked by an off-the-shelf solver.

Let x_i be a 0-1 variable indicating whether equation i is included in the output. Let y_j be a 0-1 variable indicating whether variable j is included in the output. We have constraints

for all equations i, for all variables j in equation i, y_j - x_i >= 0.

We need as many equations as variables in the output.

(sum over all equations i of x_i) - (sum over all variables j of y_j) = 0

As you point out, the empty set needs specifically to be disallowed. Let k be a variable that must appear in the output.

sum over all equations i containing variable k of x_i >= 1

Naturally, the objective is

minimize sum over all equations i of x_i.