The problem is explained in the following article.
I have a list of sentences, for example a list of 1000 sentences.
I would like to find a combination of sentences to match/'match closest' a certain frequency table:
[a:100, b:80, c:90, d:150, e:100, f:100, g:47, h:10 ..... z:900]
I thought about finding all possible combinations from the sentences list by using combinations like in here (so comb(1000, 1); to comb(1000, 1000); ) and then compare every combination with the frequency table, so that distance is minimum. So sum all frequency tables from a possible combination and compare this sum with the target, the combination with the smallest difference with the target should be recorded. There could be multiple combinations that match closest.
The problem is that the calculation of all combinations takes way too long to complete, apparently couple of days. Is there a known algorithm that could solve this efficiently? Ideally couple of minutes maximum?
Input sentences:
More RVs were seen in the storage lot than at the campground.
She did her best to help him. There have been days when I wished to be separated from my body, but today wasn’t one of those days.
The swirled lollipop had issues with the pop rock candy.
The two walked down the slot canyon oblivious to the sound of thunder in the distance.
Acres of almond trees lined the interstate highway which complimented the crazy driving nuts.
He is no James Bond; his name is Roger Moore.
The tumbleweed refused to tumble but was more than willing to prance.
She was disgusted he couldn’t tell the difference between lemonade and > limeade.
He didn’t want to go to the dentist, yet he went anyway.
Find combination of sentences that match the following frequency table closest:
[a:5, b:5, c:5, d:5, e:5, f:5, g:5, h:5 ..... z:5]
Example:
Frequency table of sixth sentence
He is no James Bond; his name is Roger Moore.
is [a:2, e:5, g:1, h:1, i:3, j:1, m:3, n:3, o:5, r:3, s:4]
Frequency table takes upper and lower equal and excludes special characters.
Whenever someone find a combination of sentences with 3c, 3a, 3b, 3d or 30c, 30a, 30b, 30d from the following sentences with 5% above or below it can be solved.
S1: aaaaaaaaaaaaaaaaaa bbbbbb c
S2: aaaaaaaa bbbbbbbb d
S3: aaaaaaaaaaa bbbbbbbbb c dd
S4: aaaaaaaaaa bbbbbbbb
Be realistic. There is No solution, not NP-hard nor NP-complete, No solution. The number of occurrence of letters in a sentence (for example vowels like i or a) is not equal to others (like x or w). We can just find best matches like the code provided here or change the requirement. I tried to solve this with KnapSack algorithm and Euclidean distance and Standard deviation, but none gives me such answer since there is no sentence with the same size of letters.