I have a paragraph of text scrambled by columns of two chars. The purpose of my assignment is to unscramble it:
|de| | f|Cl|nf|ed|au| i|ti| |ma|ha|or|nn|ou| S|on|nd|on|
|ry| |is|th|is| b|eo|as| | |f |wh| o|ic| t|, | |he|h |
|ab| |la|pr|od|ge|ob| m|an| |s |is|el|ti|ng|il|d |ua|c |
|he| |ea|of|ho| m| t|et|ha| | t|od|ds|e |ki| c|t |ng|br|
|wo|m,|to|yo|hi|ve|u | t|ob| |pr|d |s |us| s|ul|le|ol|e |
| t|ca| t|wi| M|d |th|"A|ma|l |he| p|at|ap|it|he|ti|le|er|
|ry|d |un|Th|" |io|eo|n,|is| |bl|f |pu|Co|ic| o|he|at|mm|
|hi| | |in| | | t| | | | |ye| |ar| |s | | |. |
My current approach to find the right order of columns is trying to recursively find each column's best position according to a word occurrence count criteria.
The pseudo-code of the algorithm's core I have in mind would be:
function unscramble(scrambledMatrix,indexOfColumnIveJustMoved)
for each column on scrambledMatrix as currentIndex=>currentColumn
if (currentIndex!=indexOfColumnIveJustMoved)
maxRepeatedWords=0;maxIndex=0;
for (i=0;i<numberOfColumnsOfScrambledMatrix;i++)
repWordsCount=countRepWords(moveFromToOn(currentIndex,i,scrambledMatrix))
if (maxRepeatedWords<repWordsCount)
maxRepeatedWords=repWordsCount;
maxIndex=i;
endif
endfor
if (maxIndex!=currentIndex)
return unscramble(moveFromToOn(currentIndex,maxIndex,scrambledMatrix),maxIndex); //recursive call
endif
endif
endfor
return(scrambledMatrix); //returns the unscrambled matrix;
endfunction
The algorithm stops when no column is moved after iterating on each one. I'm guessing it should work for any language (though I'm only interested on a solution for english) as long as the writing is based on words formed by letters and the sample is big enough.
Any suggestions on any other approaches or improvements? I would like to know the best solution for this problem (probably a dictionary based one looking for occurrences of common words instead? How about rebuilding the algorithm to avoid recursion, would it be much faster?).
A couple more ideas:
One way, although generally this approach is one of the most time consuming- is to use a genetic algorithm.
Lets say the current default arrangement of columns is
|de| | f|Cl|nf|ed|au| i|ti| |ma|ha|or|nn|ou| S|on|nd|on|
[0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18] <--- define this to be a chromosome
You could create a population of 100, 1000 w/e number of chromosomes that start off randomly assigned (keep in mind the 'random' assignment cannot have duplicate numbers and must be valid)
Then run a fitness function on each assignment, or multiple fitness functions if you would like to break it down that way. Start off with one super fitness function that assigns a fitness value to each assignment.
Only take the top 50% of chromosomes and move them onto the next generation, where you create 'children' chromosomes based on your choice of a crossover function and a probability of mutation- for this type of problem I recommend a very light crossover function (or none...) and a decent mutation rate. If you can find the columns that do not contribute to words/ the fitness function much then maybe flip those around.
Keep doing this for many generations and see how the top rated assignment looks like each generation, you would expect the valid to plateau at some point and that would be your correct assignment.
This approach could only be mildly better than brute force with fitness function, but it could also turn out to be pretty good.
One last idea: try to abstract away from 'first column, second column' and assign the columns into chunks that form words, because just because [1,4,6....] turns out to form "the" "him" "her" etc, doesnt mean it belongs right in the beginning.
I have a different approach that I kind of like better, I think a Dynamic Algorithm would be better suited for this.
EDIT: Another Approach
Again based on the dictionary approach, but you would focus on choosing the first few columns before the rest, and if it falls apart and you are not getting words in any particular row it means your earlier selections were wrong and you will need to backtrack.
Select row 1.. well chances are there are not too many words here but you will narrow yourself down to a subset of your dictionary- the subset that has words that start with the chars in your first column.
Now you have a row that works, select an adjacent row to the right. If it either forms full words or still has valid words possible (given no space is present signifying end of word). Repeat.
If no adjacent rows are possible given your previous choices, backtrack one row to the left, and dont select the same thing again.
The weakness here is that your dictionary would need to contain all the words in your sentence, as well as all variants of the words. You might need to come up with a heuristic similar to a fitness function that says, "90% of words match, so this is still a valid attempt..." or something of that sort.