Tags: algorithm, testing, heuristics

How to check user choice algorithm


I have an algorithm that chooses a list of items that should fit the user's tastes.
I'll skip the algorithm's details because of confidentiality issues...

Now, I'm trying to think of a way to check it statistically, with a group of people.
The way I'm checking it now is:

  1. The algorithm picks its best results per user.
  2. Shuffle the top 5 results together with the lowest 5 results.
  3. Have the person order the results by how much they liked them (0 = liked best, 9 = liked least).
  4. Compare the user's ordering to the algorithm's ordering (a sketch of this comparison is below).

I'm doing this because I figured that, to show the algorithm chooses good results, I need to mix in some bad results and show that the algorithm identifies them as bad as well.
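
For concreteness, here is a rough sketch in Python of how I picture steps 2-4; the Spearman rank correlation at the end is just one possible way to do the comparison, not necessarily the right one:

    import random

    def build_test_list(ranked_items, k=5):
        """Mix the algorithm's top-k picks with its bottom-k picks, then shuffle."""
        candidates = ranked_items[:k] + ranked_items[-k:]
        random.shuffle(candidates)
        return candidates

    def spearman_rank_correlation(order_a, order_b):
        """Spearman correlation between two orderings of the same items (no ties)."""
        n = len(order_a)
        rank_b = {item: i for i, item in enumerate(order_b)}
        d_squared = sum((i - rank_b[item]) ** 2 for i, item in enumerate(order_a))
        return 1 - (6 * d_squared) / (n * (n ** 2 - 1))

    # algorithm_order: the 10 mixed items as ranked by the algorithm (best first)
    # user_order:      the same 10 items as ordered by the test person (0 = liked best)
    # A correlation near +1 would mean the person agrees with the algorithm.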

So, what I'm asking is:

Is shuffling top results with low results a good idea?

And if not, do you have an idea of how to get good statistics on how well the algorithm matches user preferences (we have users who can choose items)?


Solution

  • First ask yourself:

    What am I trying to measure?

    Not to rag on the other submissions here, but while mjv and Sjoerd's answers offer some plausible heuristic reasons why what you are trying to do may not work as you expect, they are not constructive in the sense that they do not explain why your experiment is flawed and what you can do to improve it. Before either of these issues can be addressed, you need to define what you hope to measure, and only then should you go about devising an experiment.

    Now, I can't say for certain what would constitute a good metric for your purposes, but I can offer you some suggestions. As a starting point, you could try using a precision vs. recall graph:

    http://en.wikipedia.org/wiki/Precision_and_recall

    This is a standard technique for assessing the performance of ranking and classification algorithms in machine learning and information retrieval (i.e. web search). If you have an engineering background, it may help to know that precision/recall generalizes the notion of precision/accuracy:

    http://en.wikipedia.org/wiki/Accuracy_and_precision

    Now let us suppose that your algorithm does something like this: it takes as input some prior data about a user, then returns a ranked list of other items that the user might like. For example, your algorithm could be a web search engine and the items pages, or a movie recommender where the items are movies. This sounds pretty close to what you are trying to do now, so let us continue with this analogy.

    Then the precision of your algorithm's results on the first n is the fraction of your top n recommendations that the user actually liked:

        precision = #(items user actually liked out of top n) / n
    

    And the recall is the fraction of all the items the user actually likes that you correctly marked as liked:

        recall = #(items correctly marked as liked) / #(items user actually likes)
    
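
    To make these two definitions concrete, here is a minimal sketch in Python; the function and argument names are just illustrative, and you would substitute whatever item identifiers your system uses:

        def precision_at_n(ranked_items, liked_items, n):
            """Fraction of the top-n recommendations the user actually liked."""
            top_n = ranked_items[:n]
            return sum(1 for item in top_n if item in liked_items) / n

        def recall_at_n(ranked_items, liked_items, n):
            """Fraction of everything the user likes that shows up in the top n."""
            top_n = ranked_items[:n]
            return sum(1 for item in top_n if item in liked_items) / len(liked_items)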

    Ideally, one would want to maximize both of these quantities, but they are in a certain sense competing objectives. To illustrate this, consider a few extremal situations: For example, you could have a recommender that returns everything, which would have perfect recall, but very low precision. A second possibility is to have a recommender that returns nothing or only one sure-fire hit, which would have (in a limiting sense) perfect precision, but almost no recall.

    As a result, to understand the performance of a ranking algorithm, people typically look at its precision vs. recall graph. This is just a plot of the precision against the recall as the number of items returned is varied:

    (Image of example precision vs. recall curves taken from the following tutorial, which is worth reading: http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-ranked-retrieval-results-1.html)
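
    If you want to produce such a graph for your own algorithm, a few lines of matplotlib are enough; the (recall, precision) pairs below are dummy values purely for illustration, and how to compute real ones is sketched at the end of this answer:

        import matplotlib.pyplot as plt

        # Dummy (recall, precision) pairs at increasing cutoffs -- illustration only
        points = [(0.1, 1.0), (0.2, 0.9), (0.4, 0.7), (0.7, 0.55), (1.0, 0.4)]
        recall, precision = zip(*points)

        plt.plot(recall, precision, marker="o")
        plt.xlabel("Recall")
        plt.ylabel("Precision")
        plt.title("Precision vs. recall as more items are returned")
        plt.show()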

    Now, to approximate a precision vs. recall curve for your algorithm, here is what you can do. First, return a large set of, say, n results as ranked by your algorithm. Next, get the user to mark which items they actually liked out of those n results. This trivially gives us enough information to compute the precision at every cutoff k ≤ n (since we know how many of the top k were liked). We can also compute the recall (restricted to this set of documents) by dividing by the total number of items the user liked in the entire set. Thus, we can plot a precision-recall curve for this data. Now, there are fancier statistical techniques for estimating this using less work, but I have already written enough. For more information please check out the links in the body of my answer.
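
    As a rough sketch of that procedure (assuming the user liked at least one of the n items shown, and with recall restricted to those n items):

        def precision_recall_curve(ranked_items, liked_items):
            """(precision, recall) at every cutoff k = 1..n for one ranked list."""
            total_liked = len(liked_items)  # assumes the user liked at least one item
            hits = 0
            points = []
            for k, item in enumerate(ranked_items, start=1):
                if item in liked_items:
                    hits += 1
                points.append((hits / k, hits / total_liked))
            return points

        # Example: 6 items shown, the user marked 3 of them as liked.
        ranked = ["a", "b", "c", "d", "e", "f"]
        liked = {"a", "c", "f"}
        for k, (p, r) in enumerate(precision_recall_curve(ranked, liked), start=1):
            print(f"k={k}  precision={p:.2f}  recall={r:.2f}")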