To practice for an upcoming programming contest, I'm building a very basic search engine in C# that takes a query from the user (e.g. "Markov Decision Process") and searches through a couple of files to find the one most relevant to the query.
The application seems to be working (I used a term-document matrix algorithm).
But now I'd like to test the search engine to see whether it really works properly. I tried saving a couple of Wikipedia articles as .txt files and running queries against them, but I can't tell whether it's fast enough (even with some timers).
My question is: is there a website that provides a set of files for testing a search engine, along with the logically expected results? So far I've just been judging the output by common sense, but it would be great to be sure of my results.
Also, how can I get a collection of .txt files (maybe 10,000+) on various subjects, to see whether my application runs fast enough?
I tried copying a few Wikipedia articles by hand, but that would take far too much time. I also thought about writing some sort of script to do it for me, but I really don't know how to do that.
So, where can I find a lot of files on distinct subjects? Alternatively, how can I benchmark my application?
Note: I guess a single big .txt file where each line represents a "file" about a subject would do the job too.
You can get Wikipedia pages by writing a recursive function that downloads the HTML of every page linked from one starting page.
If you have some experience with C#, this tutorial should help you: http://www.csharp-station.com/HowTo/HttpWebFetch.aspx
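For the download step, here is a minimal sketch along the lines of that tutorial (the Wikipedia URL is only an example):

```csharp
using System;
using System.IO;
using System.Net;

class Fetcher
{
    // Download the raw HTML of a single page.
    public static string FetchHtml(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        string html = FetchHtml("https://en.wikipedia.org/wiki/Markov_decision_process");
        Console.WriteLine(html.Length + " characters downloaded");
    }
}
```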
Then loop through the downloaded text, collect every instance of the string "<a href=\"", and call the method recursively on each link you find. You should also use a counter to limit the number of recursions.
Also, to prevent OutOfMemoryException, you should stop every fixed number of iterations, write everything collected so far to a file, and then flush the old data from the string.
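Putting those pieces together, a rough sketch could look like the following. The start URL, the page and depth limits, the flush interval, and the output file name are all arbitrary assumptions, and the link filtering is deliberately simplistic:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text;

class WikiCollector
{
    const int MaxPages = 1000;   // assumed cap on the total number of pages fetched
    const int MaxDepth = 2;      // assumed cap on recursion depth
    const int FlushEvery = 50;   // assumed: write the buffer to disk every 50 pages

    static int pagesFetched = 0;
    static readonly HashSet<string> visited = new HashSet<string>();
    static readonly StringBuilder buffer = new StringBuilder();

    static void Main()
    {
        Crawl("https://en.wikipedia.org/wiki/Markov_decision_process", 0);
        Flush(); // write whatever is still buffered
    }

    static void Crawl(string url, int depth)
    {
        if (depth > MaxDepth || pagesFetched >= MaxPages || !visited.Add(url))
            return;

        string html;
        try
        {
            using (var client = new WebClient())
                html = client.DownloadString(url);
        }
        catch (WebException)
        {
            return; // skip pages that fail to download
        }

        pagesFetched++;
        buffer.AppendLine(html); // collect the raw HTML; strip the tags for a real corpus

        // Flush periodically so the buffer does not grow without bound.
        if (pagesFetched % FlushEvery == 0)
            Flush();

        // Collect every instance of "<a href=\"" and recurse into the linked pages.
        int index = 0;
        while ((index = html.IndexOf("<a href=\"", index)) != -1)
        {
            index += "<a href=\"".Length;
            int end = html.IndexOf('"', index);
            if (end == -1) break;

            string link = html.Substring(index, end - index);
            // Only follow internal article links (a simplifying assumption).
            if (link.StartsWith("/wiki/") && !link.Contains(":"))
                Crawl("https://en.wikipedia.org" + link, depth + 1);

            index = end;
        }
    }

    static void Flush()
    {
        File.AppendAllText("corpus.txt", buffer.ToString());
        buffer.Clear();
    }
}
```

If your search engine expects one document per line, you could strip the HTML tags and newlines from each page before appending it, so corpus.txt matches the "one line per subject" format you mentioned.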