
Text files to test the functionality of a search engine


To practice for an upcoming programming contest, I'm building a very basic search engine in C# that takes a query from the user (e.g. "Markov Decision Process") and searches a handful of files to find the one most relevant to the query.

The application seems to be working (I used a term-document matrix algorithm).

But now I'd like to test the search engine to see whether it really works properly. I tried saving a couple of Wikipedia articles as .txt files and testing against them, but I can't tell whether it's fast enough (even with some timers).

My question is: is there a website that provides a set of files to test a search engine on, along with the expected results?

I'm testing with common sense so far, but it would be great to be sure of my results.

Also, how can I get a collection of .txt files (maybe 10 000+ files) about various subjects to see if my application runs fast enough?

I tried copying a few Wikipedia articles, but doing that by hand would take far too long. I also thought about writing a script of some sort to do it for me, but I really don't know how to do that.

So, where can I find a lot of files on distinct subjects?

Otherwise, how can I benchmark my application?

Note: I guess a simple big .txt file where each line represents a "file" about a subject would do the job too.
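
For the speed check, a timing harness over such a one-document-per-line file could look roughly like the sketch below. It is only a sketch under stated assumptions: `SearchBenchmark`, `NaiveSearch`, and `AverageMilliseconds` are hypothetical names, and `NaiveSearch` is a stand-in that counts matching words; it is not the term-document matrix implementation and should be replaced by a call into the real engine.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

public static class SearchBenchmark
{
    // Placeholder engine (assumption): returns the index of the document
    // containing the most words from the query, or -1 if none match.
    public static int NaiveSearch(IReadOnlyList<string> docs, string query)
    {
        char[] space = { ' ' };
        string[] terms = query.ToLowerInvariant()
                              .Split(space, StringSplitOptions.RemoveEmptyEntries);
        int best = -1, bestScore = 0;
        for (int i = 0; i < docs.Count; i++)
        {
            string[] words = docs[i].ToLowerInvariant()
                                    .Split(space, StringSplitOptions.RemoveEmptyEntries);
            int score = words.Count(w => terms.Contains(w));
            if (score > bestScore) { bestScore = score; best = i; }
        }
        return best;
    }

    // Runs every query `runs` times and returns the mean time per query in ms.
    public static double AverageMilliseconds(
        Func<string, int> search, IReadOnlyList<string> queries, int runs)
    {
        search(queries[0]); // warm-up call so JIT compilation is not timed
        Stopwatch watch = Stopwatch.StartNew();
        for (int r = 0; r < runs; r++)
            foreach (string q in queries)
                search(q);
        watch.Stop();
        return watch.Elapsed.TotalMilliseconds / (runs * queries.Count);
    }
}
```

With a hypothetical `corpus.txt`, usage would be along the lines of `var docs = File.ReadAllLines("corpus.txt");` followed by `SearchBenchmark.AverageMilliseconds(q => SearchBenchmark.NaiveSearch(docs, q), queries, 10)`. Averaging over several runs and many queries smooths out the noise that makes a single timer reading hard to interpret.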


Solution

  • You can get Wikipedia pages by writing a recursive function that loads the HTML of every page linked from a chosen starting page.

    If you have some experience with C#, this should help you: http://www.csharp-station.com/HowTo/HttpWebFetch.aspx

    Then loop through the HTML, collect every occurrence of the text "<a href=\"", and recursively call that method on each linked URL. You should also use a counter to limit the number of recursions.

    Also, to prevent an OutOfMemoryException, pause the method at multiples of some number of iterations and write everything collected so far to a file. Then clear the old data from the string.
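
The steps above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation. It swaps the recursion for an explicit queue with a page counter (same limit, but no deep call stack) and uses `HttpClient` rather than the older API from the linked tutorial. The class name `WikiCrawler`, the parameters `maxPages` and `flushEvery`, the user-agent string, the `/wiki/` link filter, and the crude regex-based tag stripping are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net.Http;
using System.Text;
using System.Text.RegularExpressions;

public static class WikiCrawler
{
    // Captures the href value of anchor tags.
    private static readonly Regex HrefPattern =
        new Regex("<a\\s[^>]*?href=\"([^\"]+)\"", RegexOptions.IgnoreCase);

    // Returns absolute URLs of links that look like articles.
    // Assumption: articles live under /wiki/ and a colon marks a
    // non-article namespace such as File: or Help:.
    public static List<string> ExtractLinks(string html, Uri baseUri)
    {
        var links = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            string href = m.Groups[1].Value;
            if (!href.StartsWith("/wiki/", StringComparison.Ordinal) || href.Contains(":"))
                continue;
            // GetLeftPart drops the query string and any #fragment.
            links.Add(new Uri(baseUri, href).GetLeftPart(UriPartial.Path));
        }
        return links;
    }

    // Crude text extraction: replaces tags with spaces and collapses whitespace.
    // Script and style contents are not removed.
    public static string StripTags(string html)
    {
        string noTags = Regex.Replace(html, "<[^>]+>", " ");
        return Regex.Replace(noTags, "\\s+", " ").Trim();
    }

    public static void Crawl(string startUrl, int maxPages, int flushEvery, string outputPath)
    {
        using (var client = new HttpClient())
        {
            // Hypothetical user-agent value identifying the crawler.
            client.DefaultRequestHeaders.UserAgent.ParseAdd("SearchEngineTestCrawler/0.1");
            var queue = new Queue<string>();
            var seen = new HashSet<string> { startUrl };
            var buffer = new StringBuilder();
            queue.Enqueue(startUrl);
            int fetched = 0;

            while (queue.Count > 0 && fetched < maxPages)
            {
                string url = queue.Dequeue();
                string html;
                try { html = client.GetStringAsync(url).GetAwaiter().GetResult(); }
                catch (HttpRequestException) { continue; } // skip pages that fail to load

                fetched++;
                buffer.AppendLine(StripTags(html)); // one page per output line
                foreach (string link in ExtractLinks(html, new Uri(url)))
                    if (seen.Count < maxPages && seen.Add(link))
                        queue.Enqueue(link);

                // Write out and clear the buffer periodically to bound memory use.
                if (fetched % flushEvery == 0)
                {
                    File.AppendAllText(outputPath, buffer.ToString());
                    buffer.Clear();
                }
            }
            File.AppendAllText(outputPath, buffer.ToString()); // final partial batch
        }
    }
}
```

A call such as `WikiCrawler.Crawl("https://en.wikipedia.org/wiki/Markov_decision_process", 500, 50, "corpus.txt")` would append up to 500 pages to the output file, one per line. Adding a short delay between requests would be polite to the server when crawling more than a few pages.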