I'm currently developing a small web search engine, but I'm not sure how I am going to evaluate it. I understand that a search engine can be evaluated by its precision and recall. In a more "localized" information retrieval system, e.g., an e-library, I can calculate both of them because I know which documents are relevant to my query. But in a web-based information retrieval system, e.g., Google, it would be impossible to calculate recall because I do not know how many web pages are relevant. This means that the F-measure and other measurements that require the number of relevant pages cannot be computed.
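For example, in the closed-collection case I can compute these directly because the full relevant set is known. Here's roughly what I mean (the document IDs are just made up for illustration):

```python
# Set-based precision/recall/F1 for a closed collection where the
# complete set of relevant documents is known (IDs are illustrative).
relevant = {"doc1", "doc3", "doc7", "doc9"}    # ground-truth relevant docs for the query
retrieved = {"doc1", "doc2", "doc3", "doc5"}   # what my engine returned

true_positives = len(relevant & retrieved)
precision = true_positives / len(retrieved)    # 2/4 = 0.5
recall = true_positives / len(relevant)        # 2/4 = 0.5 -- needs the full relevant set!
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(precision, recall, f1)
```

On the open web I simply can't build that `relevant` set, which is what breaks recall for me.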
Is everything I wrote correct? Is web search engine evaluation limited to precision only? Are there any other measurements I could use to evaluate a web search engine (other than P@k)?
You're correct that precision and recall, along with the F-score / F-measure, are commonly used metrics for evaluating (unranked) retrieval sets when assessing search engine performance.
And you're also right about how difficult (or impossible) it is to determine recall over a corpus as huge as the entire web: you can't enumerate all the relevant pages, so recall-based measures like the F-measure can't be computed exactly, although precision can still be estimated over the results you actually return. For any search engine, small or large, I'd argue it's important to consider the role of human interaction in information retrieval: are your users more interested in a (ranked) list of relevant results that answers their information need, or would one "top" relevant result be enough to satisfy them? Check out the concept of "satisficing" as it pertains to information seeking for more on how users decide when their information needs have been met.
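If one good hit near the top is what matters most, a rank-based measure like reciprocal rank (or simply "success at 1") captures that directly. Here's a rough sketch in Python, assuming you have per-query relevance judgments for the ranked results your engine returns (document IDs are made up):

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1/rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Example: the first relevant hit is at rank 2, so RR = 0.5
print(reciprocal_rank(["d4", "d1", "d9"], {"d1", "d7"}))
```

Averaging this over a set of test queries gives you mean reciprocal rank (MRR), which rewards engines that put a satisfying answer as high as possible.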
Whether you use precision, recall, mean average precision (MAP), mean reciprocal rank (MRR), or any of the numerous other relevance and retrieval metrics really depends on what you're trying to assess about the quality of your search engine's results. I'd first try to figure out what sort of information needs the users of your small search engine have: will they be looking for a selection of relevant documents, or would one 'best' document serve their queries better? Once you understand how your users will actually use your search engine, you can use that information to decide which relevance model(s), and which evaluation measures, will give them results they deem most useful for their information-seeking needs.
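As a concrete starting point, precision@k and average precision only need relevance labels for the results you actually return (plus whatever judgment pool you collect), so they sidestep the unknown-recall problem. A minimal sketch, with illustrative judgments for two queries:

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are judged relevant."""
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / k

def average_precision(ranked_ids, relevant_ids):
    """Average of precision at each rank holding a relevant document,
    normalised by the number of *known* relevant documents (the judgment pool)."""
    hits, score = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            score += hits / rank
    return score / len(relevant_ids) if relevant_ids else 0.0

# Made-up runs: (ranked results, judged-relevant IDs) for two test queries.
runs = [
    (["d1", "d4", "d2"], {"d1", "d2"}),
    (["d9", "d3", "d5"], {"d3"}),
]
print([precision_at_k(r, rel, 3) for r, rel in runs])                 # ~[0.67, 0.33]
print(sum(average_precision(r, rel) for r, rel in runs) / len(runs))  # mean AP ~0.67
```

If a single best answer is what your users want, lean on MRR or success@1; if they browse a list, P@k, MAP, or nDCG (with graded judgments) will tell you more.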