In my webMethods application I need to implement search functionality, and I have done it with Lucene. But the search does not retrieve results when I search for a file whose title ends in something other than an alphabetic character, e.g. doc1.txt or new$.txt.
In the code below, when I print queryCmbd it prints Search Results>>>>>>>title:"doc1 txt" (contents:doc1 contents:txt). When I search for a string like doc.txt, the result is Search Results>>>>>>>title:"doc.txt" contents:doc.txt. What should be done in order to parse these kinds of strings (like doc1.txt, new$.txt)?
public java.util.ArrayList<DocNames> searchIndex(String querystr,
        String path, StandardAnalyzer analyzer) {
    String FIELD_CONTENTS = "contents";
    String FIELD_TITLE = "title";
    String queryFinal = querystr.replaceAll(" ", " AND ");
    String queryStringCmbd = FIELD_TITLE + ":\"" + queryFinal + "\" OR "
            + queryFinal;
    try {
        FSDirectory directory = FSDirectory.open(new File(path));
        Query q = new QueryParser(Version.LUCENE_36, FIELD_CONTENTS,
                analyzer).parse(querystr);
        Query queryCmbd = new QueryParser(Version.LUCENE_36,
                FIELD_CONTENTS, analyzer).parse(queryStringCmbd);
        int hitsPerPage = 10;
        IndexReader indexReader = IndexReader.open(directory);
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        TopScoreDocCollector collector = TopScoreDocCollector.create(
                hitsPerPage, true);
        indexSearcher.search(queryCmbd, collector);
        ScoreDoc[] hits = collector.topDocs().scoreDocs;
        System.out.println("Search Results>>>>>>>>>>>>" + queryCmbd);
        docNames = new ArrayList<DocNames>();
        for (int i = 0; i < hits.length; ++i) {
            int docId = hits[i].doc;
            Document d = indexSearcher.doc(docId);
            DocNames doc = new DocNames();
            doc.setIndex(i + 1);
            doc.setDocName(d.get("title"));
            doc.setDocPath(d.get("path"));
            if (!(d.get("path").contains("indexDirectory"))) {
                docNames.add(doc);
            }
        }
        indexReader.flush();
        indexReader.close();
        indexSearcher.close();
        return docNames;
    } catch (CorruptIndexException e) {
        closeIndex(analyzer);
        e.printStackTrace();
        return null;
    } catch (IOException e) {
        closeIndex(analyzer);
        e.printStackTrace();
        return null;
    } catch (ParseException e) {
        closeIndex(analyzer);
        e.printStackTrace();
        return null;
    }
}
Your problem comes from the fact that you're using StandardAnalyzer. If you read its javadoc, it says that it uses StandardTokenizer for token splitting. This means text like doc1.txt will be split into doc1 and txt.
If you want to match the entire text, you need to use KeywordAnalyzer, both for indexing and searching. The code below shows the difference: with StandardAnalyzer the tokens are {"doc1", "txt"}, while with KeywordAnalyzer the only token is doc1.txt.
import java.io.StringReader;

import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

String foo = "foo:doc1.txt";

// StandardAnalyzer splits the input at punctuation
StandardAnalyzer sa = new StandardAnalyzer(Version.LUCENE_34);
TokenStream tokenStream = sa.tokenStream("foo", new StringReader(foo));
while (tokenStream.incrementToken()) {
    System.out.println(tokenStream.getAttribute(TermAttribute.class).term());
}
System.out.println("-------------");
// KeywordAnalyzer emits the entire input as a single token
KeywordAnalyzer ka = new KeywordAnalyzer();
TokenStream tokenStream2 = ka.tokenStream("foo", new StringReader(foo));
while (tokenStream2.incrementToken()) {
    System.out.println(tokenStream2.getAttribute(TermAttribute.class).term());
}
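To fix the original search, the title field would have to be analyzed with KeywordAnalyzer at both index time and query time; in Lucene 3.x the usual way to mix analyzers is PerFieldAnalyzerWrapper (KeywordAnalyzer for title, StandardAnalyzer for contents). The splitting behavior itself can also be illustrated without Lucene at all. Below is a stdlib-only sketch that merely mimics the two analyzers' token output; standardLike and keywordLike are illustrative names, not Lucene APIs, and the split regex is a rough approximation of the real tokenizer:

```java
import java.util.Arrays;
import java.util.List;

public class TokenizeDemo {

    // Roughly mimics StandardAnalyzer: lower-cases the input and splits
    // on any run of non-alphanumeric characters, so "doc1.txt" becomes
    // the two tokens [doc1, txt].
    static List<String> standardLike(String text) {
        return Arrays.asList(text.toLowerCase().split("[^a-z0-9]+"));
    }

    // Mimics KeywordAnalyzer: the whole input is kept as a single token.
    static List<String> keywordLike(String text) {
        return Arrays.asList(text);
    }

    public static void main(String[] args) {
        System.out.println(standardLike("doc1.txt")); // [doc1, txt]
        System.out.println(keywordLike("new$.txt"));  // [new$.txt]
    }
}
```

This also makes the symptom clear: a phrase query built from the split tokens (title:"doc1 txt") can never match a field that was indexed as the single keyword token doc1.txt.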