We have a problem that at the moment we are not allowed to use ElasticSearch, so we need to implement a search function with MySQL. One desired feature is a prefixed, tokenized search, so a sentence like
"The quick brown fox jumped over the lazy dog" could be findable when you search for "jump". I think I would need to define a rule like (pseudocode):
(*)(beginning OR whitespace)(prefix)(*)
I assume it is possible to do that with JPA (Criteria API)? But what if we have two terms? All of them have to be combined by AND, e.g. the above rule should result in TRUE for both terms in at least one column. That means "jump fox" would result in a hit, but "jump rabbit" would not. Is that also possible with Criteria API?
Or do you know a better solution than Criteria API? I heard Hibernate can do LIKE queries more elegantly (with less code) but unfortunately we use EclipseLink.
Based on the answer below here is my full solution. It's all in one method to keep it simple here ("simple JPA criteria API" is an oxymoron though). If anyone wants to use it, consider some refactoring
public List<Customer> findMatching(String searchPhrase) {
List<String> searchTokens = TextService.splitPhraseIntoNonEmptyTokens(searchPhrase);
if (searchTokens.size() < 1 || searchTokens.size() > 5) { // early out and denial of service attack prevention
return new ArrayList<>();
}
CriteriaBuilder criteriaBuilder = entityManager.getCriteriaBuilder();
CriteriaQuery<Customer> criteriaQuery = criteriaBuilder.createQuery(Customer.class);
Root<Customer> rootEntity = criteriaQuery.from(Customer.class);
Predicate[] orClausesArr = new Predicate[searchTokens.size()];
for (int i = 0; i < searchTokens.size() ; i++) {
// same normalization methods are used to create the indexed searchable data
String assumingKeyword = TextService.normalizeKeyword(searchTokens.get(i));
String assumingText = TextService.normalizeText(searchTokens.get(i));
String assumingPhoneNumber = TextService.normalizePhoneNumber(searchTokens.get(i));
String assumingKeywordInFirstToken = assumingKeyword + '%';
String assumingTextInFirstToken = assumingText + '%';
String assumingPhoneInFirstToken = assumingPhoneNumber + '%';
String assumingTextInConsecutiveToken = "% " + assumingText + '%';
Predicate query = criteriaBuilder.or(
criteriaBuilder.like(rootEntity.get("normalizedCustomerNumber"), assumingKeywordInFirstToken),
criteriaBuilder.like(rootEntity.get("normalizedPhone"), assumingPhoneInFirstToken),
criteriaBuilder.like(rootEntity.get("normalizedFullName"), assumingTextInFirstToken),
// looking for a prefix after a whitespace:
criteriaBuilder.like(rootEntity.get("normalizedFullName"), assumingTextInConsecutiveToken)
);
orClausesArr[i] = query;
}
criteriaQuery = criteriaQuery
.select(rootEntity) // you can also select only the display columns and ignore the normalized/search columns
.where(criteriaBuilder.and(orClausesArr))
.orderBy(
criteriaBuilder.desc(rootEntity.get("customerUpdated")),
criteriaBuilder.desc(rootEntity.get("customerCreated"))
);
try {
return entityManager
.createQuery(criteriaQuery)
.setMaxResults(50)
.getResultList();
} catch (NoResultException nre) {
return new ArrayList<>();
}
}
The Criteria API is certainly not intended for this but it can be used to create LIKE predicates.
So for each search term and each column you want to search you would create something like the following:
column like :term + '%'
or column like ' ' + :term + '%'
or column like ',' + :term + '%'
// repeat for all other punctuation marks and forms of whitespace you want to support.
This will create horribly inefficient queries!
I see the following alternatives:
Use database specific features. Some databases have some text search capabilities. If you can limit your application to one or few databases that might work.
Create your own index: Use a proper tokenizer to analyze the columns you want to search and put the resulting tokens in a separate table with backreferences to the original table. Now search that for the terms you are looking for. As long as you do only prefixed searches database indexes should be able to keep this reasonable efficient and it is easier to maintain and more flexible than what you can obtain by using the Criteria API on its own.