Consider the analysis of the first sentence from the Wikipedia page for Albert Einstein:
and its output:
Question: Is there any way to get this in some semi-structured form from Solr? Ultimately, I am interested in mapping the character sequences from the original text to the exact tokens of the last line.
The web interface in Solr is a thin HTML/JavaScript application that works by making calls back into Solr's REST interface to perform any actual work. If you watch the network tab in your browser when you ask the web interface to perform analysis, you can see that it's making a request to:
http://localhost:8080/solr/corename/analysis/field?wt=json&analysis.showmatch=true&analysis.fieldvalue=foo%20bar&analysis.query=foo%20bar&analysis.fieldtype=text_no
And the response is a JSON structure used to build the UI you see:
{
  "responseHeader":{
    "status":0,
    "QTime":108
  },
  "analysis":{
    "field_types":{
      "text_no":{
        "index":[
          "org.apache.lucene.analysis.standard.StandardTokenizer",
          [
            {
              "text":"foo",
              "raw_bytes":"[66 6f 6f]",
              "match":true,
              "start":0,
              "end":3,
              "org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute#positionLength":1,
              "type":"<ALPHANUM>",
              "position":1,
              "positionHistory":[
                1
              ]
            },
            {
              "text":"bar",
              "raw_bytes":"[62 61 72]",
              "match":true,
              "start":4,
              "end":7,
              "org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute#positionLength":1,
              "type":"<ALPHANUM>",
              "position":2,
              "positionHistory":[
                2
              ]
            }
          ],
          // .....
        ],
        "query":[
          "org.apache.lucene.analysis.standard.StandardTokenizer",
          [
            // ....
          ]
        ]
      }
    },
    "field_names":{
    }
  }
}
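You can then iterate through the "index" or "query" keys and pick the entries you need (last/first/etc.). Each of these is a flat list that alternates between an analysis stage's class name and the tokens that stage produced, and every token carries "start" and "end" offsets into the analyzed text.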
The URL and response format may have changed between Solr versions, but I'm pretty sure they have been stable for the last few major versions.
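As a rough sketch of how this could be scripted, the Python below calls the same endpoint and pairs each token from the last analysis stage with the slice of the original text it came from. The function names fetch_analysis and final_tokens are made up for this example, and it assumes the response has the shape shown above (alternating class names and token lists, with "start" and "end" being character offsets into analysis.fieldvalue). Check this against your own Solr version before relying on it.

```python
import json
import urllib.parse
import urllib.request


def fetch_analysis(core_url, field_type, text):
    """Request field analysis from Solr and return the parsed JSON.

    core_url is the base URL of the core, e.g. the
    "http://localhost:8080/solr/corename" part of the URL above.
    The parameter names are the ones seen in the admin UI request.
    """
    params = urllib.parse.urlencode({
        "wt": "json",
        "analysis.fieldtype": field_type,
        "analysis.fieldvalue": text,
    })
    with urllib.request.urlopen(f"{core_url}/analysis/field?{params}") as resp:
        return json.load(resp)


def final_tokens(response, field_type, text, phase="index"):
    """Return (token, start, end, source_text) for the last analysis stage.

    Assumes the phase entry is a flat list alternating between a stage's
    class name and that stage's output. Only list-valued outputs are
    treated as token lists; anything else (class names, or a plain string
    output) is skipped.
    """
    stages = response["analysis"]["field_types"][field_type][phase]
    token_lists = [stage for stage in stages if isinstance(stage, list)]
    if not token_lists:
        return []
    return [
        (tok["text"], tok["start"], tok["end"], text[tok["start"]:tok["end"]])
        for tok in token_lists[-1]
    ]
```

With a running Solr you would call something like final_tokens(fetch_analysis("http://localhost:8080/solr/corename", "text_no", text), "text_no", text). Keeping the parsing separate from the HTTP call means you can also feed it a response you saved from the browser's network tab.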