I'm indexing a large HTML page with Solr Cell, using the following curl command from the Windows Command Prompt:
curl http://localhost:8987/solr/myexample/update/extract -d @test.html -H 'Content-type:html'
When I query the fields in the Solr admin UI (query?q=*:*&q.op=OR&indent=true), some of the text is missing. For example, the page contains a long run of lorem ipsum <p> tags, and near the end there is one more paragraph tag containing "Hello world". That last paragraph does not show up in the Solr admin UI.
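One way to confirm that the late text really is missing from the index, rather than just cut off in the admin display, is to search for it directly. The sketch below is only an illustration: the helper names `build_phrase_query_url` and `num_found` are made up for this example, and it assumes the standard /select handler and the `p` field from the schema further down.

```python
import json
import urllib.parse


def build_phrase_query_url(base_url, field, phrase):
    """Return a /select URL that searches one field for an exact phrase."""
    # rows=0 because only the hit count matters, not the documents themselves.
    params = urllib.parse.urlencode(
        {"q": f'{field}:"{phrase}"', "rows": "0", "wt": "json"}
    )
    return f"{base_url}/select?{params}"


def num_found(response_text):
    """Read numFound from a standard Solr JSON select response."""
    return json.loads(response_text)["response"]["numFound"]
```

Fetching the URL built for `p` and "Hello world" and passing the response body to `num_found` should give 0 if that paragraph never reached the index.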
I found the following on the old wiki.
Large individual fields.
It is possible to store megabytes of text in one record. These fields are clumsy to work with. By default the number of characters stored is clipped.
The wiki does not explain how to prevent the text from being clipped, and I'm not sure clipping is even the cause, because my fields are cut off long before they hold megabytes of data.
schema.xml
<field name="main" type="text_general" indexed="true" stored="true"/>
<field name="div" type="text_general" indexed="true" stored="true"/>
<field name="doc_id" type="string" uninvertible="true" indexed="true" stored="true"/>
<field name="date_pub" type="pdate" uninvertible="true" indexed="true" stored="true"/>
<field name="p" type="text_general" uninvertible="true" indexed="true" stored="true"/>
<field name="_text_" type="text_general" indexed="true" stored="true" multiValued="true"/>
<copyField source="*" dest="_text_"/>
solrconfig.xml
<requestHandler name="/update/extract"
class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
<lst name="defaults">
<str name="lowernames">true</str>
<str name="uprefix">ignored_</str>
<str name="fmap.content">content</str>
<str name="capture">div</str>
<str name="fmap.div">div</str>
<str name="capture">h1</str>
<str name="fmap.h1">h1</str>
<str name="capture">h2</str>
<str name="fmap.h2">h2_t</str>
<str name="capture">p</str>
<str name="fmap.p">p</str>
</lst>
</requestHandler>
Solr Version: 8.10.1
Solr Cell doesn't seem to limit the number of characters. The culprit, for reasons I can't explain, was the curl command I was using:
curl http://localhost:8987/solr/myexample/update/extract -d @test.html -H 'Content-type:html'
Solution: the following command indexes all of the text without truncating it (replace the paths with the locations of your post.jar and HTML file):
java -jar -Dc=myexample -Dauto example\exampledocs\post.jar example\exampledocs\sample.html
Note that these are Windows commands for the Command Prompt.
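A possible explanation, not verified against this setup: curl's documentation says that -d sends the body as application/x-www-form-urlencoded and strips carriage returns and newlines when it reads from a file, whereas --data-binary posts the file exactly as it is. If you would rather keep an HTTP call than switch to post.jar, the sketch below posts the raw bytes with an explicit text/html content type. The function names and the commit parameter are choices made for this example, and whether this avoids the truncation seen above is an assumption.

```python
import urllib.parse
import urllib.request
from pathlib import Path


def build_extract_request(html_path, base_url, extra_params=None):
    """Build a POST to /update/extract that carries the file's raw bytes."""
    url = f"{base_url}/update/extract"
    if extra_params:
        url += "?" + urllib.parse.urlencode(extra_params)
    body = Path(html_path).read_bytes()  # sent unchanged, line breaks included
    return urllib.request.Request(
        url=url,
        data=body,
        headers={"Content-Type": "text/html"},
        method="POST",
    )


def send(request):
    """Send the request and return the HTTP status code (needs a running Solr)."""
    with urllib.request.urlopen(request) as response:
        return response.status
```

For the core in the question this would be called as `send(build_extract_request("test.html", "http://localhost:8987/solr/myexample", {"commit": "true"}))`.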