Indexing CJK text and stripping HTML tags

I am using eZ Find, the eZ Publish front end to Solr, to index some fields containing Japanese text and HTML tags.

I modified the text analyzer in schema.xml as follows:

<fieldType name="text" class="solr.TextField">
    <analyzer>
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="solr.CJKTokenizerFactory"/>
    </analyzer>
</fieldType>

If, for example, my custom field contains:

<h1>ほげほげ<h1>
<p>すもももももももものうち</p>

and I search for すもも in the Solr admin, the HTML tags appear in the result:

<str name="attr_free_1_t"><h1>ほげほげ<h1><p>すもももももももものうち</p></str>

How could I prevent the HTML tags from getting indexed?

Thanks in advance.

Solution

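solr.HTMLStripCharFilterFactory only stops the HTML tags from being indexed; it does not stop them from being stored. Char filters run inside the analysis chain, while the stored value of a field is always the original input.

In other words, a search for "すもももももももものうち" matches the document (and the returned value still contains the HTML tags, because it is the stored original), but a search for "<p>すもももももももものうち</p>" does not.

Note: This assumes that you do not strip HTML tags from the query at search time.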
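
If the tags must not appear in the returned value either, they have to be removed before the document is stored. The following is a minimal sketch for solrconfig.xml, under assumptions that are not from the question: it assumes a Solr version that ships solr.HTMLStripFieldUpdateProcessorFactory (4.0 or later), the chain name html-strip is made up for illustration, and the field name is the one shown in the result above.

    <!-- Hypothetical chain name; select it with the update.chain request parameter -->
    <updateRequestProcessorChain name="html-strip">
        <!-- Strips markup from the field value before it is indexed and stored -->
        <processor class="solr.HTMLStripFieldUpdateProcessorFactory">
            <str name="fieldName">attr_free_1_t</str>
        </processor>
        <processor class="solr.LogUpdateProcessorFactory"/>
        <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>

On older versions, the equivalent is to strip the markup on the client side before sending the document to Solr.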

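Another way to keep the HTML tags out of the index is solr.PatternReplaceCharFilterFactory, which removes whatever matches a regular expression before the text is tokenized. Like any char filter, it changes only the indexed terms, not the stored value.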

Your configuration might look like this:

    <analyzer>
        <charFilter class="solr.PatternReplaceCharFilterFactory" 
                    pattern="Your regular expression to match HTML tags" 
                    replacement=""/>
        <tokenizer class="solr.CJKTokenizerFactory"/>
    </analyzer>
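
As an illustration of the pattern attribute, here is a minimal sketch. The regular expression is an assumption of mine, not part of the original configuration: it treats anything between < and > as a tag, so it is cruder than HTMLStripCharFilterFactory (it does not decode character entities, and a > inside an attribute value or comment will confuse it). The angle brackets are written as &lt; and &gt; because the pattern sits inside an XML attribute. The replacement is a single space rather than an empty string, so that text on either side of a tag is not joined into one CJK bigram.

    <analyzer>
        <!-- Assumed pattern: "<", then one or more characters other than ">", then ">" -->
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="&lt;[^&gt;]+&gt;"
                    replacement=" "/>
        <tokenizer class="solr.CJKTokenizerFactory"/>
    </analyzer>

After changing the analyzer, the affected documents need to be reindexed for the new terms to take effect.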