public class ICUTokenizerFactory extends TokenizerFactory implements ResourceLoaderAware
ICUTokenizer
.
Words are broken across script boundaries, then segmented according to
the BreakIterator and typing provided by the DefaultICUTokenizerConfig
.
To use the default set of per-script rules:
<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.ICUTokenizerFactory"/> </analyzer> </fieldType>
You can customize this tokenizer's behavior by specifying per-script rule files, which are compiled by the ICU RuleBasedBreakIterator. See the ICU RuleBasedBreakIterator syntax reference.
To add per-script rules, add a "rulefiles" argument, which should contain a comma-separated list of code:rulefile pairs in the following format: four-letter ISO 15924 script code, followed by a colon, then a resource path. E.g. to specify rules for Latin (script code "Latn") and Cyrillic (script code "Cyrl"):
<fieldType name="text_icu_custom" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.ICUTokenizerFactory" cjkAsWords="true" rulefiles="Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"/> </analyzer> </fieldType>
LUCENE_MATCH_VERSION_PARAM, luceneMatchVersion
Constructor and Description |
---|
ICUTokenizerFactory(Map<String,String> args)
Creates a new ICUTokenizerFactory
|
Modifier and Type | Method and Description |
---|---|
ICUTokenizer |
create(AttributeFactory factory)
Creates a TokenStream of the specified input using the given AttributeFactory
|
void |
inform(ResourceLoader loader)
Initializes this component with the provided ResourceLoader
(used for loading classes, files, etc).
|
availableTokenizers, create, forName, lookupClass, reloadTokenizers
get, get, get, get, get, getBoolean, getChar, getClassArg, getFloat, getInt, getLines, getLuceneMatchVersion, getOriginalArgs, getPattern, getSet, getSnowballWordSet, getWordSet, isExplicitLuceneMatchVersion, require, require, require, requireBoolean, requireChar, requireFloat, requireInt, setExplicitLuceneMatchVersion, splitFileNames
public ICUTokenizerFactory(Map<String,String> args)
public void inform(ResourceLoader loader) throws IOException
ResourceLoaderAware
inform
in interface ResourceLoaderAware
IOException
public ICUTokenizer create(AttributeFactory factory)
TokenizerFactory
create
in class TokenizerFactory