To search text efficiently and effectively, Solr (mostly Lucene, actually) splits the text into tokens both during indexing and during querying (search). The text can be pre-filtered before it is tokenized, and the resulting tokens can be post-filtered, for additional flexibility. This allows for things like case-insensitive search, matching misspelt product names, synonyms, and so on.
To achieve all this flexibility, Solr comes with quite a variety of methods to manipulate the text. Understanding which filters and tokenizers are available and what they actually do is a major stumbling block for new Solr users. This page provides a comprehensive overview of all the classes that can be used in Solr, together with links to their Javadoc pages.
Most of the analyzers, tokenizers and filters are located in lucene-analyzers-common-6.4.0.jar (server/solr-webapp/webapp/WEB-INF/lib/), so any entry without a location indicated can be found in that jar.
Note: all of this applies only to text fields whose fieldType class is solr.TextField. If the fieldType class is solr.StrField, the value does not get analyzed (similar to using a plain KeywordTokenizerFactory).
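To make the difference concrete, here is a minimal sketch of the two kinds of field type. The names string_exact and text_exact are made up for illustration; only the class names come from Solr itself.

```xml
<!-- Not analyzed: the whole value is indexed verbatim as one term. -->
<fieldType name="string_exact" class="solr.StrField"/>

<!-- Analyzed, but the tokenizer emits the entire input as a single token,
     so the indexed term is the same as for the StrField above.
     Unlike StrField, filters can be added after the tokenizer. -->
<fieldType name="text_exact" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>
```

The second form is mostly useful as a starting point, for example when an otherwise exact-match field should also be lowercased.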
The analyzers listed below are standalone: text goes in and a sequence of tokens comes out. The same analyzer is used during indexing and during search. Many of these come from Lucene itself. Only analyzers that can be used by Solr are listed here; Lucene has some other analyzers that cannot be used directly because they have non-standard initialization requirements.
<fieldType name="text_greek" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.el.GreekAnalyzer"/>
</fieldType>
Analyzer in lucene-core-6.4.0.jar (server/solr-webapp/webapp/WEB-INF/lib/): An Analyzer builds TokenStreams, which analyze text.
AnalyzerWrapper in lucene-core-6.4.0.jar (server/solr-webapp/webapp/WEB-INF/lib/): Extension to Analyzer suitable for Analyzers which wrap other Analyzers.
ShingleAnalyzerWrapper: A ShingleAnalyzerWrapper wraps a ShingleFilter around another Analyzer.
DutchAnalyzer: Analyzer for Dutch language.
KeywordAnalyzer: "Tokenizes" the entire stream as a single token.
MorfologikAnalyzer in lucene-analyzers-morfologik-6.4.0.jar (contrib/analysis-extras/lucene-libs/): org.apache.lucene.analysis.Analyzer using Morfologik library.
SimpleAnalyzer: An Analyzer that filters LetterTokenizer with LowerCaseFilter.
SmartChineseAnalyzer in lucene-analyzers-smartcn-6.4.0.jar (contrib/analysis-extras/lucene-libs/): SmartChineseAnalyzer is an analyzer for Chinese or mixed Chinese-English text.
StopwordAnalyzerBase in lucene-core-6.4.0.jar (server/solr-webapp/webapp/WEB-INF/lib/): Base class for Analyzers that need to make use of stopword sets.
ArabicAnalyzer: Analyzer for Arabic.
ArmenianAnalyzer: Analyzer for Armenian.
BasqueAnalyzer: Analyzer for Basque.
BrazilianAnalyzer: Analyzer for Brazilian Portuguese language.
BulgarianAnalyzer: Analyzer for Bulgarian.
CatalanAnalyzer: Analyzer for Catalan.
CJKAnalyzer: An Analyzer that tokenizes text with StandardTokenizer, normalizes content with CJKWidthFilter, folds case with LowerCaseFilter, forms bigrams of CJK with CJKBigramFilter, and filters stopwords with StopFilter.
ClassicAnalyzer: Filters ClassicTokenizer with ClassicFilter, LowerCaseFilter and StopFilter, using a list of English stop words.
CzechAnalyzer: Analyzer for Czech language.
DanishAnalyzer: Analyzer for Danish.
EnglishAnalyzer: Analyzer for English.
FinnishAnalyzer: Analyzer for Finnish.
FrenchAnalyzer: Analyzer for French language.
GalicianAnalyzer: Analyzer for Galician.
GermanAnalyzer: Analyzer for German language.
GreekAnalyzer: Analyzer for the Greek language.
HindiAnalyzer: Analyzer for Hindi.
HungarianAnalyzer: Analyzer for Hungarian.
IndonesianAnalyzer: Analyzer for Indonesian (Bahasa).
IrishAnalyzer: Analyzer for Irish.
ItalianAnalyzer: Analyzer for Italian.
JapaneseAnalyzer in lucene-analyzers-kuromoji-6.4.0.jar (server/solr-webapp/webapp/WEB-INF/lib/): Analyzer for Japanese that uses morphological analysis.
LatvianAnalyzer: Analyzer for Latvian.
LithuanianAnalyzer: Analyzer for Lithuanian.
NorwegianAnalyzer: Analyzer for Norwegian.
PersianAnalyzer: Analyzer for Persian.
PolishAnalyzer in lucene-analyzers-stempel-6.4.0.jar (contrib/analysis-extras/lucene-libs/): Analyzer for Polish.
PortugueseAnalyzer: Analyzer for Portuguese.
RomanianAnalyzer: Analyzer for Romanian.
RussianAnalyzer: Analyzer for Russian language.
SoraniAnalyzer: Analyzer for Sorani Kurdish.
SpanishAnalyzer: Analyzer for Spanish.
StandardAnalyzer in lucene-core-6.4.0.jar (server/solr-webapp/webapp/WEB-INF/lib/): Filters StandardTokenizer with StandardFilter, LowerCaseFilter and StopFilter, using a list of English stop words.
StopAnalyzer: Filters LetterTokenizer with LowerCaseFilter and StopFilter.
SwedishAnalyzer: Analyzer for Swedish.
ThaiAnalyzer: Analyzer for Thai language.
TurkishAnalyzer: Analyzer for Turkish.
UAX29URLEmailAnalyzer: Filters org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer with org.apache.lucene.analysis.standard.StandardFilter, org.apache.lucene.analysis.LowerCaseFilter and org.apache.lucene.analysis.StopFilter, using a list of English stop words.
UkrainianMorfologikAnalyzer in lucene-analyzers-morfologik-6.4.0.jar (contrib/analysis-extras/lucene-libs/): A dictionary-based Analyzer for Ukrainian.
UnicodeWhitespaceAnalyzer: An Analyzer that uses UnicodeWhitespaceTokenizer.
WhitespaceAnalyzer: An Analyzer that uses WhitespaceTokenizer.
A more flexible approach than a single all-encompassing analyzer is to chain and configure individual components together to fit particular customer requirements. Solr allows up to three types of components in the chain:
CharFilterFactory: Abstract parent class for analysis factories that create CharFilter instances.
HTMLStripCharFilterFactory (Sample mentions: solr-1): A CharFilter that wraps another Reader and attempts to strip out HTML constructs.
ICUNormalizer2CharFilterFactory (multi) in lucene-analyzers-icu-6.4.0.jar (contrib/analysis-extras/lucene-libs/): Normalize token text with ICU's Normalizer2.
JapaneseIterationMarkCharFilterFactory (multi) in lucene-analyzers-kuromoji-6.4.0.jar (server/solr-webapp/webapp/WEB-INF/lib/): Normalizes Japanese horizontal iteration marks (odoriji) to their expanded form.
MappingCharFilterFactory (multi): Simplistic CharFilter that applies the mappings contained in a NormalizeCharMap to the character stream and corrects the resulting changes to the offsets.
PatternReplaceCharFilterFactory (Sample mentions: indexing-book-1, solr-in-action-book-1): CharFilter that uses a regular expression for the target of replace string.
PersianCharFilterFactory (multi) (Sample mentions: solr-1): CharFilter that replaces instances of Zero-width non-joiner with an ordinary space.
TokenizerFactory: Abstract parent class for analysis factories that create Tokenizer instances.
ClassicTokenizerFactory: A grammar-based tokenizer constructed with JFlex.
EdgeNGramTokenizerFactory: Creates new instances of EdgeNGramTokenizer.
HMMChineseTokenizerFactory in lucene-analyzers-smartcn-6.4.0.jar (contrib/analysis-extras/lucene-libs/): Tokenizer for Chinese or mixed Chinese-English text.
ICUTokenizerFactory in lucene-analyzers-icu-6.4.0.jar (contrib/analysis-extras/lucene-libs/) (Sample mentions: typo3-1): Breaks text into words according to UAX #29: Unicode Text Segmentation (http://www.unicode.org/reports/tr29/).
JapaneseTokenizerFactory in lucene-analyzers-kuromoji-6.4.0.jar (server/solr-webapp/webapp/WEB-INF/lib/) (Sample mentions: solr-1): Tokenizer for Japanese that uses morphological analysis.
KeywordTokenizerFactory (Sample mentions: solr-1): Emits the entire input as a single token.
LetterTokenizerFactory: A LetterTokenizer is a tokenizer that divides text at non-letters.
LowerCaseTokenizerFactory (multi): LowerCaseTokenizer performs the function of LetterTokenizer and LowerCaseFilter together.
NGramTokenizerFactory: Tokenizes the input into n-grams of the given size(s).
PathHierarchyTokenizerFactory (Sample mentions: solr-1, blacklight-1): Tokenizer for path-like hierarchies.
PatternTokenizerFactory (Sample mentions: solr-in-action-book-1): This tokenizer uses regex pattern matching to construct distinct tokens for the input stream.
StandardTokenizerFactory (Sample mentions: solr-1): A grammar-based tokenizer constructed with JFlex.
ThaiTokenizerFactory (Sample mentions: solr-1): Tokenizer that uses BreakIterator to tokenize Thai text.
UAX29URLEmailTokenizerFactory (Sample mentions: solr-1): This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs.
UIMAAnnotationsTokenizerFactory in lucene-analyzers-uima-6.4.0.jar (contrib/uima/lucene-libs/): org.apache.lucene.analysis.util.TokenizerFactory for UIMAAnnotationsTokenizer.
UIMATypeAwareAnnotationsTokenizerFactory in lucene-analyzers-uima-6.4.0.jar (contrib/uima/lucene-libs/): org.apache.lucene.analysis.util.TokenizerFactory for UIMATypeAwareAnnotationsTokenizer.
WhitespaceTokenizerFactory (Sample mentions: solr-1): A tokenizer that divides text at whitespace characters as defined by Character#isWhitespace(int).
WikipediaTokenizerFactory: Extension of StandardTokenizer that is aware of Wikipedia syntax.
TokenFilterFactory: Abstract parent class for analysis factories that create org.apache.lucene.analysis.TokenFilter instances.
ApostropheFilterFactory (Sample mentions: solr-1): Strips all characters after an apostrophe (including the apostrophe itself).
ArabicNormalizationFilterFactory (multi) (Sample mentions: solr-1): A TokenFilter that applies ArabicNormalizer to normalize the orthography.
ArabicStemFilterFactory (Sample mentions: solr-1): A TokenFilter that applies ArabicStemmer to stem Arabic words.
ASCIIFoldingFilterFactory (multi) (Sample mentions: solr-in-action-book-1): This class converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists.
BaseManagedTokenFilterFactory in solr-core-6.4.0.jar (dist/): Abstract base class for implementing TokenFilterFactory objects that are managed by the REST API.
ManagedStopFilterFactory in solr-core-6.4.0.jar (dist/) (Sample mentions: solr-1, typo3-1 to typo3-35): TokenFilterFactory that uses the ManagedWordSetResource implementation for managing stop words using the REST API.
ManagedSynonymFilterFactory in solr-core-6.4.0.jar (dist/) (Sample mentions: solr-1, typo3-1 to typo3-40): TokenFilterFactory and ManagedResource implementation for doing CRUD on synonyms using the REST API.
BeiderMorseFilterFactory in lucene-analyzers-phonetic-6.4.0.jar (server/solr-webapp/webapp/WEB-INF/lib/): TokenFilter for Beider-Morse phonetic encoding.
BrazilianStemFilterFactory (Sample mentions: typo3-1): A TokenFilter that applies BrazilianStemmer.
BulgarianStemFilterFactory (Sample mentions: solr-1): A TokenFilter that applies BulgarianStemmer to stem Bulgarian words.
CapitalizationFilterFactory: A filter to apply normal capitalization rules to Tokens.
CJKBigramFilterFactory (Sample mentions: solr-1, typo3-1): Forms bigrams of CJK terms that are generated from StandardTokenizer or ICUTokenizer.
CJKWidthFilterFactory (multi) (Sample mentions: solr-1): A TokenFilter that normalizes CJK width differences.
ClassicFilterFactory: Normalizes tokens extracted with ClassicTokenizer.
CodepointCountFilterFactory: Removes words that are too long or too short from the stream.
CommonGramsFilterFactory: Constructs a CommonGramsFilter.
CommonGramsQueryFilterFactory: Constructs a CommonGramsQueryFilter.
CzechStemFilterFactory (Sample mentions: solr-1): A TokenFilter that applies CzechStemmer to stem Czech words.
DaitchMokotoffSoundexFilterFactory in lucene-analyzers-phonetic-6.4.0.jar (server/solr-webapp/webapp/WEB-INF/lib/): Create tokens for phonetic matches based on Daitch–Mokotoff Soundex.
DateRecognizerFilterFactory: Filters all tokens that cannot be parsed to a date, using the provided DateFormat.
DecimalDigitFilterFactory (multi): Folds all Unicode digits in [:General_Category=Decimal_Number:] to Basic Latin digits (0-9).
DelimitedPayloadTokenFilterFactory (Sample mentions: solr-1): Characters before the delimiter are the "token", those after are the payload.
DictionaryCompoundWordTokenFilterFactory (Sample mentions: typo3-1): A org.apache.lucene.analysis.TokenFilter that decomposes compound words found in many Germanic languages.
DoubleMetaphoneFilterFactory in lucene-analyzers-phonetic-6.4.0.jar (server/solr-webapp/webapp/WEB-INF/lib/) (Sample mentions: solr-1): Filter for DoubleMetaphone (supporting secondary codes).
EdgeNGramFilterFactory (Sample mentions: solr-in-action-book-1): Creates new instances of EdgeNGramTokenFilter.
ElisionFilterFactory (multi) (Sample mentions: solr-1 to solr-4, typo3-1): Removes elisions from a TokenStream.
EnglishMinimalStemFilterFactory (Sample mentions: solr-1): A TokenFilter that applies EnglishMinimalStemmer to stem English words.
EnglishPossessiveFilterFactory (Sample mentions: solr-1): TokenFilter that removes possessives (trailing 's) from words.
FingerprintFilterFactory: Filter outputs a single token which is a concatenation of the sorted and de-duplicated set of input tokens.
FinnishLightStemFilterFactory: A TokenFilter that applies FinnishLightStemmer to stem Finnish words.
FlattenGraphFilterFactory: Converts an incoming graph token stream, such as one from SynonymGraphFilter, into a flat form so that all nodes form a single linear chain with no side paths.
FrenchLightStemFilterFactory (Sample mentions: solr-1): A TokenFilter that applies FrenchLightStemmer to stem French words.
FrenchMinimalStemFilterFactory: A TokenFilter that applies FrenchMinimalStemmer to stem French words.
GalicianMinimalStemFilterFactory: A TokenFilter that applies GalicianMinimalStemmer to stem Galician words.
GalicianStemFilterFactory (Sample mentions: solr-1): A TokenFilter that applies GalicianStemmer to stem Galician words.
GermanLightStemFilterFactory (Sample mentions: solr-1): A TokenFilter that applies GermanLightStemmer to stem German words.
GermanMinimalStemFilterFactory: A TokenFilter that applies GermanMinimalStemmer to stem German words.
GermanNormalizationFilterFactory (multi) (Sample mentions: solr-1): Normalizes German characters according to the heuristics of the German2 snowball algorithm.
GermanStemFilterFactory: A TokenFilter that stems German words.
GreekLowerCaseFilterFactory (multi) (Sample mentions: solr-1): Normalizes token text to lower case, removes some Greek diacritics, and standardizes final sigma to sigma.
GreekStemFilterFactory (Sample mentions: solr-1): A TokenFilter that applies GreekStemmer to stem Greek words.
HindiNormalizationFilterFactory (multi) (Sample mentions: solr-1): A TokenFilter that applies HindiNormalizer to normalize the orthography.
HindiStemFilterFactory (Sample mentions: solr-1): A TokenFilter that applies HindiStemmer to stem Hindi words.
HungarianLightStemFilterFactory: A TokenFilter that applies HungarianLightStemmer to stem Hungarian words.
HunspellStemFilterFactory: TokenFilterFactory that creates instances of HunspellStemFilter.
HyphenatedWordsFilterFactory: When the plain text is extracted from documents, we will often have many words hyphenated and broken into two lines.
HyphenationCompoundWordTokenFilterFactory: A org.apache.lucene.analysis.TokenFilter that decomposes compound words found in many Germanic languages.
ICUFoldingFilterFactory (multi) in lucene-analyzers-icu-6.4.0.jar (contrib/analysis-extras/lucene-libs/) (Sample mentions: indexing-book-1): A TokenFilter that applies search term folding to Unicode text, applying foldings from UTR#30 Character Foldings.
ICUNormalizer2FilterFactory (multi) in lucene-analyzers-icu-6.4.0.jar (contrib/analysis-extras/lucene-libs/): Normalize token text with ICU's com.ibm.icu.text.Normalizer2.
ICUTransformFilterFactory (multi) in lucene-analyzers-icu-6.4.0.jar (contrib/analysis-extras/lucene-libs/): A TokenFilter that transforms text with ICU.
IndicNormalizationFilterFactory (multi) (Sample mentions: solr-1): A TokenFilter that applies IndicNormalizer to normalize text in Indian Languages.
IndonesianStemFilterFactory (Sample mentions: solr-1): A TokenFilter that applies IndonesianStemmer to stem Indonesian words.
IrishLowerCaseFilterFactory (multi) (Sample mentions: solr-1): Normalises token text to lower case, handling t-prothesis and n-eclipsis (i.e., that 'nAthair' should become 'n-athair').
ItalianLightStemFilterFactory (Sample mentions: solr-1): A TokenFilter that applies ItalianLightStemmer to stem Italian words.
JapaneseBaseFormFilterFactory in lucene-analyzers-kuromoji-6.4.0.jar (server/solr-webapp/webapp/WEB-INF/lib/) (Sample mentions: solr-1): Replaces term text with the BaseFormAttribute.
JapaneseKatakanaStemFilterFactory in lucene-analyzers-kuromoji-6.4.0.jar (server/solr-webapp/webapp/WEB-INF/lib/) (Sample mentions: solr-1): A TokenFilter that normalizes common katakana spelling variations ending in a long sound character by removing this character (U+30FC).
JapaneseNumberFilterFactory in lucene-analyzers-kuromoji-6.4.0.jar (server/solr-webapp/webapp/WEB-INF/lib/): A TokenFilter that normalizes Japanese numbers (kansūji) to regular Arabic decimal numbers in half-width characters.
JapanesePartOfSpeechStopFilterFactory in lucene-analyzers-kuromoji-6.4.0.jar (server/solr-webapp/webapp/WEB-INF/lib/) (Sample mentions: solr-1): Removes tokens that match a set of part-of-speech tags.
JapaneseReadingFormFilterFactory in lucene-analyzers-kuromoji-6.4.0.jar (server/solr-webapp/webapp/WEB-INF/lib/): A org.apache.lucene.analysis.TokenFilter that replaces the term attribute with the reading of a token in either katakana or romaji form.
KeepWordFilterFactory: A TokenFilter that only keeps tokens with text contained in the required words.
KeywordMarkerFilterFactory (Sample mentions: solr-1, typo3-1 to typo3-9): Marks terms as keywords via the KeywordAttribute.
KeywordRepeatFilterFactory: This TokenFilter emits each incoming token twice, once as keyword and once as non-keyword; in other words, once with KeywordAttribute#setKeyword(boolean) set to true and once set to false.
KStemFilterFactory (Sample mentions: solr-in-action-book-1): A high-performance kstem filter for English.
LatvianStemFilterFactory (Sample mentions: solr-1): A TokenFilter that applies LatvianStemmer to stem Latvian words.
LengthFilterFactory (Sample mentions: solr-1, solr-2): Removes words that are too long or too short from the stream.
LimitTokenCountFilterFactory: This TokenFilter limits the number of tokens while indexing.
LimitTokenOffsetFilterFactory: Lets all tokens pass through until it sees one with a start offset greater than a configured limit, which won't pass and ends the stream.
LimitTokenPositionFilterFactory: This TokenFilter limits its emitted tokens to those with positions that are not greater than the configured limit.
LowerCaseFilterFactory (multi) (Sample mentions: solr-1): Normalizes token text to lower case.
MinHashFilterFactory: TokenFilterFactory for MinHashFilter.
MorfologikFilterFactory in lucene-analyzers-morfologik-6.4.0.jar (contrib/analysis-extras/lucene-libs/): Filter factory for MorfologikFilter.
NGramFilterFactory: Tokenizes the input into n-grams of the given size(s).
NorwegianLightStemFilterFactory: A TokenFilter that applies NorwegianLightStemmer to stem Norwegian words.
NorwegianMinimalStemFilterFactory: A TokenFilter that applies NorwegianMinimalStemmer to stem Norwegian words.
NumericPayloadTokenFilterFactory: Assigns a payload to a token based on the org.apache.lucene.analysis.Token#type().
PatternCaptureGroupFilterFactory: CaptureGroup uses Java regexes to emit multiple tokens - one for each capture group in one or more patterns.
PatternReplaceFilterFactory (Sample mentions: solr-1 to solr-3): A TokenFilter which applies a Pattern to each token in the stream, replacing match occurrences with the specified replacement string.
PersianNormalizationFilterFactory (multi) (Sample mentions: solr-1): A TokenFilter that applies PersianNormalizer to normalize the orthography.
PhoneticFilterFactory in lucene-analyzers-phonetic-6.4.0.jar (server/solr-webapp/webapp/WEB-INF/lib/): Create tokens for phonetic matches.
PorterStemFilterFactory (Sample mentions: solr-1): Transforms the token stream as per the Porter stemming algorithm.
PortugueseLightStemFilterFactory (Sample mentions: solr-1): A TokenFilter that applies PortugueseLightStemmer to stem Portuguese words.
PortugueseMinimalStemFilterFactory: A TokenFilter that applies PortugueseMinimalStemmer to stem Portuguese words.
PortugueseStemFilterFactory: A TokenFilter that applies PortugueseStemmer to stem Portuguese words.
RemoveDuplicatesTokenFilterFactory (Sample mentions: solr-1): A TokenFilter which filters out Tokens at the same position and Term text as the previous token in the stream.
ReversedWildcardFilterFactory in solr-core-6.4.0.jar (dist/) (Sample mentions: solr-1): This class produces a special form of reversed tokens, suitable for better handling of leading wildcards.
ReverseStringFilterFactory: Reverse token string, for example "country" => "yrtnuoc".
RussianLightStemFilterFactory (Sample mentions: indexing-book-1): A TokenFilter that applies RussianLightStemmer to stem Russian words.
ScandinavianFoldingFilterFactory (multi): This filter folds Scandinavian characters åÅäæÄÆ->a and öÖøØ->o.
ScandinavianNormalizationFilterFactory (multi): This filter normalizes use of the interchangeable Scandinavian characters æÆäÄöÖøØ and folded variants (aa, ao, ae, oe and oo) by transforming them to åÅæÆøØ.
SerbianNormalizationFilterFactory (multi) (Sample mentions: typo3-1): Normalizes Serbian Cyrillic and Latin characters to "bald" Latin.
ShingleFilterFactory (Sample mentions: solr-1): A ShingleFilter constructs shingles (token n-grams) from a token stream.
SnowballPorterFilterFactory (Sample mentions: solr-1 to solr-13, typo3-1 to typo3-20, blacklight-1): A filter that stems words using a Snowball-generated stemmer.
SoraniNormalizationFilterFactory (multi) (Sample mentions: solr-1): A TokenFilter that applies SoraniNormalizer to normalize the orthography.
SoraniStemFilterFactory (Sample mentions: solr-1): A TokenFilter that applies SoraniStemmer to stem Sorani words.
SpanishLightStemFilterFactory (Sample mentions: solr-1): A TokenFilter that applies SpanishLightStemmer to stem Spanish words.
StandardFilterFactory (Sample mentions: typo3-1): Normalizes tokens extracted with StandardTokenizer.
StemmerOverrideFilterFactory (Sample mentions: solr-1): Provides the ability to override any KeywordAttribute aware stemmer with custom dictionary-based stemming.
StempelPolishStemFilterFactory in lucene-analyzers-stempel-6.4.0.jar (contrib/analysis-extras/lucene-libs/) (Sample mentions: typo3-1): Transforms the token stream as per the stemming algorithm.
StopFilterFactory (Sample mentions: solr-1 to solr-33, indexing-book-1, indexing-book-2, blacklight-1): Removes stop words from a token stream.
SuggestStopFilterFactory in lucene-suggest-6.4.0.jar (server/solr-webapp/webapp/WEB-INF/lib/): Like StopFilter except it will not remove the last token if that token was not followed by some token separator.
SwedishLightStemFilterFactory: A TokenFilter that applies SwedishLightStemmer to stem Swedish words.
SynonymFilterFactory (Sample mentions: solr-1, solr-2): Matches single or multi word synonyms in a token stream.
SynonymGraphFilterFactory: Applies single- or multi-token synonyms from a SynonymMap to an incoming TokenStream, producing a fully correct graph output.
TokenOffsetPayloadTokenFilterFactory: Adds the OffsetAttribute#startOffset() and OffsetAttribute#endOffset() as a payload; the first 4 bytes are the start.
TrimFilterFactory (Sample mentions: solr-1): Trims leading and trailing whitespace from Tokens in the stream.
TruncateTokenFilterFactory: A token filter for truncating the terms into a specific length.
TurkishLowerCaseFilterFactory (multi) (Sample mentions: solr-1): Normalizes Turkish token text to lower case.
TypeAsPayloadTokenFilterFactory: Makes the org.apache.lucene.analysis.Token#type() a payload.
TypeTokenFilterFactory (Sample mentions: solr-1, indexing-book-1): Factory class for TypeTokenFilter.
UpperCaseFilterFactory (multi): Normalizes token text to UPPER CASE.
WordDelimiterFilterFactory (Sample mentions: solr-1 to solr-3, typo3-1 to typo3-99, solr-in-action-book-1, solr-in-action-book-2): Splits words into subwords and performs optional transformations on subword groups.
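As an illustration of how the three component types fit together, the following sketch chains one char filter, one tokenizer and several token filters taken from the lists above. The field type name text_html_en is hypothetical, and the particular selection of filters is only an assumption about what a field holding English HTML might need.

```xml
<fieldType name="text_html_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- 1. Char filter: works on the raw character stream before tokenizing. -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!-- 2. Tokenizer: exactly one per chain; splits the text into tokens. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- 3. Token filters: applied to the tokens in the order listed. -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```

Order matters within the filter section: each filter sees the output of the one before it, so lowercasing here happens before possessive removal and stemming.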
In Solr, the text is analyzed twice: once when it gets indexed and once when it gets queried (searched).
It's possible to define the same chain for both of these phases:
<fieldType name="text_fa" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.PersianCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ArabicNormalizationFilterFactory"/>
    <filter class="solr.PersianNormalizationFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_fa.txt" />
  </analyzer>
</fieldType>
Alternatively, the index and query chains can be different:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
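The two chains can even use different tokenizers. The sketch below is modelled on the descendent_path field type from Solr's sample schemas; it is reproduced from memory, so treat the exact name and attributes as assumptions rather than a verbatim copy.

```xml
<fieldType name="descendent_path" class="solr.TextField">
  <!-- Index time: a path such as /a/b/c is expanded into
       the tokens /a, /a/b and /a/b/c. -->
  <analyzer type="index">
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/"/>
  </analyzer>
  <!-- Query time: the path is kept as a single token, so a query
       for /a/b matches every document stored at or below /a/b. -->
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>
```

Here the asymmetry is the whole point: the index side does the expansion, and the query side only has to supply one exact term.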
Finally, there is a third - usually hidden - chain type, which is used for multiterm analysis (queries like term* and [term1 TO term2]). It is hidden because it is usually constructed automatically from the explicitly defined chain, using only the components that are multiterm-aware. They are marked with (multi) in the lists above. The primary use case is to ensure that case-insensitive matches work as expected even when wildcards are used. You can read a more complete explanation in the Solr Wiki.
To use it, add an <analyzer type="multiterm"> section next to the index and query sections in the analyzer chain definition.
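A minimal sketch of such a definition might look as follows. The field type name text_multiterm_demo and the choice of components are illustrative assumptions; the pattern of a keyword tokenizer followed by lowercasing is a common one for this chain, not the only valid setup.

```xml
<fieldType name="text_multiterm_demo" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <!-- Applied to wildcard, prefix and range terms: the term is kept whole
       and only lowercased; stemming is deliberately left out. -->
  <analyzer type="multiterm">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Declaring the chain explicitly is mainly useful when the automatically derived one does not do what is needed; otherwise the section can be omitted.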
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
However, the analyzer class names have to be provided in full for legacy reasons. Also, non-core components require the full class name, including the package name.