When searching for content based on search terms entered by the user, Confluence splits the text of the content into tokens, and then filters and modifies those tokens according to the following rules.
Tokenisation
Confluence uses Lucene's Standard Tokenizer. This splits the text into tokens as follows:
- Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by white space is considered part of a token.
- Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
- Recognises email addresses and internet host names as one token.
For example, the string 'foo-bar5' is not split into 'foo' and 'bar5', so a search for 'bar5' or 'bar*' will not find any results.
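The rules above can be approximated in a few lines of Python. This is only an illustrative sketch and not Lucene's actual implementation. The function name `tokenize`, the `EMAIL` pattern, and the decision to keep apostrophes inside a token are assumptions made for the example.

```python
import re

# Hypothetical, simplified pattern for an email address (an assumption for this sketch).
EMAIL = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")

def tokenize(text):
    """Rough approximation of the tokenisation rules described above."""
    tokens = []
    for chunk in text.split():
        # Punctuation at the edges is dropped. A trailing dot is followed by
        # white space (or the end of the text), so it is not part of the token.
        chunk = chunk.strip(".,;:!?\"'()[]{}")
        if not chunk:
            continue
        # Email addresses are kept as one token.
        if EMAIL.match(chunk):
            tokens.append(chunk)
            continue
        # Split at punctuation, but keep inner dots, hyphens and apostrophes.
        # Keeping inner dots also keeps host names such as www.example.com whole.
        for piece in re.split(r"[^\w.\-']+", chunk):
            if not piece:
                continue
            has_digit = any(c.isdigit() for c in piece)
            if "-" in piece and not has_digit:
                # No number in the token: split at hyphens.
                tokens.extend(p for p in piece.split("-") if p)
            else:
                # A number is present: treat the whole token as a product number.
                tokens.append(piece)
    return tokens
```

With this sketch, `tokenize("foo-bar5 is a well-known part")` returns `['foo-bar5', 'is', 'a', 'well', 'known', 'part']`: the hyphenated word without a number is split, while 'foo-bar5' stays whole.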
Filtering
Confluence then:
- Removes "'s" from the ends of words.
- Removes the dots from acronyms, e.g. I.B.M. becomes IBM.
- Converts everything to lower case.
- Removes common words like 'the' and 'or'.
- Converts words to their stems. For example, 'fishing' and 'fishes' both become 'fish'.
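As a rough illustration, the filtering steps can be chained in the same order in Python. This is a sketch under stated assumptions, not Confluence's code: `STOP_WORDS` is a small made-up list, and `naive_stem` is a crude suffix stripper standing in for a real stemming algorithm.

```python
import re

# Illustrative stop-word list (an assumption; the real list is longer).
STOP_WORDS = {"the", "or", "a", "an", "and", "of"}

def naive_stem(word):
    """Crude stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def filter_tokens(tokens):
    """Apply the filtering steps listed above, in order."""
    result = []
    for tok in tokens:
        # 1. Remove "'s" from the end of the word.
        if tok[-2:].lower() == "'s":
            tok = tok[:-2]
        # 2. Remove the dots from acronyms such as I.B.M.
        if re.fullmatch(r"(?:[A-Za-z]\.)+[A-Za-z]?", tok):
            tok = tok.replace(".", "")
        # 3. Convert to lower case.
        tok = tok.lower()
        # 4. Drop common words.
        if tok in STOP_WORDS:
            continue
        # 5. Reduce the word to its stem.
        result.append(naive_stem(tok))
    return result
```

For instance, `filter_tokens(["The", "I.B.M.", "fishing", "fishes", "user's"])` returns `['ibm', 'fish', 'fish', 'user']`, which shows why a search for 'fishing' also matches content containing 'fishes'.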