This page last changed on Jun 18, 2007 by rosie@atlassian.com.

Confluence splits the text of content into tokens, and then filters and modifies those tokens according to the following rules.

Tokenization

Tokenization uses the Lucene Standard Tokenizer, which splits the text into tokens as follows:

  • Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token.
  • Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
  • Recognizes email addresses and internet hostnames as one token.

Note that, because of the hyphen rule, the string 'foo-bar5' is kept as a single token rather than being split into 'foo' and 'bar5', so a search for 'bar5' or 'bar*' will not find it.
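The tokenization rules above can be approximated in a few lines of Python. This is only a rough sketch for reasoning about search results, not the Lucene Standard Tokenizer itself: the function name sketch_tokenize, the EMAIL pattern, and the set of stripped punctuation characters are all assumptions made for illustration, and the real tokenizer handles many more cases.

```python
import re

# Hypothetical, simplified email pattern (an assumption, not Lucene's grammar).
EMAIL = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")


def sketch_tokenize(text):
    """Approximate the three tokenization rules described above."""
    tokens = []
    for chunk in text.split():
        # Punctuation at the edges is dropped; a trailing dot is followed
        # by whitespace, so it is not part of the token.
        chunk = chunk.strip(".,;:!?\"'()[]{}")
        if not chunk:
            continue
        if EMAIL.match(chunk):
            # Email addresses stay as one token.
            tokens.append(chunk)
        elif "-" in chunk and any(c.isdigit() for c in chunk):
            # A hyphenated token containing a number is treated like a
            # product number and is not split.
            tokens.append(chunk)
        else:
            # Split at remaining punctuation (including hyphens), but keep
            # interior dots so hostnames stay whole.
            tokens.extend(p for p in re.split(r"[^\w.]+", chunk) if p)
    return tokens


print(sketch_tokenize("foo-bar5 is wi-fi, see www.example.com."))
# ['foo-bar5', 'is', 'wi', 'fi', 'see', 'www.example.com']
```

In this sketch 'bar5' never appears as a token of its own, which is why a search for 'bar5' does not match content containing 'foo-bar5'.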

Filtering

Confluence then removes "'s" from the ends of words and removes the dots from acronyms; for example, I.B.M. becomes IBM. Everything is converted to lower case, and common words such as 'the' and 'or' are removed. Finally, words are stemmed, so that 'fishing' and 'fishes', for example, both become 'fish'.
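The filtering steps can be sketched in the same way. The code below is a minimal illustration under stated assumptions: sketch_filter and toy_stem are hypothetical names, STOP_WORDS is a tiny illustrative subset rather than the list Confluence actually uses, and toy_stem is a crude suffix stripper standing in for a real stemming algorithm.

```python
import re

# Illustrative subset only; the actual stop word list is an assumption here.
STOP_WORDS = {"the", "or", "a", "an", "and", "of"}

# Matches dotted acronyms such as "I.B.M." (single letters, each followed by a dot).
ACRONYM = re.compile(r"^(?:[A-Za-z]\.)+$")


def toy_stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer, not equivalent to one."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def sketch_filter(tokens):
    """Apply the filtering steps described above, in order."""
    out = []
    for tok in tokens:
        if tok.endswith("'s"):
            tok = tok[:-2]              # remove a trailing 's
        if ACRONYM.match(tok):
            tok = tok.replace(".", "")  # I.B.M. -> IBM
        tok = tok.lower()               # convert to lower case
        if tok in STOP_WORDS:
            continue                    # drop common words
        out.append(toy_stem(tok))       # stem what remains
    return out


print(sketch_filter(["The", "I.B.M.", "Rosie's", "fishing", "or", "fishes"]))
# ['ibm', 'rosie', 'fish', 'fish']
```

Because the same filtering is applied to both the indexed content and the query terms, a search for 'fishes' can match a page that only contains 'fishing'.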

Document generated by Confluence on Oct 10, 2007 18:50