Confluence 2.7 Temp Archive : Text Tokenization and Filtering
This page last changed on Jun 17, 2007 by rosie@atlassian.com.
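Confluence splits the text of content into tokens, and then filters and modifies those tokens according to the following rules.

Tokenization
Confluence uses the Lucene Standard Tokenizer, which splits the text into tokens as follows:
- It splits words at punctuation characters and removes the punctuation. However, a dot that is not followed by whitespace is considered part of a token.
- It splits words at hyphens, unless the token contains a number, in which case the whole token is treated as a product number and is not split.
- It recognizes email addresses and internet hostnames as single tokens.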
Note that because the tokenizer does not split a hyphenated token that contains a number, the string 'foo-bar5' is not split into 'foo' and 'bar5', so a search for 'bar5' or 'bar*' will not find it.

Filtering
Confluence then removes "'s" from the ends of words and removes the dots from acronyms (for example, I.B.M. becomes IBM). Everything is converted to lower case, and common words like 'the' and 'or' are removed. Finally, words are stemmed, so that 'fishing' and 'fishes', for example, both become 'fish'.
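
The tokenization and filtering steps can be approximated in a few lines of Python. This is only an illustrative sketch, not Lucene's or Confluence's actual implementation: the names `tokenize`, `analyze` and `toy_stem`, the regular expressions, and the abbreviated stop-word list are all assumptions made for the example, and `toy_stem` is a crude suffix stripper standing in for a real stemming algorithm. Email address and hostname recognition is left out.

```python
import re

# Hypothetical, abbreviated stop-word list (the real list is longer).
STOP_WORDS = {"a", "an", "and", "or", "the", "of", "to", "in"}

# Runs of letters/digits, optionally joined by a single dot, apostrophe or hyphen.
CHUNK_RE = re.compile(r"[A-Za-z0-9]+(?:[.'-][A-Za-z0-9]+)*")
# Single letters separated by dots, e.g. "I.B.M" (the trailing dot is already gone).
ACRONYM_RE = re.compile(r"(?:[A-Za-z]\.)+[A-Za-z]")


def tokenize(text):
    """Split text into tokens; split at hyphens only if the token has no digit."""
    tokens = []
    for chunk in CHUNK_RE.findall(text):
        if "-" in chunk and not any(c.isdigit() for c in chunk):
            tokens.extend(chunk.split("-"))
        else:
            tokens.append(chunk)  # e.g. 'foo-bar5' stays whole
    return tokens


def strip_possessive(token):
    """Remove a trailing 's."""
    return token[:-2] if token.lower().endswith("'s") else token


def strip_acronym_dots(token):
    """Turn 'I.B.M' into 'IBM'; leave other tokens alone."""
    return token.replace(".", "") if ACRONYM_RE.fullmatch(token) else token


def toy_stem(token):
    """Toy stand-in for a stemmer: strip one common suffix."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Tokenize, then apply the filters in the order described above."""
    terms = []
    for token in tokenize(text):
        token = strip_acronym_dots(strip_possessive(token)).lower()
        if token not in STOP_WORDS:
            terms.append(toy_stem(token))
    return terms


print(tokenize("foo-bar5 foo-bar"))           # ['foo-bar5', 'foo', 'bar']
print(analyze("The I.B.M. fishes"))           # ['ibm', 'fish']
print(analyze("Rosie's foo-bar5 fishing"))    # ['rosie', 'foo-bar5', 'fish']
```

In this sketch 'foo-bar5' reaches the index as a single term, which is why a query for 'bar5' or 'bar*' has nothing to match against.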
Document generated by Confluence on Dec 20, 2007 19:03