This page last changed on Dec 21, 2009 by ggaskell.

Confluence can sometimes have problems indexing large Microsoft Excel or PowerPoint attachments. Reindexing such documents may also produce 'Unknown Ptg' warning messages, which are harmless. There is already a request to suppress these warnings when the POI library re-indexes unreadable documents.

The error is usually not serious, but it can cause problems when large attachments are involved. In that case, you may want to disable indexing for a particular type of document.

To do this, you can use one of the methods described below.

Method 1: Using the Administration Console

Go to Administration -> Configuration -> Plugins and disable the relevant modules of the Attachment Extractors or Office Connector plugin:

  • To disable the indexing of PDF attachments, go to the Attachment Extractors plugin and disable the following module:
    • PDF Content Extractor — For PDF attachments

  • To disable the indexing of Office attachments, go to the Office Connector plugin and disable the following modules as required:
    • Word Content Extractor — For Word 97/2007 (.doc and .docx) attachments
    • PowerPoint 97 Content Extractor — For PowerPoint 97 (.ppt) attachments
    • PowerPoint 2007 Content Extractor — For PowerPoint 2007 (.pptx) attachments
    • Excel 97 Content Extractor — For Excel 97 (.xls) attachments
    • Excel 2007 Content Extractor — For Excel 2007 (.xlsx) attachments
Once a module is disabled, searches will ignore all attachments of the corresponding type.

Method 2: Editing the atlassian-plugin.xml files of plugins

You need to modify the content of the atlassian-plugin.xml file in the following JAR files and comment out the relevant file type extractor:

  • confluence-attachment-extractors-x.x.jar (for PDF) or
  • OfficeConnector-x.x.jar (for Office files)

Both of these JAR files are located in the confluence\WEB-INF\classes\com\atlassian\confluence\setup\atlassian-bundled-plugins.zip file.

If you are unfamiliar with modifying JAR files, please refer to the Editing Files within JAR Archives document for further information.
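Before editing anything, it can help to inspect the current descriptor. The following is a minimal sketch of a hypothetical helper (read_plugin_descriptor is not part of Confluence) that reads atlassian-plugin.xml from a JAR nested inside the bundled plugins zip. It assumes the descriptor sits at the root of the JAR and is UTF-8 encoded; it only reads and never modifies the archives.

```python
import io
import zipfile


def read_plugin_descriptor(bundle_zip, jar_prefix):
    """Return (jar_entry_name, xml_text) for the first JAR in bundle_zip
    whose file name starts with jar_prefix.

    bundle_zip may be a path or a file-like object.
    Assumption: atlassian-plugin.xml is stored at the root of the JAR.
    """
    with zipfile.ZipFile(bundle_zip) as bundle:
        for entry in bundle.namelist():
            base = entry.rsplit("/", 1)[-1]
            if base.startswith(jar_prefix) and base.endswith(".jar"):
                # A JAR is itself a zip archive, so it can be opened from memory.
                with zipfile.ZipFile(io.BytesIO(bundle.read(entry))) as jar:
                    xml_text = jar.read("atlassian-plugin.xml").decode("utf-8")
                return entry, xml_text
    raise FileNotFoundError(f"no JAR starting with {jar_prefix!r} in the bundle")
```

For example, calling read_plugin_descriptor with the path to atlassian-bundled-plugins.zip and the prefix "confluence-attachment-extractors-" would return the PDF extractor descriptor, and the prefix "OfficeConnector-" the Office one. Writing a modified descriptor back into the JAR is a separate step, covered by the Editing Files within JAR Archives document.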

You can identify file type extractors in atlassian-plugin.xml files by the occurrence of ContentExtractor in their key attribute.
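As an illustration of that rule, the sketch below lists the extractor elements whose key contains ContentExtractor. The function name find_content_extractors is hypothetical, and the sample descriptor is a trimmed copy of the Attachment Extractors file. Because the XML parser skips comments, only extractors that are still active are reported.

```python
import xml.etree.ElementTree as ET


def find_content_extractors(plugin_xml):
    """Return (key, name) pairs for active <extractor> elements whose
    key attribute contains 'ContentExtractor'."""
    root = ET.fromstring(plugin_xml)
    return [
        (el.get("key"), el.get("name"))
        for el in root.iter("extractor")
        if "ContentExtractor" in el.get("key", "")
    ]


SAMPLE = """<atlassian-plugin key="com.atlassian.confluence.plugins.attachmentExtractors" name="Attachment Extractors">
    <extractor name="PDF Content Extractor" key="pdfContentExtractor" class="com.atlassian.bonnie.search.extractor.PdfContentExtractor" priority="1100">
        <description>Indexes contents of PDF files</description>
    </extractor>
</atlassian-plugin>"""

print(find_content_extractors(SAMPLE))
# -> [('pdfContentExtractor', 'PDF Content Extractor')]
```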

Once the ContentExtractor for a file type is disabled, all files of that type become unsearchable.

The example below shows the pdfContentExtractor commented out, which prevents PDF attachments from being indexed.

<atlassian-plugin key="com.atlassian.confluence.plugins.attachmentExtractors" name="Attachment Extractors">
    <plugin-info>
        <description>This plugin extracts searchable text from various attachment types.</description>
        <version>1.1</version>
        <vendor name="Atlassian Pty Ltd" url="http://www.atlassian.com/"/>
    </plugin-info>

    <!--
    <extractor name="PDF Content Extractor" key="pdfContentExtractor" class="com.atlassian.bonnie.search.extractor.PdfContentExtractor" priority="1100">
        <description>Indexes contents of PDF files</description>
    </extractor>
    -->

</atlassian-plugin>

The following table shows the file type extractors in the atlassian-plugin.xml of the OfficeConnector-x.x.jar file. Comment out the extractor for each attachment type that you do not want indexed:

Type of attachment: Word 97/2007 (.doc and .docx)
File type extractor:
<extractor name="Word Content Extractor" key="wordContentExtractor" class="com.atlassian.confluence.extra.officeconnector.index.word.WordTextExtractor" priority="1099">
    <description>Indexes contents of Word 97/2007 files</description>
</extractor>

Type of attachment: PowerPoint 97 (.ppt)
File type extractor:
<extractor name="PowerPoint 97 Content Extractor" key="ppt97ContentExtractor" class="com.atlassian.confluence.extra.officeconnector.index.powerpoint.PowerPointTextExtractor" priority="1099">
    <description>Indexes contents of PowerPoint 97 files</description>
</extractor>

Type of attachment: PowerPoint 2007 (.pptx)
File type extractor:
<extractor name="PowerPoint 2007 Content Extractor" key="ppt2k7ContentExtractor" class="com.atlassian.confluence.extra.officeconnector.index.powerpoint.PowerPointXMLTextExtractor" priority="1099">
    <description>Indexes contents of PowerPoint 2007 files</description>
</extractor>

Type of attachment: Excel 97 (.xls)
File type extractor:
<extractor name="Excel 97 Content Extractor" key="excel97ContentExtractor" class="com.atlassian.confluence.extra.officeconnector.index.excel.ExcelTextExtractor" priority="1099">
    <description>Indexes contents of Excel 97 files</description>
</extractor>

Type of attachment: Excel 2007 (.xlsx)
File type extractor:
<extractor name="Excel 2007 Content Extractor" key="excel2k7ContentExtractor" class="com.atlassian.confluence.extra.officeconnector.index.excel.ExcelXMLTextExtractor" priority="1099">
    <description>Indexes contents of Excel 2007 files</description>
</extractor>
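
If you prefer to script the change rather than edit the file by hand, the following sketch wraps one extractor element in an XML comment, selected by its key. The function comment_out_extractor is hypothetical and works on the descriptor text only. It assumes the element is written with separate opening and closing tags, as in the entries above, and that its body contains no comment or double hyphen. Extractors that are already inside a comment are left alone.

```python
import re


def comment_out_extractor(plugin_xml, key):
    """Return plugin_xml with the active <extractor key="..."> element
    wrapped in an XML comment. Raises KeyError if no active match exists."""
    # Spans of existing comments, so an already disabled extractor is skipped.
    comment_spans = [
        m.span() for m in re.finditer(r"<!--.*?-->", plugin_xml, re.DOTALL)
    ]
    pattern = re.compile(
        r'<extractor\b[^>]*\bkey="' + re.escape(key) + r'"[^>]*>.*?</extractor>',
        re.DOTALL,
    )
    for match in pattern.finditer(plugin_xml):
        inside_comment = any(
            start <= match.start() and match.end() <= end
            for start, end in comment_spans
        )
        if not inside_comment:
            return (
                plugin_xml[: match.start()]
                + "<!--\n" + match.group(0) + "\n-->"
                + plugin_xml[match.end():]
            )
    raise KeyError(f"no active extractor with key {key!r}")


# Example using the Excel 97 entry shown above.
EXCEL_97 = (
    '<extractor name="Excel 97 Content Extractor" key="excel97ContentExtractor" '
    'class="com.atlassian.confluence.extra.officeconnector.index.excel.ExcelTextExtractor" '
    'priority="1099">\n'
    "    <description>Indexes contents of Excel 97 files</description>\n"
    "</extractor>"
)
print(comment_out_extractor(EXCEL_97, "excel97ContentExtractor"))
```

The result still has to be written back into the JAR and the JAR into the zip file, as described earlier in Method 2.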
Document generated by Confluence on Jul 09, 2010 01:11