FTR - Full Text Retrieval


  
Applies To

Product(s): eB Server
Version(s): All
Environment: N/A
Area: FTR
Subarea: N/A
Original Author: Rich Thomas, Bentley TSG

To eliminate any confusion: Full Text Retrieval, or FTR, is the ability to search the content of electronic files that are added to eB.

The files are ‘read’ by the Microsoft Indexing Service, which, depending on the file type, may require the assistance of iFilters. The information gathered is stored in an Indexing Service catalog, which eB can search.

So, which files can be FTR’ed?

Please run the following SQL:

-- List each supported file extension with the engine that processes it
SELECT f.file_ext AS format, e.engine_name AS engine
FROM m3_supported_ftr_formats sff
JOIN m3_formats f ON f.format_id = sff.format_id
JOIN m3_engines e ON e.engine_id = sff.engine_id
ORDER BY format;

The result is a list of file extensions, each paired with the engine that handles it.

There are three engines available in versions prior to 15.6.1:

FTR – The file is assumed to be readable by the indexing service (with or without a relevant iFilter), and the file is added to an FTR repository.

OCR – The file may be an image file that cannot be read by the Microsoft Indexing Service, so it must first go through an OCR process. The resultant txt file is added to the FTR repository, where it can be read by the Microsoft Indexing Service.

SPICER – The OCR process can only scan certain file types. Some file types that cannot be read by the Microsoft Indexing Service first need to be converted to a file type that can be OCR’ed. For example, a DGN file is first converted to a TIFF by the Spicer engine, and the OCR engine then converts the TIFF to a txt file. The resultant txt file is added to the FTR repository, where it can be read by the Microsoft Indexing Service.
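
To see how the supported formats are spread across these engines, a grouped variant of the earlier query may help. This is only a sketch: it uses the same tables and columns as the query above, and the format_count alias is made up for illustration.

```sql
-- Count how many file formats are assigned to each engine.
-- Uses only tables/columns from the listing query; format_count is an illustrative alias.
SELECT e.engine_name AS engine, COUNT(*) AS format_count
FROM m3_supported_ftr_formats sff
JOIN m3_engines e ON e.engine_id = sff.engine_id
GROUP BY e.engine_name
ORDER BY e.engine_name;
```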

In my experience, the defaults that are set in the database are correct, and there is only one consideration to be made, which relates to PDFs. There are two types of PDF: those that are generated by converting from another format, a Word file for example, and those that originate from scanners and are essentially image files with a PDF wrapper.

The difference is that scanned PDFs cannot be read by PDF iFilters. There is nothing to read, because the content is an image; the fact that the image contains words is irrelevant. So if you have scanned PDFs, then in order to perform FTR searches on them you will have to put them through an OCR engine first, thereby converting them to txt files.

NOTE: It is not possible to have scanned PDFs OCR’ed while readable PDFs are added directly to the FTR repository. The engine is chosen purely on the file extension, and there can only be one rule for each file extension.
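
Before changing anything, it can be useful to check which engine PDFs are currently assigned to. The following is a minimal sketch that narrows the listing query to one extension; it assumes the extension is stored as '.PDF', as in the update statement in this article.

```sql
-- Show the engine currently assigned to PDF files.
-- Assumption: the extension is stored as '.PDF' (matching the update statement in this article).
SELECT f.file_ext AS format, e.engine_name AS engine
FROM m3_supported_ftr_formats sff
JOIN m3_formats f ON f.format_id = sff.format_id
JOIN m3_engines e ON e.engine_id = sff.engine_id
WHERE f.file_ext = '.PDF';
```

Because there is one rule per file extension, this should return a single row.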

To change the default behavior of PDFs, please run the following SQL:

-- Route all PDF files through the OCR engine instead of the default
UPDATE m3_supported_ftr_formats
SET engine_id = (SELECT engine_id FROM m3_engines WHERE engine_name = 'OCR')
WHERE format_id = (SELECT format_id FROM m3_formats WHERE file_ext = '.PDF');
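
If the change needs to be undone later, the same statement can be pointed back at the original engine. This sketch assumes the default engine for PDFs was 'FTR'; confirm the original value with the listing query before making any change, as your database may differ.

```sql
-- Revert PDF files to the FTR engine.
-- Assumption: 'FTR' was the original default for '.PDF'; check your own database first.
UPDATE m3_supported_ftr_formats
SET engine_id = (SELECT engine_id FROM m3_engines WHERE engine_name = 'FTR')
WHERE format_id = (SELECT format_id FROM m3_formats WHERE file_ext = '.PDF');
```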

 

The last point to note is that FTR is based on document classes. Only files whose type appears in the list above, and that belong to a specified document class, will be submitted for FTR.