be.re.repo.mod
Class ZippedDocumentTextExtractor

java.lang.Object
  extended by be.re.repo.mod.ZippedDocumentTextExtractor

public class ZippedDocumentTextExtractor
extends Object

Extracts text from zipped documents whose entries are XML, such as ODF, ePub and Office Open XML. The documents are processed in a streaming fashion.

Author:
Werner Donné

Nested Class Summary
static interface ZippedDocumentTextExtractor.FilterFactory
           
 
Constructor Summary
ZippedDocumentTextExtractor()
           
 
Method Summary
static Reader create(InputStream in, ZippedDocumentTextExtractor.FilterFactory filterFactory, String[] entryPatterns)
          Retrieves text from a document.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

ZippedDocumentTextExtractor

public ZippedDocumentTextExtractor()
Method Detail

create

public static Reader create(InputStream in,
                            ZippedDocumentTextExtractor.FilterFactory filterFactory,
                            String[] entryPatterns)
                     throws IOException
Retrieves the text of a zipped document as a character stream.

Parameters:
in - the original document stream.
filterFactory - a factory that creates a filter which selects the elements that contribute to the text, or which transforms the text. It may be null.
entryPatterns - the regular expressions that select the ZIP entries by name. If the array is empty, no entries are selected at all.
Returns:
The extracted text stream.
Throws:
IOException
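
Example:
The entryPatterns semantics described above can be illustrated with a minimal, self-contained sketch built only on java.util.zip and java.util.regex. It does not use this class. EntryPatternSketch, selectEntryNames and sampleZip are hypothetical names introduced for illustration, and the sketch assumes that a pattern must match the whole entry name; whether the real implementation uses a full or a partial match is not stated in this documentation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration, not part of be.re.repo.mod.
class EntryPatternSketch {

  // Streams through the ZIP entries once and returns the names that are
  // selected by at least one of the regular expressions.
  static List<String> selectEntryNames(InputStream in, String[] entryPatterns) {
    List<Pattern> patterns = new ArrayList<>();
    for (String expression : entryPatterns) {
      patterns.add(Pattern.compile(expression));
    }

    List<String> selected = new ArrayList<>();
    try (ZipInputStream zip = new ZipInputStream(in)) {
      ZipEntry entry;
      while ((entry = zip.getNextEntry()) != null) {
        String name = entry.getName();
        // Assumption: the whole entry name has to match a pattern.
        if (patterns.stream().anyMatch(p -> p.matcher(name).matches())) {
          selected.add(name);
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return selected;
  }

  // Builds a small in-memory ZIP with the given entry names for the demo.
  static byte[] sampleZip(String... names) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
      for (String name : names) {
        zip.putNextEntry(new ZipEntry(name));
        zip.write("<p>text</p>".getBytes(StandardCharsets.UTF_8));
        zip.closeEntry();
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  public static void main(String[] args) {
    byte[] odf = sampleZip("mimetype", "content.xml", "styles.xml", "meta.xml");

    // One pattern selects only the main content entry: prints [content.xml]
    System.out.println(selectEntryNames(
        new ByteArrayInputStream(odf), new String[] {"content\\.xml"}));

    // An empty array selects nothing: prints []
    System.out.println(selectEntryNames(
        new ByteArrayInputStream(odf), new String[0]));
  }
}
```

Going by the signature documented above, the corresponding call on this class for an ODF document would presumably be ZippedDocumentTextExtractor.create(in, null, new String[] {"content\\.xml"}), with the null filter factory permitted by the parameter description.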