Class OpenNLPDocumentParser

  • All Implemented Interfaces:
    DocumentParser, java.util.function.Function<java.lang.String,​Document>

    public class OpenNLPDocumentParser
    extends java.lang.Object
    implements DocumentParser
    A DocumentParser implementation based on OpenNLP.
    Since:
    5.2
    Version:
    5.2
    Author:
    Pedro Oliveira
    • Constructor Summary

      Constructors 
      Constructor Description
      OpenNLPDocumentParser​(opennlp.tools.sentdetect.SentenceDetector theSentenceDetector, opennlp.tools.tokenize.Tokenizer theTokenizer)  
    • Method Summary

      Modifier and Type Method Description
      void add​(opennlp.tools.namefind.TokenNameFinder theNameFinder)  
      Document apply​(java.lang.String theText)
      Parses the given text into a Document. The call is synchronized because NameFinderMEs are not thread-safe and share internal state between calls.
      static opennlp.tools.chunker.ChunkerME chunker​(java.io.File theModel)  
      static void clearSharedModels()
      Allows shared models to be garbage-collected, since they can have a large memory footprint.
      static opennlp.tools.lemmatizer.DictionaryLemmatizer dictionaryLemmatizer​(java.io.File theDictionary)  
      static opennlp.tools.namefind.DictionaryNameFinder dictNameFinder​(java.io.File theModel, java.lang.String theType)  
      static OpenNLPDocumentParser getDefault​(Connection theConnection)
      Lazily loads OpenNLPDocumentParser models from the given database configuration.
      static opennlp.tools.lemmatizer.LemmatizerME lemmatizer​(java.io.File theModel)  
      static OpenNLPDocumentParser loadFrom​(java.io.File theDirectory)
      Loads OpenNLP models, in their default name formats, from the given directory.
      static opennlp.tools.namefind.NameFinderME nameFinder​(java.io.File theModel)  
      static opennlp.tools.postag.POSTaggerME posTagger​(java.io.File theModel)  
      static opennlp.tools.sentdetect.SentenceDetectorME sentenceDetector​(java.io.File theModel)  
      void set​(opennlp.tools.chunker.Chunker theChunker)  
      void set​(opennlp.tools.lemmatizer.Lemmatizer theLemmatizer)  
      void set​(opennlp.tools.postag.POSTagger thePOSTagger)  
      static opennlp.tools.tokenize.TokenizerME tokenizer​(java.io.File theModel)  
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
      • Methods inherited from interface java.util.function.Function

        andThen, compose
    • Constructor Detail

      • OpenNLPDocumentParser

        public OpenNLPDocumentParser​(opennlp.tools.sentdetect.SentenceDetector theSentenceDetector,
                                     opennlp.tools.tokenize.Tokenizer theTokenizer)
    • Method Detail

      • add

        public void add​(opennlp.tools.namefind.TokenNameFinder theNameFinder)
      • set

        public void set​(opennlp.tools.postag.POSTagger thePOSTagger)
      • set

        public void set​(opennlp.tools.lemmatizer.Lemmatizer theLemmatizer)
      • set

        public void set​(opennlp.tools.chunker.Chunker theChunker)
      • apply

        public Document apply​(java.lang.String theText)
        Parses the given text into a Document. NameFinderMEs are not thread-safe and share internal state between calls; they can be used safely from another thread only after clearAdaptiveData is called. Because this method is essentially one loop, it is simpler and safer to synchronize the method as a whole. The alternative would be to cache the TokenNameFinderModel, which is thread-safe, and create a new NameFinderME on each call, but those objects are heavy and create many other complex objects.
        Specified by:
        apply in interface java.util.function.Function<java.lang.String,​Document>
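        The synchronization choice described above can be illustrated with a minimal sketch. It does not use OpenNLP or reproduce this class's code: StatefulFinder and SynchronizedParser are hypothetical stand-ins, where the finder keeps state between calls (as a NameFinderME does) and the parser synchronizes the whole apply method and resets that state before returning.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical stand-in for a stateful finder such as NameFinderME:
// it remembers tokens from earlier calls until clearAdaptiveData() is invoked.
class StatefulFinder {
    private final Set<String> seen = new HashSet<>();

    List<String> find(String[] tokens) {
        List<String> hits = new ArrayList<>();
        for (String token : tokens) {
            // A capitalized token is reported only the first time it is seen.
            if (!token.isEmpty() && Character.isUpperCase(token.charAt(0)) && seen.add(token)) {
                hits.add(token);
            }
        }
        return hits;
    }

    void clearAdaptiveData() {
        seen.clear();
    }
}

// Hypothetical parser: apply is synchronized as a whole, so the shared
// finder state is never touched by two threads at once.
class SynchronizedParser implements Function<String, List<String>> {
    private final StatefulFinder finder = new StatefulFinder();

    @Override
    public synchronized List<String> apply(String text) {
        List<String> names = new ArrayList<>();
        for (String sentence : text.split("\\.\\s*")) {
            names.addAll(finder.find(sentence.trim().split("\\s+")));
        }
        // Reset per-document state so the next caller starts clean.
        finder.clearAdaptiveData();
        return names;
    }
}
```

        The trade-off in this sketch is the one the description names: callers are serialized, but no heavy finder object has to be created per call.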
      • clearSharedModels

        public static void clearSharedModels()
        Allows shared models to be garbage-collected, since they can have a large memory footprint.
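        As a rough illustration of the idea, the sketch below shows one way a lazily filled, clearable cache of shared models could look. It is an assumption for illustration only, not this class's implementation: the SharedModels class, its get method, and the use of a plain Object as a "model" are all hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical cache of shared models, keyed by name.
class SharedModels {
    private static final Map<String, Object> CACHE = new ConcurrentHashMap<>();

    // Loads the model on first request only; later requests reuse the cached instance.
    static Object get(String name, Supplier<Object> loader) {
        return CACHE.computeIfAbsent(name, key -> loader.get());
    }

    // Drops all cached references so the models become eligible for garbage collection.
    static void clearSharedModels() {
        CACHE.clear();
    }
}
```

        After a clear, the next request for a model loads it again, so clearing trades memory for a later reload cost.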
      • loadFrom

        public static OpenNLPDocumentParser loadFrom​(java.io.File theDirectory)
                                              throws java.io.IOException
        Loads OpenNLP models, in their default name formats, from the given directory, e.g. a folder containing the files ['en-sent.bin', 'en-token.bin', 'en-ner-organization.bin', 'en-ner-person.bin'].
        Throws:
        java.io.IOException
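        To make the "default name formats" concrete, here is a small sketch that classifies file names shaped like the four examples above. The pattern (language, kind, optional entity type, ".bin") is inferred from those examples only, and the ModelFileNames class and its result strings are hypothetical; the actual matching rules of loadFrom may differ and may cover further model kinds.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that recognizes names like en-sent.bin or en-ner-person.bin.
class ModelFileNames {
    // Assumed pattern: <language>-<kind>[-<entity type>].bin
    private static final Pattern DEFAULT_NAME =
        Pattern.compile("([a-z]+)-(sent|token|ner)(?:-([a-z]+))?\\.bin");

    static Optional<String> classify(String fileName) {
        Matcher m = DEFAULT_NAME.matcher(fileName);
        if (!m.matches()) {
            return Optional.empty();
        }
        String kind = m.group(2);
        String entityType = m.group(3);
        if (kind.equals("ner")) {
            // A name finder model must carry an entity type, e.g. "person".
            return entityType == null
                ? Optional.empty()
                : Optional.of("nameFinder:" + entityType);
        }
        if (entityType != null) {
            // Sentence and token models take no entity type in this sketch.
            return Optional.empty();
        }
        return Optional.of(kind.equals("sent") ? "sentenceDetector" : "tokenizer");
    }
}
```

        Under these assumptions, 'en-ner-person.bin' would be treated as a name finder model for the type "person", and unrelated files in the directory would be ignored.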
      • sentenceDetector

        public static opennlp.tools.sentdetect.SentenceDetectorME sentenceDetector​(java.io.File theModel)
                                                                            throws java.io.IOException
        Throws:
        java.io.IOException
      • tokenizer

        public static opennlp.tools.tokenize.TokenizerME tokenizer​(java.io.File theModel)
                                                            throws java.io.IOException
        Throws:
        java.io.IOException
      • nameFinder

        public static opennlp.tools.namefind.NameFinderME nameFinder​(java.io.File theModel)
                                                              throws java.io.IOException
        Throws:
        java.io.IOException
      • dictNameFinder

        public static opennlp.tools.namefind.DictionaryNameFinder dictNameFinder​(java.io.File theModel,
                                                                                 java.lang.String theType)
                                                                          throws java.io.IOException
        Throws:
        java.io.IOException
      • posTagger

        public static opennlp.tools.postag.POSTaggerME posTagger​(java.io.File theModel)
                                                          throws java.io.IOException
        Throws:
        java.io.IOException
      • chunker

        public static opennlp.tools.chunker.ChunkerME chunker​(java.io.File theModel)
                                                       throws java.io.IOException
        Throws:
        java.io.IOException
      • lemmatizer

        public static opennlp.tools.lemmatizer.LemmatizerME lemmatizer​(java.io.File theModel)
                                                                throws java.io.IOException
        Throws:
        java.io.IOException
      • dictionaryLemmatizer

        public static opennlp.tools.lemmatizer.DictionaryLemmatizer dictionaryLemmatizer​(java.io.File theDictionary)
                                                                                  throws java.io.IOException
        Throws:
        java.io.IOException