Information

SpaCy Rules

When applying a SpaCy pipeline to the input text, the annotations can be filtered by a custom set of rules.
These rules are based on linguistic features, such as:

  • Length of the annotation text
  • Triviality of the words in the annotation text (stopwords)
  • Linguistic features of the words in the annotation text

In order to be part of the output set, an annotation needs to fulfill all filter rules.
Three different rule types are available:

  • length excludes annotations whose character length exceeds a maximum (max) or falls below a minimum (min) threshold.
  • non-stopwords includes only annotations in which any or all word tokens are not stopwords (require).
    Note: Using deny instead causes the service to block annotations without stopwords. (This is not recommended.)
  • linguistics includes only annotations in which any or all word tokens take one of the values in a comma-separated list for a given linguistic feature (require).
    Using deny instead causes the service to block annotations that match the given linguistic features.
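The rule semantics described above can be illustrated with a minimal Python sketch. It is not the service's implementation: the Token class, the function names, and the restriction to the part-of-speech feature are assumptions made only for illustration. Tokens are modeled by hand so that the sketch runs without an installed pipeline.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass
class Token:
    """Hypothetical stand-in for a pipeline token."""
    text: str
    pos: str       # coarse part-of-speech tag, e.g. "NOUN"
    is_stop: bool  # True if the token is a stopword


def quantified(flags: Iterable[bool], quantifier: str) -> bool:
    """Apply the 'any' or 'all' quantifier to per-token results."""
    flags = list(flags)
    return any(flags) if quantifier == "any" else all(flags)


def length_ok(text: str, min_len: int = 0, max_len: int = 10**9) -> bool:
    """length rule: keep texts whose character length lies within the bounds."""
    return min_len <= len(text) <= max_len


def non_stopwords_ok(tokens: List[Token], quantifier: str,
                     mode: str = "require") -> bool:
    """non-stopwords rule: any/all tokens are not stopwords."""
    matched = quantified((not t.is_stop for t in tokens), quantifier)
    return matched if mode == "require" else not matched


def linguistics_ok(tokens: List[Token], quantifier: str, values: Set[str],
                   mode: str = "require") -> bool:
    """linguistics rule, restricted here to the part-of-speech feature."""
    matched = quantified((t.pos in values for t in tokens), quantifier)
    return matched if mode == "require" else not matched


def keep(text: str, tokens: List[Token]) -> bool:
    """An annotation is kept only if it fulfills every rule."""
    return (length_ok(text, min_len=5)
            and non_stopwords_ok(tokens, "any")
            and linguistics_ok(tokens, "any", {"NOUN", "PROPN"}))


tokens = [Token("the", "DET", True), Token("heart", "NOUN", False)]
print(keep("the heart", tokens))  # True
```

In this sketch, deny simply inverts the result of the corresponding require check, which matches the note above that a non-stopwords rule with deny blocks annotations without stopwords.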

In addition to the linguistic rule set, a lemmatization step can lemmatize the text before the entity search. To enable it, add a rule component of type lemmatize to the filter rules.

You can add further pretrained SpaCy pipelines on the settings page. The pipeline must be downloadable through:
python3 -m spacy download PIPELINE_NAME
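
The download command installs a pipeline as a regular Python package, so its availability can be checked from Python. The following is a minimal sketch using only the standard library; the function name pipeline_installed is hypothetical, and the check only confirms that a package of that name is importable, not that it is a valid pipeline.

```python
import importlib.util


def pipeline_installed(name: str) -> bool:
    """Return True if a Python package with this name can be found.

    Downloaded pipelines are installed as packages, so a name such as
    "en_core_web_sm" is found once the download has succeeded.
    """
    return importlib.util.find_spec(name) is not None
```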


Examples:

Look for annotations longer than 4 characters

length min 5


Look for annotations that contain at least one noun or proper noun (NOUN or PROPN)

linguistics any pos NOUN,PROPN require


Look for annotations that do not include a single stopword

non-stopword all require
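

The structure of the example rules can be made explicit with a small parsing sketch in Python. The field order is inferred solely from the three examples above, and the function name parse_rule and the dictionary keys are hypothetical; the service's actual parser may differ. Because the rule type appears both as non-stopwords and as non-stopword in this text, the sketch accepts both spellings.

```python
def parse_rule(line: str) -> dict:
    """Split one rule line into named fields (layout inferred from the examples)."""
    parts = line.split()
    kind = parts[0]
    if kind == "length":
        # length <min|max> <threshold>
        return {"type": "length", "bound": parts[1], "threshold": int(parts[2])}
    if kind in ("non-stopword", "non-stopwords"):
        # non-stopword <any|all> <require|deny>
        return {"type": "non-stopwords", "quantifier": parts[1], "mode": parts[2]}
    if kind == "linguistics":
        # linguistics <any|all> <feature> <value,value,...> <require|deny>
        return {
            "type": "linguistics",
            "quantifier": parts[1],
            "feature": parts[2],
            "values": parts[3].split(","),
            "mode": parts[4],
        }
    raise ValueError(f"unknown rule type: {kind}")


for rule in ("length min 5",
             "linguistics any pos NOUN,PROPN require",
             "non-stopword all require"):
    print(parse_rule(rule))
```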