Keywords extractor

The Keywords extractor is a tool that extracts keywords from the "title_morphosyntax" and "content_morphosyntax" fields of a tkeir document. This tool is a REST service: its API is described in the API section and its configuration file in the Configuration section.

Example of Configuration:

    "logger": {
        "logging-level": "{{ project.loglevel }}"
    "keywords": {
            "resources-base-path":"{{ project.path }}/configs",
        "network": {
            "associate-environment": {

The Keywords extractor configuration is an aggregation of the network configuration, the serialize configuration, the runtime configuration (in field converter), and the logger (at top level). The extractor also lets you define validation rules for keywords:

  • language: the language of the tokenizer
  • resources-base-path: the path to the resources (containing files created by tools)
  • keywords-rules: validation rules
  • prunning: maximum number of words in a keyword sequence

Keyword rules filter and validate candidate keywords according to their POS tags.

Example of Configuration:

    "keywords-pos-validation":{"possible-pos-in-syntagm":["PROPN","ADJ","NOUN","ADV", "PART"],"at-least":["PROPN","NOUN","ADJ"]},    
        "suppress-bounds-sw": true,
        "pos-to-suppress": ["ADP", "ADV", "AUX", "CONJ", "CCONJ", "DET", "INTJ", "PART","SCONJ", "SYM", "SPACE", "X", "PRON", "PUNCT","SYM"]

The validation rule fields:

  • possible-pos-in-syntagm: the list of POS tags accepted in the syntagm associated with the named entity
  • at-least: the minimal set of POS tags; the syntagm must contain at least one of them
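
The rules above can be read as a filter over a POS-tagged candidate. The following is a minimal sketch, not the tkeir implementation. It assumes that a candidate is a list of (token, POS) pairs, that "at-least" means at least one of the listed tags must be present, that "suppress-bounds-sw" trims tokens at both ends whose tag is in "pos-to-suppress", and that "prunning" caps the number of words. The function name validate_keyword and the "prunning" value are hypothetical.

```python
RULES = {
    "prunning": 3,  # hypothetical value, not taken from the documentation
    "keywords-pos-validation": {
        "possible-pos-in-syntagm": ["PROPN", "ADJ", "NOUN", "ADV", "PART"],
        "at-least": ["PROPN", "NOUN", "ADJ"],
    },
    "suppress-bounds-sw": True,
    "pos-to-suppress": ["ADP", "ADV", "AUX", "CONJ", "CCONJ", "DET", "INTJ",
                        "PART", "SCONJ", "SYM", "SPACE", "X", "PRON", "PUNCT"],
}


def validate_keyword(candidate, rules=RULES):
    """Return the cleaned candidate as a list of words, or None if rejected.

    `candidate` is a list of (word, pos) pairs (assumed input shape).
    """
    tokens = list(candidate)
    if rules.get("suppress-bounds-sw"):
        drop = set(rules["pos-to-suppress"])
        # Trim leading and trailing tokens whose POS tag is in the suppress list.
        while tokens and tokens[0][1] in drop:
            tokens.pop(0)
        while tokens and tokens[-1][1] in drop:
            tokens.pop()
    # Reject empty candidates and candidates longer than the pruning limit.
    if not tokens or len(tokens) > rules["prunning"]:
        return None
    validation = rules["keywords-pos-validation"]
    tags = [pos for _, pos in tokens]
    # Every tag must be allowed in the syntagm.
    if any(tag not in validation["possible-pos-in-syntagm"] for tag in tags):
        return None
    # At least one tag must belong to the "at-least" list.
    if not any(tag in validation["at-least"] for tag in tags):
        return None
    return [word for word, _ in tokens]
```

Under these assumptions, validate_keyword([("the", "DET"), ("neural", "ADJ"), ("network", "NOUN")]) keeps "neural network" after trimming the determiner, while a candidate containing a verb is rejected.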

Configure Keywords extractor logger

The logger is configured at the top level of the JSON, in the logger field.

Example of Configuration:

logger configuration
    "logger": {
        "logging-level": "debug"

The logger field is:

  • logging-level

It can be set to the following values:

  • debug for the debug level and developer information
  • info for the information level
  • warning to display only warnings and errors
  • error to display only errors
  • critical to display only critical errors
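
As an illustration, the five values map directly onto the levels of Python's standard logging module. This is a minimal sketch that assumes the configuration has already been loaded into a dictionary. The function name configure_logger and the logger name are hypothetical, and this is not necessarily how tkeir wires its logger.

```python
import logging

# The five values accepted by "logging-level", mapped to standard levels.
LEVELS = {
    "debug": logging.DEBUG,
    "info": logging.INFO,
    "warning": logging.WARNING,
    "error": logging.ERROR,
    "critical": logging.CRITICAL,
}


def configure_logger(config, name="keywords-extractor"):
    """Create a logger whose level follows config["logger"]["logging-level"]."""
    level_name = config["logger"]["logging-level"]
    if level_name not in LEVELS:
        raise ValueError(f"unknown logging-level: {level_name!r}")
    logger = logging.getLogger(name)
    logger.setLevel(LEVELS[level_name])
    return logger


logger = configure_logger({"logger": {"logging-level": "debug"}})
```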

Configure Keywords extractor Network

Example of Configuration:

network configuration
    "network": {
        "associate-environment": {

The network fields:

  • host : hostname

  • port : port of the service

  • associate-environment : names the environment variables that override the default "host" and "port" values. This field is not mandatory.

      • "host" : name of the environment variable that overrides "host"

      • "port" : name of the environment variable that overrides "port"


  • cert : certificate file

  • key : key file
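
The override mechanism described above can be sketched as follows. This is only an illustration under assumptions: the default values, the environment variable names (KEYWORD_HOST, KEYWORD_PORT) and the function name resolve_network are hypothetical, and the actual tkeir loader may behave differently.

```python
import os


def resolve_network(network, environ=os.environ):
    """Return host and port, letting associated environment variables win."""
    env_names = network.get("associate-environment", {})
    resolved = {}
    for field in ("host", "port"):
        value = network.get(field)
        env_name = env_names.get(field)
        # Use the environment variable only if it is declared and set.
        if env_name and env_name in environ:
            value = environ[env_name]
        resolved[field] = value
    if resolved["port"] is not None:
        # Environment variables are strings, so normalize the port to int.
        resolved["port"] = int(resolved["port"])
    return resolved


network = {
    "host": "0.0.0.0",  # hypothetical default
    "port": 8080,       # hypothetical default
    "associate-environment": {
        "host": "KEYWORD_HOST",  # hypothetical variable name
        "port": "KEYWORD_PORT",  # hypothetical variable name
    },
}

# Only the port variable is set here, so the host keeps its default.
settings = resolve_network(network, {"KEYWORD_PORT": "10006"})
```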

Configure Keywords extractor runtime

Example of Configuration:

runtime configuration

The Runtime fields:

  • request-max-size : how big a request may be (bytes)

  • request-buffer-queue-size: request streaming buffer queue size

  • request-timeout : how long a request can take to arrive (sec)

  • response-timeout : how long a response can take to process (sec)

  • keep-alive: enables or disables keep-alive

  • keep-alive-timeout: how long to hold a TCP connection open (sec)

  • graceful-shutdown_timeout : how long to wait to force close non-idle connection (sec)

  • workers : number of workers for the service on a node

  • associate-environment : for each of the previous fields, the name of the environment variable that overrides the default value. This field is not mandatory.

      • request-max-size : overridden by the associated environment variable
      • request-buffer-queue-size: overridden by the associated environment variable
      • request-timeout : overridden by the associated environment variable
      • response-timeout : overridden by the associated environment variable
      • keep-alive: overridden by the associated environment variable
      • keep-alive-timeout: overridden by the associated environment variable
      • graceful-shutdown_timeout : overridden by the associated environment variable
      • workers : overridden by the associated environment variable
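
Because environment variables are always strings, a loader has to convert them back to the type of each runtime field. The sketch below illustrates one way to do this. The types chosen for each field, the example values, the variable names and the function name resolve_runtime are all assumptions made for illustration, not the tkeir implementation.

```python
import os


def parse_bool(raw):
    """Interpret common textual spellings of a boolean."""
    return str(raw).strip().lower() in ("1", "true", "yes")


# Assumed type of each runtime field (not confirmed by the documentation).
RUNTIME_TYPES = {
    "request-max-size": int,
    "request-buffer-queue-size": int,
    "request-timeout": float,
    "response-timeout": float,
    "keep-alive": parse_bool,
    "keep-alive-timeout": float,
    "graceful-shutdown_timeout": float,
    "workers": int,
}


def resolve_runtime(runtime, environ=os.environ):
    """Return runtime settings with environment overrides applied and typed."""
    env_names = runtime.get("associate-environment", {})
    resolved = {}
    for field, cast in RUNTIME_TYPES.items():
        env_name = env_names.get(field)
        if env_name and env_name in environ:
            # The override comes in as a string and is cast to the field type.
            resolved[field] = cast(environ[env_name])
        elif field in runtime:
            resolved[field] = runtime[field]
    return resolved


runtime = {
    "request-timeout": 60,  # hypothetical value
    "keep-alive": True,     # hypothetical value
    "workers": 1,           # hypothetical value
    "associate-environment": {
        "keep-alive": "KEYWORD_KEEP_ALIVE",  # hypothetical variable name
        "workers": "KEYWORD_WORKERS",        # hypothetical variable name
    },
}

settings = resolve_runtime(
    runtime, {"KEYWORD_WORKERS": "4", "KEYWORD_KEEP_ALIVE": "false"}
)
```

Fields that are neither configured nor overridden are simply left out of the result, so the service defaults would apply to them.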

Keywords extractor service

To run the service, type the following command from the tkeir directory:

python3 thot/ --config=<path to keywords configuration file>

or, if you installed the tkeir wheel:

tkeir-keywordextractor-svc --config=<path to keywords configuration file>

A light client can be run with the command:

python3 thot/ --config=<path to keywords configuration file> --input=<input directory> --output=<output directory>

or, if you installed the tkeir wheel:

tkeir-keywordextractor-client --config=<path to keywords configuration file> --input=<input directory> --output=<output directory>

Keywords extractor Tests

The Keywords extractor service comes with unit and functional tests.

Keywords Unit tests

The unit tests cover the Keywords extractor classes only.

python3 -m unittest thot/tests/unittests/

Keywords extractor Functional tests

python3 -m unittest thot/tests/functional_tests/