Syntactic tagger
The Syntactic tagger is a tool that extracts syntactic information, such as Subject, Predicate, Object triples, from the "title_tokens" and "content_tokens" fields of a tkeir document. This tool is a REST service: its API is described in the API section and its configuration file in the Configuration section.
Syntactic tagger API
Syntactic tagger configuration
Example of Configuration:
{
"logger": {
"logging-level": "{{ project.loglevel }}"
},
"syntax": {
"taggers":[{
"language":"en",
"resources-base-path":"{{ project.path }}/configs",
"syntactic-rules": "syntactic-rules.json"
}],
"network": {
"host":"0.0.0.0",
"port":10004,
"associate-environment": {
"host":"SYNTAXTAGGER_HOST",
"port":"SYNTAXTAGGER_PORT"
}
},
"runtime":{
"request-max-size":100000000,
"request-buffer-queue-size":100,
"keep-alive":true,
"keep-alive-timeout":500,
"graceful-shutown-timeout":15.0,
"request-timeout":600,
"response-timeout":600,
"workers":1
}
}
}
The Syntactic tagger configuration is an aggregation of the taggers configuration, the network configuration and the runtime configuration (in the "syntax" field), and the logger configuration (at top level).
Syntactic rules define how Subject, Predicate, Object triples are extracted.
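As a rough illustration of this layout, the Python sketch below parses a trimmed copy of the example and pulls out each part. The helper name split_sections and the placeholder values are hypothetical and not part of tkeir; only the field names come from the example.

```python
import json

# Trimmed copy of the example configuration; the values are placeholders.
EXAMPLE = """
{
  "logger": {"logging-level": "debug"},
  "syntax": {
    "taggers": [{"language": "en",
                 "resources-base-path": "/tmp/configs",
                 "syntactic-rules": "syntactic-rules.json"}],
    "network": {"host": "0.0.0.0", "port": 10004},
    "runtime": {"workers": 1}
  }
}
"""

def split_sections(config):
    """Return (logger, taggers, network, runtime) from a parsed configuration.

    Hypothetical helper: it only mirrors the nesting shown in the example.
    """
    syntax = config["syntax"]
    return config["logger"], syntax["taggers"], syntax["network"], syntax["runtime"]

logger, taggers, network, runtime = split_sections(json.loads(EXAMPLE))
print(logger["logging-level"], taggers[0]["language"], network["port"], runtime["workers"])
```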
Example of Configuration:
{
"pattern_syntagm_or_prep_group":{
"rule":[[
{"POS":{"IN":["PREP","DET","ADP","NOUN","PROPN","ADJ","PRON"]},"OP":"*"},
{"POS":{"IN":["SPACE"]},"OP":"*"},
{"POS":{"IN":["NOUN","PROPN","ADJ"]},"OP":"+"}
],
[
{"POS":{"IN":["PREP","DET","ADP","NOUN","PROPN","ADJ","PRON"]},"OP":"*"},
{"POS":{"IN":["SPACE"]},"OP":"*"},
{"POS":{"IN":["NOUN","PROPN","ADJ"]},"OP":"+"},
{"POS":{"IN":["SPACE"]},"OP":"*"},
{"POS":{"IN":["CONJ","CCONJ"]}},
{"POS":{"IN":["SPACE"]},"OP":"*"},
{"POS":{"IN":["NOUN","PROPN","ADJ","DET"]},"OP":"+"}
]],
"type":["subject","object"]
},
"pattern_infinitive_verb":{
"rule":[
[
{"LOWER":"to"},
{"POS":{"IN":["SPACE"]},"OP":"*"},
{"POS":{"IN":["VERB","AUX"]}}
]
],
"type":["predicate"]
},
"pattern_pro":{
"rule":[
[
{"POS":{"IN":["PRON"]}}
]
],
"type":["subject"]
},
"pattern_verb_phrase":{
"rule":[
[
{"POS":{"IN":["VERB","AUX"]},"OP":"+"},
{"POS":{"IN":["SPACE"]},"OP":"*"},
{"POS":{"IN":["ADV","ADP"]},"OP":"?"},
{"POS":{"IN":["SPACE"]},"OP":"*"},
{"POS":{"IN":["VERB","AUX"]},"OP":"*"}
],
[
{"POS":{"IN":["VERB","AUX"]},"OP":"+"},
{"POS":{"IN":["SPACE"]},"OP":"*"},
{"POS":{"IN":["ADV","ADP"]},"OP":"?"},
{"POS":{"IN":["SPACE"]},"OP":"*"},
{"POS":{"IN":["VERB","AUX"]},"OP":"*"},
{"POS":{"IN":["SPACE"]},"OP":"*"},
{"POS":{"IN":["CONJ","CCONJ"]}},
{"POS":{"IN":["SPACE"]},"OP":"*"},
{"POS":{"IN":["VERB","AUX"]},"OP":"+"},
{"POS":{"IN":["SPACE"]},"OP":"*"},
{"POS":{"IN":["ADV","ADP"]},"OP":"?"},
{"POS":{"IN":["SPACE"]},"OP":"*"},
{"POS":{"IN":["VERB","AUX"]},"OP":"*"}
]
],
"type":["predicate"]
},
"conj_rule":{
"rule":
[[{"POS":{"IN":["CONJ", "CCONJ"]},"OP":"+"}]],
"type":["empty"]
},
"link_rule":{
"type":["link"],
"rule":[
{"match-rule":"pattern_verb_phrase", "end-with":"ADP"},
{"match-rule":"pattern_syntagm_or_prep_group", "start-with":"ADP"}
],
"action": {
"on":"span-right",
"shift":"right"
}
},
"available-name-entities": {
"list": ["person","organization",
"location","location.city","location.country",
"product","facility","event",
"money","quantity","date","time","energyterm","financeterm",
"url","email","chemestry"],
"type": ["named-entity-list"]
},
"triple_ner":{
"type":["triple"],
"rule":[
[{"subject":"pattern_syntagm_or_prep_group"}, {"predicate":"pattern_verb_phrase"}, {"object":"pattern_syntagm_or_prep_group"}],
[{"subject":"available-name-entities"}, {"predicate":"pattern_verb_phrase"}, {"object":"available-name-entities"}],
[{"subject":"pattern_pro"}, {"predicate":"pattern_verb_phrase"}, {"object":"pattern_syntagm_or_prep_group"}],
[{"subject":"pattern_pro"}, {"predicate":"pattern_verb_phrase"}, {"object":"available-name-entities"}],
[{"subject":"available-name-entities"}, {"predicate":"pattern_verb_phrase"}, {"object":"pattern_syntagm_or_prep_group"}],
[{"subject":"pattern_syntagm_or_prep_group"}, {"predicate":"pattern_verb_phrase"}, {"object":"available-name-entities"}]
]
},
"settings":{
"suppress-bounds-sw": true,
"pos-to-suppress": ["ADP", "ADV", "AUX", "CONJ", "CCONJ", "DET", "INTJ", "PART","SCONJ", "SYM", "SPACE", "X", "PRON", "PUNCT"]
}
}
The rules extract triples using the token sequence matcher of spaCy.
The syntax of the field is:
- <rule name> :
  - rule : matcher rule or triple rule
  - type : subject, object or predicate of the triple
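To make the structure of a rule file more concrete, here is a minimal Python sketch that indexes rules by their "type" and checks that every rule name used in a triple rule is defined. It only inspects the JSON; it does not run the spaCy matcher. The helper names rules_by_type and undefined_references are hypothetical, and the reference to "pattern_noun" is left undefined on purpose to show the check.

```python
import json

# Reduced rule file with the same shape as the example configuration.
RULES = json.loads("""
{
  "pattern_pro": {"rule": [[{"POS": {"IN": ["PRON"]}}]], "type": ["subject"]},
  "pattern_verb_phrase": {"rule": [[{"POS": {"IN": ["VERB", "AUX"]}, "OP": "+"}]],
                          "type": ["predicate"]},
  "triple_ner": {"type": ["triple"],
                 "rule": [[{"subject": "pattern_pro"},
                           {"predicate": "pattern_verb_phrase"},
                           {"object": "pattern_noun"}]]},
  "settings": {"suppress-bounds-sw": true}
}
""")

def rules_by_type(rules):
    """Map each declared type to the names of the rules carrying it."""
    index = {}
    for name, entry in rules.items():
        for rule_type in entry.get("type", []):  # "settings" declares no type
            index.setdefault(rule_type, []).append(name)
    return index

def undefined_references(rules):
    """List rule names used inside triple rules but not defined in the file."""
    missing = set()
    for entry in rules.values():
        if "triple" not in entry.get("type", []):
            continue
        for triple in entry["rule"]:
            for slot in triple:  # e.g. {"subject": "pattern_pro"}
                missing.update(ref for ref in slot.values() if ref not in rules)
    return sorted(missing)

print(rules_by_type(RULES))
print(undefined_references(RULES))
```

A check of this kind can catch a misspelled rule name before the service is started.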
Configure Syntactic tagger logger
The logger is configured at the top level of the JSON, in the "logger" field.
Example of Configuration:
The logger field is:
- logging-level
It can be set to the following values:
- debug to display debug-level and developer information
- info to display informational messages
- warning to display only warnings and errors
- error to display only errors
- critical to display only critical errors
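The values above line up with the levels of the Python standard logging module. The sketch below assumes that correspondence, which the documentation does not state explicitly; the function resolve_level and the logger name "syntactic-tagger" are hypothetical.

```python
import logging

# Accepted "logging-level" values mapped to standard library constants.
# Assumption: the service interprets the names the way Python logging does.
LEVELS = {
    "debug": logging.DEBUG,
    "info": logging.INFO,
    "warning": logging.WARNING,
    "error": logging.ERROR,
    "critical": logging.CRITICAL,
}

def resolve_level(config):
    """Return the logging constant for the top-level "logger" field."""
    name = config["logger"]["logging-level"]
    if name not in LEVELS:
        raise ValueError(f"unsupported logging-level: {name!r}")
    return LEVELS[name]

level = resolve_level({"logger": {"logging-level": "warning"}})
logging.getLogger("syntactic-tagger").setLevel(level)  # hypothetical logger name
print(logging.getLevelName(level))
```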
Configure Syntactic tagger Network
Example of Configuration:
{
"network": {
"host":"0.0.0.0",
"port":8080,
"associate-environment": {
"host":"HOST_ENVNAME",
"port":"PORT_ENVNAME"
},
"ssl":
{
"certificate":"path/to/certificate",
"key":"path/to/key"
}
}
}
The network fields:
- host : hostname
- port : port of the service
- associate-environment : names of the environment variables that, when set, replace the default "host" and "port" values. This field is not mandatory.
  - "host" : environment variable associated with "host"
  - "port" : environment variable associated with "port"
- ssl : SSL configuration. IN PRODUCTION IT IS MANDATORY TO USE A CERTIFICATE AND KEY THAT ARE *NOT* SELF SIGNED
  - certificate : certificate file
  - key : key file
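The override behaviour of "associate-environment" can be sketched as follows. This is a minimal illustration, not the tkeir implementation: the function resolve_network is hypothetical, and converting the port to an integer is an assumption (environment variables are always strings).

```python
import os

def resolve_network(network, environ=os.environ):
    """Return (host, port), letting the associated environment variables win.

    Hypothetical helper; `environ` can be any mapping, which eases testing.
    """
    host, port = network["host"], network["port"]
    env_names = network.get("associate-environment", {})  # optional field
    host_var = env_names.get("host")
    port_var = env_names.get("port")
    if host_var and host_var in environ:
        host = environ[host_var]
    if port_var and port_var in environ:
        port = int(environ[port_var])  # assumption: the port is cast to int
    return host, port

network = {
    "host": "0.0.0.0",
    "port": 8080,
    "associate-environment": {"host": "HOST_ENVNAME", "port": "PORT_ENVNAME"},
}
print(resolve_network(network, {}))                        # defaults are kept
print(resolve_network(network, {"PORT_ENVNAME": "9000"}))  # port is replaced
```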
Configure Syntactic tagger runtime
Example of Configuration:
{
"runtime":{
"request-max-size":100000000,
"request-buffer-queue-size":100,
"keep-alive":true,
"keep-alive-timeout":5,
"graceful-shutown-timeout":15.0,
"request-timeout":60,
"response-timeout":60,
"workers":1
}
}
The runtime fields:
- request-max-size : maximum size of a request (bytes)
- request-buffer-queue-size : size of the request streaming buffer queue
- request-timeout : how long a request can take to arrive (sec)
- response-timeout : how long a response can take to process (sec)
- keep-alive : whether connections are kept alive (true or false)
- keep-alive-timeout : how long to hold a TCP connection open (sec)
- graceful-shutown-timeout : how long to wait before force-closing non-idle connections (sec)
- workers : number of workers for the service on a node
- associate-environment : names of the environment variables that, when set, replace the default values of the fields above. This field is not mandatory.
  - request-max-size : overridden by the associated environment variable
  - request-buffer-queue-size : overridden by the associated environment variable
  - request-timeout : overridden by the associated environment variable
  - response-timeout : overridden by the associated environment variable
  - keep-alive : overridden by the associated environment variable
  - keep-alive-timeout : overridden by the associated environment variable
  - graceful-shutown-timeout : overridden by the associated environment variable
  - workers : overridden by the associated environment variable
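The sketch below shows one way such runtime overrides could be applied. Several points are assumptions rather than documented behaviour: that the association block is nested inside "runtime" under the key "associate-environment" (as in the network example), that each value is cast to the type of its default, and the environment variable names, which are invented for the illustration. The function apply_overrides is hypothetical.

```python
def apply_overrides(runtime, environ):
    """Return the runtime fields with environment overrides applied.

    Hypothetical helper: each override is cast to the type of its default.
    """
    resolved = {k: v for k, v in runtime.items() if k != "associate-environment"}
    for field, var in runtime.get("associate-environment", {}).items():
        if field not in resolved or var not in environ:
            continue
        default, raw = resolved[field], environ[var]
        if isinstance(default, bool):  # test bool first: bool is a subclass of int
            resolved[field] = raw.strip().lower() in ("1", "true", "yes")
        else:
            resolved[field] = type(default)(raw)  # int("4") or float("7.5")
    return resolved

runtime = {
    "workers": 1,
    "keep-alive": True,
    "request-timeout": 60,
    "associate-environment": {
        "workers": "SYNTAXTAGGER_WORKERS",        # invented variable name
        "keep-alive": "SYNTAXTAGGER_KEEP_ALIVE",  # invented variable name
    },
}
environ = {"SYNTAXTAGGER_WORKERS": "4", "SYNTAXTAGGER_KEEP_ALIVE": "false"}
resolved = apply_overrides(runtime, environ)
print(resolved)
```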
Syntactic tagger service
To run the service, simply type from the tkeir directory:
or, if you installed the tkeir wheel:
A lightweight client can be run with the command
python3 thot/syntactictagger_client.py --config=<path to syntactic tagger configuration file> --input=<input directory> --output=<output directory>
or, if you installed the tkeir wheel:
tkeir-syntactictagger-client --config=<path to syntactic tagger configuration file> --input=<input directory> --output=<output directory>
Syntactic tagger Tests
The Syntactic tagger service comes with unit and functional tests.
Syntactic Tagger Unit tests
The unit tests cover the Syntactic tagger classes only.
python3 -m unittest thot/tests/unittests/TestSyntacticTaggerConfiguration.py
python3 -m unittest thot/tests/unittests/TestSyntacticTagger.py