Search engine

Index

Document indexing is the step that stores in Elasticsearch the documents analyzed during the tokenization, tagging, keyword extraction and other steps.

Indexing configuration

Example of Configuration:

indexing.json

{
    "logger": {
        "logging-level": "{{ project.loglevel }}"
    },
    "indexing": {
        "document":{
            "remove-knowledge-graph-duplicates":true
        },
        "elasticsearch":{
            "network": {
                "host": "localhost",
                "port": 9200,
                "use_ssl": false,
                "verify_certs": false,
                "auth":{
                    "user":"admin",
                    "password":"admin",
                    "associate-environment": {
                        "user":"OPENDISTRO_USER",
                        "password":"OPENDISTRO_PASSWORD"
                    }
                },
                "associate-environment": {
                    "host":"OPENDISTRO_DNS_HOST",
                    "port":"OPENDISTRO_PORT",
                    "use_ssl":"OPENDISTRO_USE_SSL",
                    "verify_certs":"OPENDISTRO_VERIFY_CERTS"
                }
            },
            "nms-index":{
               "name":"default-nms-index",              
               "mapping-file":"{{ project.path }}/resources/indices/indices_mapping/nms_cache_index.json"
            },
            "text-index":{
                "name":"default-text-index",
                "mapping-file":"{{ project.path }}/resources/indices/indices_mapping/cache_index.json"
            },
            "relation-index":{
                "name":"default-relation-index",
                "mapping-file":"{{ project.path }}/resources/indices/indices_mapping/relation_index.json"
            }
        },
        "network": {
            "host":"0.0.0.0",
            "port":10012,
            "associate-environment": {
                "host":"INDEX_HOST",
                "port":"INDEX_PORT"
            }
        },
        "runtime":{
            "request-max-size":100000000,
            "request-buffer-queue-size":100,
            "keep-alive":true,
            "keep-alive-timeout":500,
            "graceful-shutown-timeout":15.0,
            "request-timeout":600,
            "response-timeout":600,
            "workers":1
        }
     }
}

The indexing configuration is an aggregation of the serialize configuration and the logger (at top level). The indexing configuration also needs the following document and Elasticsearch settings:

document/remove-knowledge-graph-duplicates: in the analyzed document, knowledge graph items are stored position-wise; to avoid duplicates in the index, you can drop the positions and thus make the items unique
elasticsearch/network: network configuration of Elasticsearch (E.S.)
elasticsearch/nms-index: index containing vectors (obsolete)
elasticsearch/text-index: index containing the analyzed textual documents
elasticsearch/relation-index: index containing relations (obsolete)
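
As a rough illustration of what remove-knowledge-graph-duplicates implies, the sketch below drops the position information from knowledge graph items and keeps one copy of each. The item layout (subject, property, object and position keys) and the function name are assumptions made for this example; they are not taken from the actual index mapping.

```python
import json


def remove_kg_duplicates(items, position_key="position"):
    """Return items without their position field, keeping first occurrences only."""
    seen = set()
    unique = []
    for item in items:
        # Drop the (hypothetical) position field so identical triples compare equal.
        stripped = {k: v for k, v in item.items() if k != position_key}
        # Serialize with sorted keys to get a hashable, order-independent signature.
        signature = json.dumps(stripped, sort_keys=True)
        if signature not in seen:
            seen.add(signature)
            unique.append(stripped)
    return unique


# Hypothetical knowledge graph items: the same triple found at two positions.
kg = [
    {"subject": "engine", "property": "index", "object": "document", "position": 3},
    {"subject": "engine", "property": "index", "object": "document", "position": 42},
    {"subject": "user", "property": "send", "object": "query", "position": 7},
]
deduplicated = remove_kg_duplicates(kg)
print(len(deduplicated), "unique items out of", len(kg))
```

The trade-off is that positions are lost: the index can no longer tell where in the document a triple occurred, only that it occurred.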

Run Index process

To run the command, simply type the following from the tkeir directory:

python3 thot/tkeir2index --config=<path to indexing configuration file> -d <directory to index> -t document

Another way is to use the index service. Here each document is indexed separately, whereas tkeir2index uses bulk indexing, which is much more efficient:

python3 thot/index_svc --config=<path to indexing configuration file>

or, if you installed the tkeir wheel:

tkeir-index-svc --config=<path to indexing configuration file>

It is possible to use a quick client:

python3 thot/index_client --config=<path to indexing configuration file> -i <path to tkeir document>

or, if you installed the tkeir wheel:

tkeir-index-client --config=<path to indexing configuration file> -i <path to tkeir document>
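
When indexing is part of a larger script, the bulk command can be wrapped in a small Python helper. This is only a sketch: the helper names and the example paths (configs/indexing.json, data/analyzed) are placeholders, and it uses the current Python interpreter instead of a literal python3. The --config, -d and -t options are the ones shown for tkeir2index.

```python
import subprocess
import sys


def build_bulk_index_command(config_path, directory, doc_type="document"):
    """Build the tkeir2index command line as an argument list."""
    return [
        sys.executable, "thot/tkeir2index",
        f"--config={config_path}",
        "-d", directory,
        "-t", doc_type,
    ]


def run_bulk_index(config_path, directory, tkeir_dir="."):
    """Run the bulk indexer from the tkeir directory and return its exit code."""
    command = build_bulk_index_command(config_path, directory)
    completed = subprocess.run(command, cwd=tkeir_dir, check=False)
    return completed.returncode


# Placeholder paths, shown only to illustrate the resulting command line.
command = build_bulk_index_command("configs/indexing.json", "data/analyzed")
print(" ".join(command[1:]))
```

Calling run_bulk_index(...) with real paths launches the indexer; a non-zero return value signals that the command failed.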

Search

The search component can be seen as a proxy to Elasticsearch (E.S.). Nonetheless, it can build specific E.S. queries based on query analysis. It can also manipulate ranking scores.

Search API

Search Configuration

Example of Configuration:

search.json

{
    "logger": {
        "logging-level": "{{ project.loglevel }}"
    },
    "searching": {        
        "document-index-name":"default-text-index",
        "disable-document-analysis":false,
        "aggregator":{
            "enable":false,
            "host":"localhost",
            "port":"18888",
            "index":true,
            "engines":["wikipedia","github","qwantp","ai4europe"],
            "index-pipeline":{
                "host":"localhost",
                "port":"10006",
                "use-ssl":false,
                "no-ssl-verify":false,
                "associate-environment": {
                    "host":"PIPELINE_HOST",
                    "port":"PIPELINE_PORT"
                }
            }
        },
        "qa":{
            "enable":true,
            "host":"localhost",
            "port":10011,
            "associate-environment": {
                "host":"QA_HOST",
                "port":"QA_PORT"
            },
            "max-question-size":32,
            "max-ranked-doc":5,
            "use-ssl":false,
            "no-ssl-verify":false
        },
        "suggester":{
            "number-of-suggestions":10,
            "spell-check":true
        },
        "search-policy":{            
            "semantic-cluster":{
                "semantic-quantizer-model":"{{ project.path }}/resources/modeling/relation_names.model.pkl"
            },
            "settings":{                
                "basic-querying":{
                    "uniq-word-query":true,
                    "boosted-uniq-word-query":false,
                    "cut-query":4096
                },
                "advanced-querying":{
                    "use-lemma":true,
                    "use-keywords":true,
                    "use-knowledge-graph":true,
                    "use-semantic-keywords":true,
                    "use-semantic-knowledge-graph":true, 
                    "use-concepts":true,
                    "use-sentences":false,
                    "querying":{
                        "match-phrase-slop":3,
                        "match-phrase-boosting":0.5,
                        "match-sentence":{
                            "number-and-symbol-filtering":true,
                            "max-number-of-words":30
                        },
                        "match-keyword":{
                            "number-and-symbol-filtering":true,                            
                            "semantic-skip-highest-ranked-classes":3,
                            "semantic-max-boosting":5
                        },
                        "match-svo":{
                            "semantic-use-class-triple":false,
                            "semantic-use-lemma-property-object":false,
                            "semantic-use-subject-lemma-object":false,
                            "semantic-use-subject-property-lemma":false,
                            "semantic-use-lemma-lemma-object":true,
                            "semantic-use-lemma-property-lemma":true,
                            "semantic-use-subject-lemma-lemma":true,
                            "semanic-max-boosting":5
                        },
                        "match-concept":{
                            "concept-boosting":0.2,
                            "concept-pruning":10
                        }
                    }
                },
                "query-expansion":{
                    "term-pruning":128,
                    "suppress-number":true,
                    "keep-word-collection-thresold-under":0.4,
                    "word-boost-thresold-above":0.25
                },
                "scoring":{
                    "normalize-score":true,                    
                    "document-query-intersection-penalty":"by-query-size",
                    "run-clause-separately":false,
                    "expand-results":50
                },
                "results":{
                    "see-also":{
                        "number-of-cross-links":10                        
                    },
                    "named-entity-explain": {
                        "min-score":0.25,
                        "max-query":3
                    },
                    "default-from":0,
                    "default-size":5,
                    "set-highlight":false,
                    "excludes":[]
                }
            }            
        },
        "elasticsearch":{
            "network": {
                "host": "tkeir-opendistro",
                "port": 9200,
                "use_ssl": true,
                "verify_certs": false,
                "auth":{
                    "user":"admin",
                    "password":"admin",
                    "associate-environment": {
                        "user":"OPENDISTRO_USER",
                        "password":"OPENDISTRO_PASSWORD"
                    }
                },
                "associate-environment": {
                    "host":"OPENDISTRO_DNS_HOST",
                    "port":"OPENDISTRO_PORT",
                    "use_ssl":"OPENDISTRO_USE_SSL",
                    "verify_certs":"OPENDISTRO_VERIFY_CERTS"
                }
            }
        },
        "network": {
            "host":"0.0.0.0",
            "port":9000,
            "associate-environment": {
                "host":"SEARCH_HOST",
                "port":"SEARCH_PORT"
            }
        },
        "runtime":{
            "request-max-size":100000000,
            "request-buffer-queue-size":100,
            "keep-alive":true,
            "keep-alive-timeout":5,
            "graceful-shutown-timeout":15.0,
            "request-timeout":60,
            "response-timeout":60,
            "workers":1
        }
    },
    "tokenizers": {
        "segmenters":[{
            "language":"en",
            "resources-base-path":"{{ project.path }}/resources/modeling/tokenizer/en",
            "normalization-rules":"tokenizer-rules.json",
            "mwe": "tkeir_mwe.pkl"
        }],
        "network": {
            "host":"0.0.0.0",
            "port":10001,
            "associate-environment": {
                "host":"TOKENIZER_HOST",
                "port":"TOKENIZER_PORT"
            }
        },
        "runtime":{
            "request-max-size":100000000,
            "request-buffer-queue-size":100,
            "keep-alive":true,
            "keep-alive-timeout":5,
            "graceful-shutown-timeout":15.0,
            "request-timeout":60,
            "response-timeout":60,
            "workers":1
        }
    },
    "morphosyntax": {
        "taggers":[{
            "language":"en",
            "resources-base-path":"{{ project.path }}/resources/modeling/tokenizer/en",
            "mwe": "tkeir_mwe.pkl",
            "pre-sentencizer": true,
            "pre-tagging-with-concept":true,
            "add-concept-in-knowledge-graph":true
        }],
        "network": {
            "host":"0.0.0.0",
            "port":10002,
            "associate-environment": {
                "host":"MSTAGGER_HOST",
                "port":"MSTAGGER_PORT"
            }
        },
        "runtime":{
            "request-max-size":100000000,
            "request-buffer-queue-size":100,
            "keep-alive":true,
            "keep-alive-timeout":5,
            "graceful-shutown-timeout":15.0,
            "request-timeout":60,
            "response-timeout":60,
            "workers":1
        }
    },
    "named-entities": {
        "label":[{
            "language":"en",
            "resources-base-path":"{{ project.path }}/resources/modeling/tokenizer/en",
            "mwe": "tkeir_mwe.pkl",
            "ner-rules": "ner-rules.json",
            "use-pre-label":true
        }],
        "network": {
            "host":"0.0.0.0",
            "port":10003,
            "associate-environment": {
                "host":"NERTAGGER_HOST",
                "port":"NERTAGGER_PORT"
            }
        },
        "runtime":{
            "request-max-size":100000000,
            "request-buffer-queue-size":100,
            "keep-alive":true,
            "keep-alive-timeout":5,
            "graceful-shutown-timeout":15.0,
            "request-timeout":60,
            "response-timeout":60,
            "workers":1
        }
    },
    "embeddings": {
        "models":[
        { 
            "language":"multi",
            "use-cuda":false,
            "batch-size":256
        }
        ],
        "network": {
            "host":"0.0.0.0",
            "port":10005,
            "associate-environment": {
                "host":"SENT_EMBEDDING_HOST",
                "port":"SENT_EMBEDDING_PORT"
            }
        },
        "runtime":{
            "request-max-size":100000000,
            "request-buffer-queue-size":100,
            "keep-alive":true,
            "keep-alive-timeout":5,
            "graceful-shutown-timeout":15.0,
            "request-timeout":60,
            "response-timeout":60,
            "workers":1
        }
    },
    "syntax": {
        "taggers":[{
            "language":"en",
            "resources-base-path":"{{ project.path }}/configs",
            "syntactic-rules": "syntactic-rules.json"
        }],
        "network": {
            "host":"0.0.0.0",
            "port":10004,
            "associate-environment": {
                "host":"SYNTAXTAGGER_HOST",
                "port":"SYNTAXTAGGER_PORT"
            }
        },
        "runtime":{
            "request-max-size":100000000,
            "request-buffer-queue-size":100,
            "keep-alive":true,
            "keep-alive-timeout":5,
            "graceful-shutown-timeout":15.0,
            "request-timeout":60,
            "response-timeout":60,
            "workers":1
        }
    },
    "keywords": {
        "extractors":[{
            "language":"en",
            "resources-base-path":"{{ project.path }}/resources/modeling/tokenizer/en",
            "stopwords":"en.stopwords.lst",
            "use-lemma":true,
            "use-pos":true,
            "use-form":false            
        }],
        "network": {
            "host":"0.0.0.0",
            "port":10007,
            "associate-environment": {
                "host":"KEYWORD_HOST",
                "port":"KEYWORD_PORT"
            }
        },
        "runtime":{
            "request-max-size":100000000,
            "request-buffer-queue-size":100,
            "keep-alive":true,
            "keep-alive-timeout":5,
            "graceful-shutown-timeout":15.0,
            "request-timeout":60,
            "response-timeout":60,
            "workers":1
        }
    }
}
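
Individual settings in a file such as search.json can be addressed with slash-separated paths, for example searching/qa/port. The following sketch shows one possible way to read a value by such a path. The get_setting helper is hypothetical and not part of tkeir, and the configuration excerpt is inlined to keep the example self-contained.

```python
import json


def get_setting(config, path, default=None):
    """Follow a slash-separated path through nested dictionaries."""
    node = config
    for key in path.split("/"):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


# A small excerpt of the search configuration.
excerpt = json.loads("""
{
    "searching": {
        "document-index-name": "default-text-index",
        "qa": {"enable": true, "host": "localhost", "port": 10011}
    }
}
""")

print(get_setting(excerpt, "searching/qa/port"))
print(get_setting(excerpt, "searching/document-index-name"))
print(get_setting(excerpt, "searching/qa/missing", default="n/a"))
```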

The search configuration sets up the search behaviour according to the query analysis. The other configuration sections are not specific to this service.

searching/document-index-name: name of the index where the documents are stored
searching/disable-document-analysis: document analysis is not mandatory and can be disabled
searching/qa/host: question answering subsystem host
searching/qa/port: question answering subsystem port
searching/qa/max-question-size: maximum size of the question (to run the Q/A subsystem)
searching/qa/max-ranked-doc: maximum number of ranked documents on which Q/A is applied
searching/qa/use-ssl: access the Q/A subsystem over SSL
searching/qa/no-ssl-verify: do not verify certificates when accessing the Q/A subsystem
searching/suggester/number-of-suggestions: maximum number of suggestions
searching/suggester/spell-check: not yet implemented
searching/aggregator/host: hostname of searx
searching/aggregator/port: port of searx
searching/aggregator/index: index (or not) the searx results
searching/aggregator/engines: searx engines used
searching/aggregator/index-pipeline: index pipeline network configuration
searching/search-policy/semantic-cluster/semantic-quantizer-model: path to the clustering model used for "statistical" semantics
searching/search-policy/settings/basic-querying/uniq-word-query: transform the query into a bag of words
searching/search-policy/settings/basic-querying/boosted-uniq-word-query: weight words according to their frequency in the query
searching/search-policy/settings/basic-querying/cut-query: maximum number of unique words in the query
searching/search-policy/settings/advanced-querying/use-lemma: use the lemmatised field of the index
searching/search-policy/settings/advanced-querying/use-keywords: use the keywords field of the index
searching/search-policy/settings/advanced-querying/use-knowledge-graph: use the knowledge graph (the triples) of the index
searching/search-policy/settings/advanced-querying/use-concepts: use the concepts of the index
searching/search-policy/settings/advanced-querying/use-sentences: use sentence querying
searching/search-policy/settings/advanced-querying/querying/match-phrase-slop: slop in the match phrase clause
searching/search-policy/settings/advanced-querying/querying/match-phrase-boosting: default boosting value for match phrase
searching/search-policy/settings/advanced-querying/querying/match-sentence/number-and-symbol-filtering: filter symbols and numbers from sentences
searching/search-policy/settings/advanced-querying/querying/match-sentence/max-number-of-words: maximum length (in words) of the sentence
searching/search-policy/settings/advanced-querying/querying/match-keyword/number-and-symbol-filtering: filter numbers and symbols
searching/search-policy/settings/advanced-querying/querying/match-keyword/semantic-skip-highest-ranked-classes: when you use semantic classes (coming from clustering), the most common classes are often irrelevant; you can skip these classes
searching/search-policy/settings/advanced-querying/querying/match-keyword/semantic-max-boosting: query boosting in the match-phrase clause
searching/search-policy/settings/advanced-querying/querying/match-svo/semantic-use-class-triple: create a query clause with all semantic classes
searching/search-policy/settings/advanced-querying/querying/match-svo/semantic-use-lemma-property-object: use lemma on subject, class on property and object
searching/search-policy/settings/advanced-querying/querying/match-svo/semantic-use-subject-lemma-object: use lemma on property, class on subject and object
searching/search-policy/settings/advanced-querying/querying/match-svo/semantic-use-subject-property-lemma: use lemma on object, class on subject and property
searching/search-policy/settings/advanced-querying/querying/match-svo/semantic-use-lemma-lemma-object: use lemma on subject and property, class on object
searching/search-policy/settings/advanced-querying/querying/match-svo/semantic-use-lemma-property-lemma: use lemma on subject and object, class on property
searching/search-policy/settings/advanced-querying/querying/match-svo/semantic-use-subject-lemma-lemma: use lemma on property and object, class on subject
searching/search-policy/settings/advanced-querying/querying/match-svo/semanic-max-boosting: not yet implemented
searching/search-policy/settings/advanced-querying/querying/match-concept/concept-boosting: boost of the concept clause
searching/search-policy/settings/advanced-querying/querying/match-concept/concept-pruning: top N concepts used
searching/search-policy/settings/query-expansion/term-pruning: maximum number of terms used in expansion
searching/search-policy/settings/query-expansion/suppress-number: filter numbers
searching/search-policy/settings/query-expansion/keep-word-collection-thresold-under: the frequency of documents in which the word appears should be lower than this value
searching/search-policy/settings/query-expansion/word-boost-thresold-above: the frequency of the word in the document should be greater than this value
searching/search-policy/settings/scoring/normalize-score: normalize the Elasticsearch score by the maximum score
searching/search-policy/settings/scoring/document-query-intersection-penalty: normalization of the document and query intersection: no-normalization, by-query-size or by-union-size (Jaccard)
searching/search-policy/settings/scoring/run-clause-separately: clauses can either be run separately, in which case the ranked lists are merged, or be put in a single query
searching/search-policy/settings/scoring/expand-results: when clauses are run separately and the results merged, it is useful to expand the result list size to cover more ranked documents
searching/search-policy/settings/results/set-highlight: highlight snippets
searching/search-policy/settings/results/see-also/number-of-cross-links: compute the see-also graph with this number of cross-linked documents per ranked document
searching/search-policy/settings/results/excludes: fields excluded from the returned list
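
To make the scoring options more concrete, here is a minimal sketch of one plausible reading of normalize-score and document-query-intersection-penalty. The function names, the term lists and the way the values are computed are assumptions for illustration; the actual service may compute and combine these quantities differently.

```python
def normalize_scores(scores):
    """Divide every score by the maximum so that the best hit scores 1.0."""
    top = max(scores, default=0.0)
    if top <= 0:
        return list(scores)
    return [score / top for score in scores]


def intersection_penalty(query_terms, document_terms, mode="by-query-size"):
    """Size of the query/document term overlap, normalized according to mode."""
    query, document = set(query_terms), set(document_terms)
    shared = len(query & document)
    if mode == "no-normalization":
        return float(shared)
    if mode == "by-query-size":
        return shared / len(query) if query else 0.0
    if mode == "by-union-size":
        # Jaccard index: intersection size over union size.
        union = len(query | document)
        return shared / union if union else 0.0
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical query and document terms.
query = ["semantic", "search", "engine"]
document = ["search", "engine", "index", "document"]

print(normalize_scores([12.0, 6.0, 3.0]))
for mode in ("no-normalization", "by-query-size", "by-union-size"):
    print(mode, intersection_penalty(query, document, mode))
```

With by-query-size, a document is judged only on how much of the query it covers; with by-union-size, documents carrying many terms unrelated to the query get a lower value.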

Configure Search Network

Example of Configuration:

network configuration

{
    "network": {
        "host":"0.0.0.0",
        "port":8080,
        "associate-environment": {
            "host":"HOST_ENVNAME",
            "port":"PORT_ENVNAME"
        },
        "ssl":
        {
            "certificate":"path/to/certificate",
            "key":"path/to/key"
        }
    }
}

The network fields:

host: hostname
port: port of the service
associate-environment: the "host" and "port" environment variables that, when set, replace the default values. This field is not mandatory.
"host": associated "host" environment variable
"port": associated "port" environment variable
ssl: SSL configuration. IN PRODUCTION IT IS MANDATORY TO USE A CERTIFICATE AND KEY THAT ARE *NOT* SELF-SIGNED
certificate: certificate file
key: key file
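
The override mechanism can be sketched as follows. This is an illustration of the idea, not the service's own loader: the resolve_network function is hypothetical, and it assumes that an unset or empty environment variable leaves the default value in place.

```python
import os


def resolve_network(network, environ=None):
    """Return host and port, letting associated environment variables override defaults."""
    environ = os.environ if environ is None else environ
    resolved = {"host": network["host"], "port": network["port"]}
    for field, variable in network.get("associate-environment", {}).items():
        value = environ.get(variable)
        # Only override known fields, and ignore unset or empty variables.
        if field in resolved and value:
            resolved[field] = value
    # Environment variables are strings, so convert the port back to an integer.
    resolved["port"] = int(resolved["port"])
    return resolved


network = {
    "host": "0.0.0.0",
    "port": 8080,
    "associate-environment": {"host": "HOST_ENVNAME", "port": "PORT_ENVNAME"},
}

default_settings = resolve_network(network, environ={})
overridden_settings = resolve_network(network, environ={"PORT_ENVNAME": "9000"})
print(default_settings)
print(overridden_settings)
```

Passing the environment as a parameter keeps the sketch easy to test; calling resolve_network(network) without it reads the real process environment.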

Configure Search runtime

Example of Configuration:

runtime configuration

{
    "runtime":{
        "request-max-size":100000000,
        "request-buffer-queue-size":100,
        "keep-alive":true,
        "keep-alive-timeout":5,
        "graceful-shutown-timeout":15.0,
        "request-timeout":60,
        "response-timeout":60,
        "workers":1
    }    
}

The Runtime fields:

request-max-size: how big a request may be (bytes)
request-buffer-queue-size: request streaming buffer queue size
request-timeout: how long a request can take to arrive (sec)
response-timeout: how long a response can take to process (sec)
keep-alive: enable or disable keep-alive
keep-alive-timeout: how long to hold a TCP connection open (sec)
graceful-shutown-timeout: how long to wait to force close non-idle connections (sec)
workers: number of workers for the service on a node
associate-environment: if one of the previous fields is listed here, the associated environment variable replaces the default value. This field is not mandatory.
request-max-size: overwritten by the environment variable
request-buffer-queue-size: overwritten by the environment variable
request-timeout: overwritten by the environment variable
response-timeout: overwritten by the environment variable
keep-alive: overwritten by the environment variable
keep-alive-timeout: overwritten by the environment variable
graceful-shutown-timeout: overwritten by the environment variable
workers: overwritten by the environment variable
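
Before starting a service, it can be handy to sanity-check a runtime block. The sketch below is a hypothetical helper, not something the service provides, and the rules it applies (keep-alive is a boolean, every other field is a positive number) are assumptions derived from the example values above.

```python
def validate_runtime(runtime):
    """Return a list of problems found in a runtime block (empty if none)."""
    problems = []
    for key, value in runtime.items():
        if key == "keep-alive":
            if not isinstance(value, bool):
                problems.append(f"{key}: expected true or false")
        # bool is a subclass of int in Python, so reject it explicitly for numeric fields.
        elif isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{key}: expected a number")
        elif value <= 0:
            problems.append(f"{key}: expected a positive value")
    return problems


# Runtime block copied from the example configuration.
runtime = {
    "request-max-size": 100000000,
    "request-buffer-queue-size": 100,
    "keep-alive": True,
    "keep-alive-timeout": 5,
    "graceful-shutown-timeout": 15.0,
    "request-timeout": 60,
    "response-timeout": 60,
    "workers": 1,
}

problems = validate_runtime(runtime)
print(problems or "runtime block looks consistent")
```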

Run Search engine service

To run the command, simply type the following from the tkeir directory:

python3 thot/search_svc.py --config=<path to search configuration file>

or, if you installed the tkeir wheel:

tkeir-search-svc --config=<path to search configuration file>