
Search engine

Index

Document indexing is the step that stores in Elasticsearch the documents analyzed during the tokenization, tagging, keyword extraction and other analysis steps.

Indexing configuration

Example of Configuration:

indexing.json
{
    "logger": {
        "logging-level": "{{ project.loglevel }}"
    },
    "indexing": {
        "document":{
            "remove-knowledge-graph-duplicates":true
        },
        "elasticsearch":{
            "network": {
                "host": "localhost",
                "port": 9200,
                "use_ssl": false,
                "verify_certs": false,
                "auth":{
                    "user":"admin",
                    "password":"admin",
                    "associate-environment": {
                        "user":"OPENDISTRO_USER",
                        "password":"OPENDISTRO_PASSWORD"
                    }
                },
                "associate-environment": {
                    "host":"OPENDISTRO_DNS_HOST",
                    "port":"OPENDISTRO_PORT",
                    "use_ssl":"OPENDISTRO_USE_SSL",
                    "verify_certs":"OPENDISTRO_VERIFY_CERTS"
                }
            },
            "nms-index":{
               "name":"default-nms-index",              
               "mapping-file":"{{ project.path }}/resources/indices/indices_mapping/nms_cache_index.json"
            },
            "text-index":{
                "name":"default-text-index",
                "mapping-file":"{{ project.path }}/resources/indices/indices_mapping/cache_index.json"
            },
            "relation-index":{
                "name":"default-relation-index",
                "mapping-file":"{{ project.path }}/resources/indices/indices_mapping/relation_index.json"
            }
        },
        "network": {
            "host":"0.0.0.0",
            "port":10012,
            "associate-environment": {
                "host":"INDEX_HOST",
                "port":"INDEX_PORT"
            }
        },
        "runtime":{
            "request-max-size":100000000,
            "request-buffer-queue-size":100,
            "keep-alive":true,
            "keep-alive-timeout":500,
            "graceful-shutown-timeout":15.0,
            "request-timeout":600,
            "response-timeout":600,
            "workers":1
        }
     }
}

The indexing configuration is an aggregation of the serialize configuration and the logger (at top level). The indexing configuration also needs an Elasticsearch configuration:

  • document/remove-knowledge-graph-duplicates: in the analyzed document, knowledge graph items are position-wise; to avoid duplication in the index, you can suppress the positions and thus make the items unique,
  • elasticsearch/network: network configuration of Elasticsearch (E.S.),
  • elasticsearch/nms-index: index containing vectors (obsolete),
  • elasticsearch/text-index: index containing the analyzed textual documents,
  • elasticsearch/relation-index: index containing relations (obsolete).
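
To illustrate what remove-knowledge-graph-duplicates is meant to achieve, here is a minimal Python sketch. The item layout (subject/property/object entries with a "positions" key) is a hypothetical structure chosen for the example, not the actual tkeir document schema, and the function is not the tkeir implementation.

```python
import json


def remove_kg_duplicates(kg_items):
    """Drop position data and keep one copy of each remaining item.

    Assumption: each item is a JSON-serializable dict and its positions
    are stored under a hypothetical "positions" key.
    """
    seen = set()
    unique_items = []
    for item in kg_items:
        stripped = {k: v for k, v in item.items() if k != "positions"}
        # Serialize with sorted keys to get a hashable, order-independent key.
        key = json.dumps(stripped, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique_items.append(stripped)
    return unique_items


kg = [
    {"subject": "engine", "property": "index", "object": "document", "positions": [3, 4, 5]},
    {"subject": "engine", "property": "index", "object": "document", "positions": [17, 18, 19]},
    {"subject": "user", "property": "run", "object": "query", "positions": [30, 31, 32]},
]
unique = remove_kg_duplicates(kg)
print(len(unique))  # the two "engine index document" items collapse into one
```

Without the positions, the first two items become identical, so only one of them is kept.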

Run Index process

To run the command, simply type from the tkeir directory:

python3 thot/tkeir2index --config=<path to indexing configuration file> -d <directory to index> -t document

Another way is to use the index service (here each document is indexed separately, whereas tkeir2index uses bulk indexing, which is much more efficient):

python3 thot/index_svc --config=<path to indexing configuration file>

or, if you installed the tkeir wheel:

tkeir-index-svc --config=<path to indexing configuration file>
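
The efficiency gap between the two approaches comes mostly from the number of round trips: one request per document versus one request for many documents. As a rough illustration, the sketch below lays out the body expected by the Elasticsearch `_bulk` endpoint, which is newline-delimited JSON with one action line followed by one source line per document. The function name and the sample documents are assumptions made for the example; this is not the tkeir2index code.

```python
import json


def build_bulk_body(index_name, documents):
    """Build a newline-delimited JSON body for the Elasticsearch _bulk endpoint.

    documents: iterable of (doc_id, source_dict) pairs (hypothetical layout).
    """
    lines = []
    for doc_id, source in documents:
        # Action line: tells Elasticsearch where to index the next line.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        # Source line: the document itself.
        lines.append(json.dumps(source))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


docs = [
    ("doc-1", {"title": "first document"}),
    ("doc-2", {"title": "second document"}),
]
body = build_bulk_body("default-text-index", docs)
print(body)
```

Two documents yield four lines sent in a single request, where the index service would issue two separate requests.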

It is possible to use a quick client:

python3 thot/index_client --config=<path to indexing configuration file> -i <path to tkeir document>

or, if you installed the tkeir wheel:

tkeir-index-client --config=<path to indexing configuration file> -i <path to tkeir document>

The search component can be seen as a proxy to Elasticsearch (E.S.). Nonetheless, it also creates specific E.S. queries based on query analysis and can manipulate ranking scores.
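
As an illustration of what a query built from query analysis could look like, the sketch below assembles an Elasticsearch bool query that combines a plain match clause with a match_phrase clause. The field name "content" and the function are hypothetical; the default slop and boost values mirror the match-phrase-slop and match-phrase-boosting settings of the search.json example. The queries actually generated by the search service may differ.

```python
import json


def build_query(text, field="content", slop=3, phrase_boost=0.5):
    """Build a bool query with a match clause and a boosted match_phrase clause.

    Assumption: "content" is a placeholder field name, not a tkeir index field.
    """
    return {
        "query": {
            "bool": {
                "should": [
                    {"match": {field: {"query": text}}},
                    {
                        "match_phrase": {
                            field: {"query": text, "slop": slop, "boost": phrase_boost}
                        }
                    },
                ]
            }
        }
    }


query = build_query("semantic search engine")
print(json.dumps(query, indent=2))
```

The slop sets how many positions the phrase terms may be moved apart and still match, and the boost weights the phrase clause relative to the other clauses.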

Search API

Search Configuration

Example of Configuration:

search.json
{
    "logger": {
        "logging-level": "{{ project.loglevel }}"
    },
    "searching": {        
        "document-index-name":"default-text-index",
        "disable-document-analysis":false,
        "aggregator":{
            "enable":false,
            "host":"localhost",
            "port":"18888",
            "index":true,
            "engines":["wikipedia","github","qwantp","ai4europe"],
            "index-pipeline":{
                "host":"localhost",
                "port":"10006",
                "use-ssl":false,
                "no-ssl-verify":false,
                "associate-environment": {
                    "host":"PIPELINE_HOST",
                    "port":"PIPELINE_PORT"
                }
            }
        },
        "qa":{
            "enable":true,
            "host":"localhost",
            "port":10011,
            "associate-environment": {
                "host":"QA_HOST",
                "port":"QA_PORT"
            },
            "max-question-size":32,
            "max-ranked-doc":5,
            "use-ssl":false,
            "no-ssl-verify":false
        },
        "suggester":{
            "number-of-suggestions":10,
            "spell-check":true
        },
        "search-policy":{            
            "semantic-cluster":{
                "semantic-quantizer-model":"{{ project.path }}/resources/modeling/relation_names.model.pkl"
            },
            "settings":{                
                "basic-querying":{
                    "uniq-word-query":true,
                    "boosted-uniq-word-query":false,
                    "cut-query":4096
                },
                "advanced-querying":{
                    "use-lemma":true,
                    "use-keywords":true,
                    "use-knowledge-graph":true,
                    "use-semantic-keywords":true,
                    "use-semantic-knowledge-graph":true, 
                    "use-concepts":true,
                    "use-sentences":false,
                    "querying":{
                        "match-phrase-slop":3,
                        "match-phrase-boosting":0.5,
                        "match-sentence":{
                            "number-and-symbol-filtering":true,
                            "max-number-of-words":30
                        },
                        "match-keyword":{
                            "number-and-symbol-filtering":true,                            
                            "semantic-skip-highest-ranked-classes":3,
                            "semantic-max-boosting":5
                        },
                        "match-svo":{
                            "semantic-use-class-triple":false,
                            "semantic-use-lemma-property-object":false,
                            "semantic-use-subject-lemma-object":false,
                            "semantic-use-subject-property-lemma":false,
                            "semantic-use-lemma-lemma-object":true,
                            "semantic-use-lemma-property-lemma":true,
                            "semantic-use-subject-lemma-lemma":true,
                            "semanic-max-boosting":5
                        },
                        "match-concept":{
                            "concept-boosting":0.2,
                            "concept-pruning":10
                        }
                    }
                },
                "query-expansion":{
                    "term-pruning":128,
                    "suppress-number":true,
                    "keep-word-collection-thresold-under":0.4,
                    "word-boost-thresold-above":0.25
                },
                "scoring":{
                    "normalize-score":true,                    
                    "document-query-intersection-penalty":"by-query-size",
                    "run-clause-separately":false,
                    "expand-results":50
                },
                "results":{
                    "see-also":{
                        "number-of-cross-links":10                        
                    },
                    "named-entity-explain": {
                        "min-score":0.25,
                        "max-query":3
                    },
                    "default-from":0,
                    "default-size":5,
                    "set-highlight":false,
                    "excludes":[]
                }
            }            
        },
        "elasticsearch":{
            "network": {
                "host": "tkeir-opendistro",
                "port": 9200,
                "use_ssl": true,
                "verify_certs": false,
                "auth":{
                    "user":"admin",
                    "password":"admin",
                    "associate-environment": {
                        "user":"OPENDISTRO_USER",
                        "password":"OPENDISTRO_PASSWORD"
                    }
                },
                "associate-environment": {
                    "host":"OPENDISTRO_DNS_HOST",
                    "port":"OPENDISTRO_PORT",
                    "use_ssl":"OPENDISTRO_USE_SSL",
                    "verify_certs":"OPENDISTRO_VERIFY_CERTS"
                }
            }
        },
        "network": {
            "host":"0.0.0.0",
            "port":9000,
            "associate-environment": {
                "host":"SEARCH_HOST",
                "port":"SEARCH_PORT"
            }
        },
        "runtime":{
            "request-max-size":100000000,
            "request-buffer-queue-size":100,
            "keep-alive":true,
            "keep-alive-timeout":5,
            "graceful-shutown-timeout":15.0,
            "request-timeout":60,
            "response-timeout":60,
            "workers":1
        }
    },
    "tokenizers": {
        "segmenters":[{
            "language":"en",
            "resources-base-path":"{{ project.path }}/resources/modeling/tokenizer/en",
            "normalization-rules":"tokenizer-rules.json",
            "mwe": "tkeir_mwe.pkl"
        }],
        "network": {
            "host":"0.0.0.0",
            "port":10001,
            "associate-environment": {
                "host":"TOKENIZER_HOST",
                "port":"TOKENIZER_PORT"
            }
        },
        "runtime":{
            "request-max-size":100000000,
            "request-buffer-queue-size":100,
            "keep-alive":true,
            "keep-alive-timeout":5,
            "graceful-shutown-timeout":15.0,
            "request-timeout":60,
            "response-timeout":60,
            "workers":1
        }
    },
    "morphosyntax": {
        "taggers":[{
            "language":"en",
            "resources-base-path":"{{ project.path }}/resources/modeling/tokenizer/en",
            "mwe": "tkeir_mwe.pkl",
            "pre-sentencizer": true,
            "pre-tagging-with-concept":true,
            "add-concept-in-knowledge-graph":true
        }],
        "network": {
            "host":"0.0.0.0",
            "port":10002,
            "associate-environment": {
                "host":"MSTAGGER_HOST",
                "port":"MSTAGGER_PORT"
            }
        },
        "runtime":{
            "request-max-size":100000000,
            "request-buffer-queue-size":100,
            "keep-alive":true,
            "keep-alive-timeout":5,
            "graceful-shutown-timeout":15.0,
            "request-timeout":60,
            "response-timeout":60,
            "workers":1
        }
    },
    "named-entities": {
        "label":[{
            "language":"en",
            "resources-base-path":"{{ project.path }}/resources/modeling/tokenizer/en",
            "mwe": "tkeir_mwe.pkl",
            "ner-rules": "ner-rules.json",
            "use-pre-label":true
        }],
        "network": {
            "host":"0.0.0.0",
            "port":10003,
            "associate-environment": {
                "host":"NERTAGGER_HOST",
                "port":"NERTAGGER_PORT"
            }
        },
        "runtime":{
            "request-max-size":100000000,
            "request-buffer-queue-size":100,
            "keep-alive":true,
            "keep-alive-timeout":5,
            "graceful-shutown-timeout":15.0,
            "request-timeout":60,
            "response-timeout":60,
            "workers":1
        }
    },
    "embeddings": {
        "models":[
        { 
            "language":"multi",
            "use-cuda":false,
            "batch-size":256
        }
        ],
        "network": {
            "host":"0.0.0.0",
            "port":10005,
            "associate-environment": {
                "host":"SENT_EMBEDDING_HOST",
                "port":"SENT_EMBEDDING_PORT"
            }
        },
        "runtime":{
            "request-max-size":100000000,
            "request-buffer-queue-size":100,
            "keep-alive":true,
            "keep-alive-timeout":5,
            "graceful-shutown-timeout":15.0,
            "request-timeout":60,
            "response-timeout":60,
            "workers":1
        }
    },
    "syntax": {
        "taggers":[{
            "language":"en",
            "resources-base-path":"{{ project.path }}/configs",
            "syntactic-rules": "syntactic-rules.json"
        }],
        "network": {
            "host":"0.0.0.0",
            "port":10004,
            "associate-environment": {
                "host":"SYNTAXTAGGER_HOST",
                "port":"SYNTAXTAGGER_PORT"
            }
        },
        "runtime":{
            "request-max-size":100000000,
            "request-buffer-queue-size":100,
            "keep-alive":true,
            "keep-alive-timeout":5,
            "graceful-shutown-timeout":15.0,
            "request-timeout":60,
            "response-timeout":60,
            "workers":1
        }
    },
    "keywords": {
        "extractors":[{
            "language":"en",
            "resources-base-path":"{{ project.path }}/resources/modeling/tokenizer/en",
            "stopwords":"en.stopwords.lst",
            "use-lemma":true,
            "use-pos":true,
            "use-form":false            
        }],
        "network": {
            "host":"0.0.0.0",
            "port":10007,
            "associate-environment": {
                "host":"KEYWORD_HOST",
                "port":"KEYWORD_PORT"
            }
        },
        "runtime":{
            "request-max-size":100000000,
            "request-buffer-queue-size":100,
            "keep-alive":true,
            "keep-alive-timeout":5,
            "graceful-shutown-timeout":15.0,
            "request-timeout":60,
            "response-timeout":60,
            "workers":1
        }
    }
}

The search configuration sets up the search behaviour according to the query analysis. The other configurations are not specific to the search service.

  • searching/document-index-name: name of the index where the documents are stored
  • searching/disable-document-analysis: document analysis is not mandatory and can be disabled
  • searching/qa/host: question answering subsystem host
  • searching/qa/port: question answering subsystem port
  • searching/qa/max-question-size: maximum size of the question (to run the QA subsystem)
  • searching/qa/max-ranked-doc: maximum number of ranked documents to which Q/A is applied
  • searching/qa/use-ssl: access the QA subsystem over SSL
  • searching/qa/no-ssl-verify: certificate verification setting for QA subsystem access
  • searching/suggester/number-of-suggestions: maximum number of suggestions
  • searching/suggester/spell-check: not yet implemented
  • searching/aggregator/host: hostname of searx
  • searching/aggregator/port: port of searx
  • searching/aggregator/index: index (or not) the searx results
  • searching/aggregator/engines: searx engines used
  • searching/aggregator/index-pipeline: index pipeline network configuration
  • searching/search-policy/semantic-cluster/semantic-quantizer-model: path to the clustering model used for "statistical" semantics
  • searching/search-policy/settings/basic-querying/uniq-word-query: transform the query into a bag of words
  • searching/search-policy/settings/basic-querying/boosted-uniq-word-query: weight words according to their frequency in the query
  • searching/search-policy/settings/basic-querying/cut-query: maximum number of unique words in the query
  • searching/search-policy/settings/advanced-querying/use-lemma: use the lemmatized field of the index
  • searching/search-policy/settings/advanced-querying/use-keywords: use the keywords field of the index
  • searching/search-policy/settings/advanced-querying/use-knowledge-graph: use the knowledge graph (the triples) of the index
  • searching/search-policy/settings/advanced-querying/use-concepts: use the concepts of the index
  • searching/search-policy/settings/advanced-querying/use-sentences: use sentence querying
  • searching/search-policy/settings/advanced-querying/querying/match-phrase-slop: slop of the match phrase clause
  • searching/search-policy/settings/advanced-querying/querying/match-phrase-boosting: default boosting value for match phrase
  • searching/search-policy/settings/advanced-querying/querying/match-sentence/number-and-symbol-filtering: filter symbols and numbers from sentences
  • searching/search-policy/settings/advanced-querying/querying/match-sentence/max-number-of-words: maximum length (in words) of the sentence
  • searching/search-policy/settings/advanced-querying/querying/match-keyword/number-and-symbol-filtering: filter numbers and symbols
  • searching/search-policy/settings/advanced-querying/querying/match-keyword/semantic-skip-highest-ranked-classes: when you use semantic classes (coming from clustering), the most common classes are often irrelevant; you can skip these classes
  • searching/search-policy/settings/advanced-querying/querying/match-keyword/semantic-max-boosting: query boosting in match-phrase clause
  • searching/search-policy/settings/advanced-querying/querying/match-svo/semantic-use-class-triple: create a query clause with all semantic classes
  • searching/search-policy/settings/advanced-querying/querying/match-svo/semantic-use-lemma-property-object: use lemma on subject, class on property and object
  • searching/search-policy/settings/advanced-querying/querying/match-svo/semantic-use-subject-lemma-object: use lemma on property, class on subject and object
  • searching/search-policy/settings/advanced-querying/querying/match-svo/semantic-use-subject-property-lemma: use lemma on object, class on subject and property
  • searching/search-policy/settings/advanced-querying/querying/match-svo/semantic-use-lemma-lemma-object: use lemma on subject and property, class on object
  • searching/search-policy/settings/advanced-querying/querying/match-svo/semantic-use-lemma-property-lemma: use lemma on subject and object, class on property
  • searching/search-policy/settings/advanced-querying/querying/match-svo/semantic-use-subject-lemma-lemma: use lemma on property and object, class on subject
  • searching/search-policy/settings/advanced-querying/querying/match-svo/semanic-max-boosting: not yet implemented
  • searching/search-policy/settings/advanced-querying/querying/match-concept/concept-boosting: boost the concept clause
  • searching/search-policy/settings/advanced-querying/querying/match-concept/concept-pruning: top N concepts used
  • searching/search-policy/settings/query-expansion/term-pruning: maximum number of terms used in the expansion
  • searching/search-policy/settings/query-expansion/suppress-number: filter numbers
  • searching/search-policy/settings/query-expansion/keep-word-collection-thresold-under: the frequency of documents in which the word appears should be less than this value
  • searching/search-policy/settings/query-expansion/word-boost-thresold-above: the frequency of the word in the document should be greater than this value
  • searching/search-policy/settings/scoring/normalize-score: normalize the Elasticsearch score by the maximum score
  • searching/search-policy/settings/scoring/document-query-intersection-penalty: normalization of the document and query intersection: no-normalization, by-query-size, by-union-size (Jaccard)
  • searching/search-policy/settings/scoring/run-clause-separately: clauses can be run separately, in which case the ranked lists are merged, or put in a single query
  • searching/search-policy/settings/scoring/expand-results: when you run clauses separately and merge the results, it is useful to expand the result list size to cover more ranked documents
  • searching/search-policy/settings/results/set-highlight: highlight snippets
  • searching/search-policy/settings/results/see-also/number-of-cross-links: compute the see-also graph with this number of cross-linked documents per ranked document
  • searching/search-policy/settings/results/excludes: exclude some fields from the returned list
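
The scoring options can be made concrete with a small Python sketch. It computes the overlap between query terms and document terms under the three modes listed for document-query-intersection-penalty, and normalizes a list of scores by its maximum. How tkeir combines these values with the Elasticsearch score is not described here, so the functions below are illustrative assumptions rather than the service's scoring code.

```python
def intersection_factor(query_terms, doc_terms, mode="by-query-size"):
    """Size of the query/document term overlap, normalized according to mode."""
    query_set, doc_set = set(query_terms), set(doc_terms)
    common = len(query_set & doc_set)
    if mode == "no-normalization":
        return float(common)
    if mode == "by-query-size":
        return common / len(query_set) if query_set else 0.0
    if mode == "by-union-size":
        # Jaccard index: intersection size over union size.
        union = len(query_set | doc_set)
        return common / union if union else 0.0
    raise ValueError(f"unknown mode: {mode}")


def normalize_scores(scores):
    """Divide every score by the maximum score (all zeros if no positive score)."""
    top = max(scores, default=0.0)
    return [score / top if top > 0 else 0.0 for score in scores]


query = ["semantic", "search", "engine"]
document = ["search", "engine", "index", "document"]
for mode in ("no-normalization", "by-query-size", "by-union-size"):
    print(mode, intersection_factor(query, document, mode))
print(normalize_scores([4.0, 2.0, 1.0]))  # [1.0, 0.5, 0.25]
```

In this example two of the three query terms appear in the document, so by-query-size gives 2/3, while by-union-size divides by the five distinct terms of the union and gives 2/5.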

Configure Search Network

Example of Configuration:

network configuration
{
    "network": {
        "host":"0.0.0.0",
        "port":8080,
        "associate-environment": {
            "host":"HOST_ENVNAME",
            "port":"PORT_ENVNAME"
        },
        "ssl":
        {
            "certificate":"path/to/certificate",
            "key":"path/to/key"
        }
    }
}

The network fields:

  • host: hostname

  • port: port of the service

  • associate-environment: if one of the previous fields is set in the associated environment variable, that value replaces the default one. This field is not mandatory.

      • "host": associated "host" environment variable
      • "port": associated "port" environment variable

  • ssl: SSL configuration. IN PRODUCTION IT IS MANDATORY TO USE A CERTIFICATE AND KEY THAT ARE *NOT* SELF SIGNED

      • certificate: certificate file

      • key: key file
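
The override mechanism of the associate-environment block can be sketched in Python as follows. The function name, the optional environ argument and the integer cast for the port are assumptions made for this example; the actual tkeir loader may resolve the values differently.

```python
import os


def resolve_network(network, environ=None):
    """Return the network settings with environment overrides applied.

    For each field listed under "associate-environment", the value of the
    named environment variable replaces the default when it is set.
    """
    environ = os.environ if environ is None else environ
    resolved = {k: v for k, v in network.items() if k != "associate-environment"}
    for field, env_name in network.get("associate-environment", {}).items():
        if env_name not in environ:
            continue  # variable not set: keep the default from the file
        value = environ[env_name]
        default = resolved.get(field)
        # Environment values are strings; cast back when the default is an integer.
        if isinstance(default, int) and not isinstance(default, bool):
            value = int(value)
        resolved[field] = value
    return resolved


network = {
    "host": "0.0.0.0",
    "port": 8080,
    "associate-environment": {"host": "HOST_ENVNAME", "port": "PORT_ENVNAME"},
}
# Only PORT_ENVNAME is set here, so the host keeps its default value.
print(resolve_network(network, {"PORT_ENVNAME": "9000"}))
# {'host': '0.0.0.0', 'port': 9000}
```

Passing the environment as an argument keeps the sketch easy to test; calling resolve_network(network) without it reads the real process environment.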

Configure Search runtime

Example of Configuration:

runtime configuration
{
    "runtime":{
        "request-max-size":100000000,
        "request-buffer-queue-size":100,
        "keep-alive":true,
        "keep-alive-timeout":5,
        "graceful-shutown-timeout":15.0,
        "request-timeout":60,
        "response-timeout":60,
        "workers":1
    }    
}

The runtime fields:

  • request-max-size: how big a request may be (bytes)

  • request-buffer-queue-size: request streaming buffer queue size

  • request-timeout: how long a request can take to arrive (sec)

  • response-timeout: how long a response can take to process (sec)

  • keep-alive: keep-alive

  • keep-alive-timeout: how long to hold a TCP connection open (sec)

  • graceful-shutown-timeout: how long to wait before force-closing non-idle connections (sec)

  • workers: number of workers for the service on a node

  • associate-environment: if one of the previous fields is set in the associated environment variable, that value replaces the default one. This field is not mandatory.

      • request-max-size: overwrite with environment variable
      • request-buffer-queue-size: overwrite with environment variable
      • request-timeout: overwrite with environment variable
      • response-timeout: overwrite with environment variable
      • keep-alive: overwrite with environment variable
      • keep-alive-timeout: overwrite with environment variable
      • graceful-shutown-timeout: overwrite with environment variable
      • workers: overwrite with environment variable

Run Search engine service

To run the command, simply type from the tkeir directory:

python3 thot/search_svc.py --config=<path to search configuration file>

or, if you installed the tkeir wheel:

tkeir-search-svc --config=<path to search configuration file>