Relation clustering

Relation clustering groups the SVO (subject-verb-object) triples extracted during the syntactic tagging phase into classes.

Relation clustering configuration

Example configuration:

relations.json

{
    "logger": {
        "logging-level": "{{ project.loglevel }}"
    },
    "relations": {
        "cluster":{
            "algorithm":"kmeans",
            "number-of-classes":16,
            "number-of-iterations":16,
            "seed":123456,       
            "batch-size":4096, 
            "embeddings":
            {
                "server":{
                    "host":"0.0.0.0",
                    "port":10005,
                    "associate-environment": {
                        "host":"SENT_EMBEDDING_HOST",
                        "port":"SENT_EMBEDDING_PORT"
                    },
                    "use-ssl":false,
                    "no-verify-ssl":true
                },
                "aggregate":{
                    "configuration":"{{ project.path }}/configs/embeddings.json"
                }
            }
        },
        "clustering-model":{
            "semantic-quantizer-model":"{{ project.path }}/resources/modeling/relation_names.model.pkl",
            "train-if-not-exists":true
        },
        "network": {
            "host":"0.0.0.0",
            "port":10013,
            "associate-environment": {
                "host":"CLUSTER_INFERENCE_HOST",
                "port":"CLUSTER_INFERENCE_PORT"
            }
        },
        "runtime":{
            "request-max-size":100000000,
            "request-buffer-queue-size":100,
            "keep-alive":true,
            "keep-alive-timeout":5,
            "graceful-shutown-timeout":15.0,
            "request-timeout":60,
            "response-timeout":60,
            "workers":1
        }
    }
}

The relation clustering configuration aggregates the serialize configuration and the logger configuration (at the top level). The cluster section defines how to reach the embedding server and the settings of the clustering algorithm:

algorithm: clustering algorithm, one of ["kmeans", "spericalkmeans" (not yet available)]
number-of-classes: number of cluster classes
number-of-iterations: number of k-means iterations
seed: random seed for k-means
batch-size: mini-batch k-means is used; the batch size is the number of vectors sent for each partial fit
embeddings: embedding server network information (host and port) or aggregation (server-less)
    server: server configuration
    aggregate: path to the embedding configuration file (configuration field)
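
The way these settings drive mini-batch k-means can be sketched in plain Python. This is only an illustration, not the tool's implementation: the function names (nearest, mini_batch_kmeans, predict) are hypothetical, the toy vectors stand in for real embeddings, and the sketch assumes that one iteration performs one partial fit on one randomly drawn batch.

```python
import random


def nearest(centroids, vector):
    """Return the index of the centroid closest to vector (squared Euclidean)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((c - x) ** 2 for c, x in zip(centroids[k], vector)),
    )


def mini_batch_kmeans(vectors, cluster_config):
    """Fit centroids with mini-batch updates driven by the "cluster" settings."""
    n_classes = cluster_config["number-of-classes"]
    n_iterations = cluster_config["number-of-iterations"]
    batch_size = cluster_config["batch-size"]
    rng = random.Random(cluster_config["seed"])  # seeded for reproducibility

    # Assumption: centroids start from randomly chosen input vectors.
    centroids = [list(v) for v in rng.sample(vectors, n_classes)]
    counts = [0] * n_classes

    for _ in range(n_iterations):
        # Assumption: one iteration is one partial fit on one batch.
        batch = rng.sample(vectors, min(batch_size, len(vectors)))
        for vector in batch:
            k = nearest(centroids, vector)
            counts[k] += 1
            rate = 1.0 / counts[k]  # per-centroid rate shrinks as it sees data
            centroids[k] = [
                (1.0 - rate) * c + rate * x for c, x in zip(centroids[k], vector)
            ]
    return centroids


def predict(centroids, vectors):
    """Assign each vector to the class of its nearest centroid."""
    return [nearest(centroids, v) for v in vectors]


# Toy stand-in for relation embeddings: two well-separated groups.
cluster_config = {
    "algorithm": "kmeans",
    "number-of-classes": 2,
    "number-of-iterations": 16,
    "seed": 123456,
    "batch-size": 32,
}
vectors = [[0.0, 0.0]] * 50 + [[10.0, 10.0]] * 50
centroids = mini_batch_kmeans(vectors, cluster_config)
labels = predict(centroids, vectors)
```

Because the random generator is seeded, running the sketch twice with the same configuration gives the same centroids. A smaller batch-size lowers the cost of each partial fit at the price of noisier centroid updates.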

Configure the relation clustering logger

The logger is configured at the top level of the JSON file, in the logger field.

Example configuration:

logger configuration

{
    "logger": {
        "logging-level": "debug"
    }    
}

The logger field is:

logging-level

It can be set to the following values:

debug for the debug level and developer information
info for informational messages
warning to display only warnings and errors
error to display only errors
critical to display only critical errors
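
As an illustration, the level names above map naturally onto the constants of Python's standard logging module. The sketch below assumes that mapping; the helper configure_logger and the logger name "relation-clustering" are hypothetical and are not taken from the tool itself.

```python
import logging

# Assumed mapping from the "logging-level" values to standard logging levels.
LEVELS = {
    "debug": logging.DEBUG,
    "info": logging.INFO,
    "warning": logging.WARNING,
    "error": logging.ERROR,
    "critical": logging.CRITICAL,
}


def configure_logger(config, name="relation-clustering"):
    """Return a logger whose level follows config["logger"]["logging-level"]."""
    level_name = config["logger"]["logging-level"]
    if level_name not in LEVELS:
        raise ValueError(f"unknown logging-level: {level_name!r}")
    logger = logging.getLogger(name)
    logger.setLevel(LEVELS[level_name])
    return logger


logger = configure_logger({"logger": {"logging-level": "warning"}})
# With "warning", informational messages are filtered out and errors pass.
info_enabled = logger.isEnabledFor(logging.INFO)
error_enabled = logger.isEnabledFor(logging.ERROR)
```

Rejecting unknown values early avoids silently falling back to a level the user did not ask for.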

Relation clustering tool

To run the tool, type the following command from the tkeir directory:

python3 thot/relation_clustering.py --config=<path to relation configuration file> -i <path to file with syntactic data extracted> -o <path to output folder>
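
When the tool has to be launched from another Python script, the same command line can be assembled programmatically. The sketch below uses only the options shown above; the helper names and the example paths are hypothetical placeholders.

```python
import subprocess


def build_command(config_path, input_path, output_dir):
    """Assemble the argument list for thot/relation_clustering.py."""
    return [
        "python3",
        "thot/relation_clustering.py",
        f"--config={config_path}",
        "-i", input_path,
        "-o", output_dir,
    ]


def run_relation_clustering(config_path, input_path, output_dir, tkeir_dir):
    """Run the tool from the tkeir directory; raises if it exits with an error."""
    command = build_command(config_path, input_path, output_dir)
    return subprocess.run(command, cwd=tkeir_dir, check=True)


# Hypothetical paths, shown only to illustrate the resulting argument list.
command = build_command("configs/relations.json", "data/syntax.json", "output")
```

Passing the arguments as a list, rather than as a single shell string, avoids quoting problems with paths that contain spaces.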