Relation clustering

Relation clustering groups the subject-verb-object (SVO) triples extracted during the syntactic tagging phase into classes.

Relation clustering configuration

Example configuration:

relations.json
{
    "logger": {
        "logging-level": "{{ project.loglevel }}"
    },
    "relations": {
        "cluster":{
            "algorithm":"kmeans",
            "number-of-classes":16,
            "number-of-iterations":16,
            "seed":123456,       
            "batch-size":4096, 
            "embeddings":
            {
                "server":{
                    "host":"0.0.0.0",
                    "port":10005,
                    "associate-environment": {
                        "host":"SENT_EMBEDDING_HOST",
                        "port":"SENT_EMBEDDING_PORT"
                    },
                    "use-ssl":false,
                    "no-verify-ssl":true
                },
                "aggregate":{
                    "configuration":"{{ project.path }}/configs/embeddings.json"
                }
            }
        },
        "clustering-model":{
            "semantic-quantizer-model":"{{ project.path }}/resources/modeling/relation_names.model.pkl",
            "train-if-not-exists":true
        },
        "network": {
            "host":"0.0.0.0",
            "port":10013,
            "associate-environment": {
                "host":"CLUSTER_INFERENCE_HOST",
                "port":"CLUSTER_INFERENCE_PORT"
            }
        },
        "runtime":{
            "request-max-size":100000000,
            "request-buffer-queue-size":100,
            "keep-alive":true,
            "keep-alive-timeout":5,
            "graceful-shutown-timeout":15.0,
            "request-timeout":60,
            "response-timeout":60,
            "workers":1
        }
    }
}
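
The associate-environment blocks in the configuration above name environment variables for the host and the port. The sketch below assumes that, when such a variable is set, its value replaces the default read from the file; this behaviour is an assumption and is not stated on this page. The helper resolve_endpoint is hypothetical and is not part of tkeir.

```python
import os


def resolve_endpoint(section, environ=os.environ):
    """Return (host, port) for a config section such as "network" or "server".

    Assumption: a set environment variable named in "associate-environment"
    overrides the default value given in the JSON file.
    """
    host = section["host"]
    port = section["port"]
    env_names = section.get("associate-environment", {})

    host_var = env_names.get("host")
    if host_var and host_var in environ:
        host = environ[host_var]

    port_var = env_names.get("port")
    if port_var and port_var in environ:
        port = environ[port_var]

    # Environment values are strings, so normalise the port to an integer.
    return host, int(port)


# The "network" block from the example configuration.
network = {
    "host": "0.0.0.0",
    "port": 10013,
    "associate-environment": {
        "host": "CLUSTER_INFERENCE_HOST",
        "port": "CLUSTER_INFERENCE_PORT",
    },
}
print(resolve_endpoint(network))
```

Passing the environment as a parameter keeps the helper easy to test without touching the real process environment.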

The relation clustering configuration aggregates the serialize configuration and the logger configuration (at top level). The cluster section defines access to the embedding server and the settings of the clustering algorithm:

  • algorithm: clustering algorithm, one of ["kmeans", "spericalkmeans" (not yet available)],
  • number-of-classes: number of clusters (classes) to create,
  • number-of-iterations: number of k-means iterations,
  • seed: random seed used by k-means,
  • batch-size: mini-batch k-means is used; the batch size is the number of vectors sent for each partial fit,
  • embeddings: how embeddings are obtained, either from an embedding server (host and port) or by server-less aggregation:
      • server: embedding server configuration,
      • aggregate: path to the embeddings configuration file.
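
To illustrate how number-of-classes, number-of-iterations, seed and batch-size interact, here is a minimal pure-Python sketch of mini-batch k-means. It is not the tkeir implementation, which this page does not describe; the function names and the per-center learning-rate update are assumptions chosen for illustration.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centers, vector):
    # Index of the center closest to the vector.
    return min(range(len(centers)), key=lambda k: squared_distance(centers[k], vector))


def mini_batch_kmeans(vectors, number_of_classes, number_of_iterations, seed, batch_size):
    rng = random.Random(seed)  # the seed makes the initialisation reproducible
    centers = [list(v) for v in rng.sample(vectors, number_of_classes)]
    counts = [0] * number_of_classes  # vectors absorbed by each center so far

    for _ in range(number_of_iterations):  # one iteration = one pass over the data
        for start in range(0, len(vectors), batch_size):
            batch = vectors[start:start + batch_size]
            # Assign the whole batch first, with the centers frozen.
            assignments = [nearest(centers, v) for v in batch]
            # Partial fit: move each center towards its newly assigned vectors.
            for v, k in zip(batch, assignments):
                counts[k] += 1
                rate = 1.0 / counts[k]
                centers[k] = [(1 - rate) * c + rate * x for c, x in zip(centers[k], v)]
    return centers


points = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
centers = mini_batch_kmeans(points, number_of_classes=2, number_of_iterations=16,
                            seed=123456, batch_size=2)
print(len(centers))
```

A smaller batch size means more, cheaper updates per pass; a larger one brings each assignment step closer to standard k-means.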

Configure the relation clustering logger

The logger is configured at the top level of the JSON file, in the logger field.

Example configuration:

logger configuration
{
    "logger": {
        "logging-level": "debug"
    }    
}

The logger has a single field:

  • logging-level

It can be set to the following values:

  • debug: debug level, including developer information
  • info: informational messages
  • warning: warnings and errors only
  • error: errors only
  • critical: critical errors only
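
As an illustration, the level names listed above can be mapped onto the levels of Python's standard logging module. This is a sketch under the assumption that the names correspond to the standard levels of the same name; the helper logging_level_from_config is hypothetical and is not part of tkeir.

```python
import json
import logging

# Assumed mapping from the documented values to standard logging levels.
LEVELS = {
    "debug": logging.DEBUG,
    "info": logging.INFO,
    "warning": logging.WARNING,
    "error": logging.ERROR,
    "critical": logging.CRITICAL,
}


def logging_level_from_config(config_text):
    """Read the top-level "logger" field and return a logging level."""
    config = json.loads(config_text)
    name = config["logger"]["logging-level"]
    try:
        return LEVELS[name.lower()]
    except KeyError:
        raise ValueError(f"unsupported logging-level: {name!r}") from None


level = logging_level_from_config('{"logger": {"logging-level": "debug"}}')
logging.basicConfig(level=level)
logging.getLogger("relations").debug("logger configured")
```

Rejecting unknown values early makes a mistyped level visible at start-up instead of silently falling back to a default.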

Relation clustering tool

To run the tool, type the following command from the tkeir directory:

python3 thot/relation_clustering.py --config=<path to relation configuration file> -i <path to file with syntactic data extracted> -o <path to output folder>
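
The same command can also be launched from a Python script, for example in a batch pipeline. The sketch below only assembles the arguments shown above; the helper names and the example paths are hypothetical.

```python
import subprocess


def build_command(config_path, input_path, output_dir):
    """Assemble the argument list for the relation clustering tool."""
    return [
        "python3",
        "thot/relation_clustering.py",
        f"--config={config_path}",
        "-i", input_path,
        "-o", output_dir,
    ]


def run_relation_clustering(tkeir_dir, config_path, input_path, output_dir):
    # Run from the tkeir directory; check=True raises on a non-zero exit code.
    command = build_command(config_path, input_path, output_dir)
    return subprocess.run(command, cwd=tkeir_dir, check=True)


# Hypothetical paths, for illustration only.
print(build_command("configs/relations.json", "data/syntax.json", "output"))
```

Passing the arguments as a list avoids shell quoting issues when paths contain spaces.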