Chaos Engine experiments with AWS

This quick start guide demonstrates how to run Chaos Engine experiments on the AWS cloud platform.

Chaos Engine host configuration on an AWS EC2 instance

The Chaos Engine is a container-based service, so use a Linux OS for the Chaos Engine host. Once the host is set up and running successfully, it creates three containers:

  1. Chaos Engine
  2. HashiCorp Vault
  3. Vault loader

Prerequisites

Provision a Linux EC2 instance to serve as the Chaos Engine host machine. In this scenario we provision an Ubuntu 18.04 EC2 instance, on which you need to install Docker and Docker Compose.

  • First, create an AWS EC2 instance.
  • If you have an AWS account, log in to the AWS Management Console. If not, set up a free tier instance in Amazon EC2: from the link, click Try Amazon EC2 for Free and enter the requested details to create your account. Once you have signed up, log in to your account.
  • Provision a couple of Linux (Ubuntu) EC2 instances for the Chaos Engine.
  • Follow the EC2 instance creation instructions here.
  • This scenario uses a total of 3 AWS EC2 instances: one is the host and the remaining 2 are target instances.
  • Open the Amazon EC2 dashboard. Find your new running instance in the list and click to highlight it. Click Connect for information on how to connect to this instance using SSH.

The following snippet from the AWS console shows how to connect to the EC2 instance.

$ ssh -i "YourKeyName.pem" ubuntu@publicDNSname

After a successful login, start installing the Chaos Engine host.

Step 1: Docker Installation

Follow the Docker installation instructions here

Verify Docker Installation

To verify the Docker installation, run docker ps.

Expected output is:

user@host:~$ docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES

If the output looks similar to the following example, open a new terminal session or simply log out and log in again.

user@host:~$ docker ps
Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get http://%2Fvar%2Frun%2Fdocker.sock/v1.40/containers/json: dial unix /var/run/docker.sock: connect: permission denied

Step 2: Install Docker Compose

For detailed instructions on how to install Docker Compose, see the official Docker documentation.

Verify Docker Compose Installation

user@host:~$ docker-compose -v
docker-compose version 1.25.0, build 0a186604

If the command docker-compose fails after installation, check your path. You can also create a symbolic link to /usr/bin or any other directory in your path.

Step 3: Download The Latest Chaos Engine Sources

user@host:~$ git clone https://github.com/thalesgroup/chaos-engine
Cloning into 'chaos-engine'...
...
Checking connectivity... done.

Step 4: Pull Chaos Engine Image

Pull the latest Chaos Engine image from DockerHub.

user@host:~$ cd chaos-engine/
user@host:~/chaos-engine$ docker pull thalesgroup/chaos-engine:latest

Step 5: Configure

Basic Framework Setup

user@host:~/chaos-engine$ echo "holidays=NONSTOP
VAULT_TOKEN=00000000-0000-0000-0000-000000000000
VAULT_SCHEME=http
VAULT_HOST=vault
VAULT_PORT=8200
VAULT_10=true
SPRING_PROFILES_ACTIVE=DEVELOPMENT

automatedMode=false
CHAOS_SECURITY_ENABLED=false" > .env
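Before starting the containers, it can help to confirm that the .env file contains the Vault settings shown above. The following is a minimal sketch, not part of the Chaos Engine itself: the function names are hypothetical, and treating the four VAULT_* keys as required is an assumption based only on the example.

```python
from typing import Dict, List

# Keys copied from the .env example; treating them as mandatory is an assumption.
EXPECTED_KEYS = ("VAULT_TOKEN", "VAULT_SCHEME", "VAULT_HOST", "VAULT_PORT")


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a KEY=VALUE line: {raw!r}")
        settings[key.strip()] = value.strip()
    return settings


def missing_keys(settings: Dict[str, str]) -> List[str]:
    """Return the expected keys that are absent or empty."""
    return [key for key in EXPECTED_KEYS if not settings.get(key)]


def vault_address(settings: Dict[str, str]) -> str:
    """Assemble the Vault address from scheme, host and port."""
    return "{VAULT_SCHEME}://{VAULT_HOST}:{VAULT_PORT}".format(**settings)


SAMPLE_ENV = """holidays=NONSTOP
VAULT_TOKEN=00000000-0000-0000-0000-000000000000
VAULT_SCHEME=http
VAULT_HOST=vault
VAULT_PORT=8200
VAULT_10=true
SPRING_PROFILES_ACTIVE=DEVELOPMENT

automatedMode=false
CHAOS_SECURITY_ENABLED=false
"""

settings = parse_env(SAMPLE_ENV)
print(missing_keys(settings), vault_address(settings))
```

To check a real file, pass the contents of .env to parse_env instead of SAMPLE_ENV.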

Configure Experiment Modules

Let's say you want to execute experiments targeting a Kubernetes cluster. First check the experiment modules documentation to see which configuration options are available, then add all configuration properties to the vault-secrets.json file located in ~/developer-tools/vault-loader. Example content of vault-secrets.json:

{
  "kubernetes": "",
  "kubernetes.url": "https://127.127.127.127",
  "kubernetes.token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiaWF0IjoxNTE2MjM5MDIyfQ.SflKxwRJSMeKKF2QT4fwpMeJf36POk6yJV_adQssw5c",
  "kubernetes.averageMillisPerExperiment": "30000"
}

Example content of vault-secrets.json for EC2 instances:

{
  "aws.ec2": "true",
  "aws.accessKeyId": "<access_key_id>",
  "aws.secretAccessKey": "<secret_access_key>",
  "aws.region": "us-east-1",
  "aws.ec2.filter.tag.ChaosVictim": "true",
  "aws.ec2.averageMillisPerExperiment": "30000",
  "aws.ec2.sshprivatekeys.<your_Key_Name>": "<private_key>"
}
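
Hand-edited JSON is easy to break, for example with a typographic quote or an unescaped newline inside a private key. As a minimal sketch, the EC2 entries can be generated with Python's json module instead. The function names are hypothetical; the property names are the ones from the example, and the values passed in are placeholders.

```python
import json
from typing import Dict


def build_ec2_secrets(access_key_id: str, secret_access_key: str, region: str,
                      key_name: str, private_key: str,
                      millis_per_experiment: int = 30000) -> Dict[str, str]:
    """Assemble the EC2 properties from the example as a flat string-to-string map."""
    return {
        "aws.ec2": "true",
        "aws.accessKeyId": access_key_id,
        "aws.secretAccessKey": secret_access_key,
        "aws.region": region,
        "aws.ec2.filter.tag.ChaosVictim": "true",
        "aws.ec2.averageMillisPerExperiment": str(millis_per_experiment),
        "aws.ec2.sshprivatekeys." + key_name: private_key,
    }


def render_secrets(secrets: Dict[str, str]) -> str:
    """Serialize to JSON; json.dumps escapes newlines and always emits straight quotes."""
    text = json.dumps(secrets, indent=2)
    json.loads(text)  # round-trip as a sanity check that the output parses
    return text


example = build_ec2_secrets("<access_key_id>", "<secret_access_key>", "us-east-1",
                            "my-key", "line one\nline two")
print(render_secrets(example))
```

The rendered text can then be written to vault-secrets.json, merged with any other module properties you need.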

Important Note:

  • Make sure your IAM user has full EC2 access (for example, through the AmazonEC2FullAccess managed policy).
  • The “ChaosVictim” = “true” tag must exist on the target EC2 instances.
  • Allow all the required ports (80, 8080, 8200 and 22) on the Chaos Engine host machine.
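
One way to confirm the port requirement is to attempt a TCP connection to each port from your workstation. The sketch below uses only the Python standard library; the function names are hypothetical. A port is reported as open only if the security group allows it and a service is actually listening, so ports 8080 and 8200 are expected to fail until the Chaos Engine containers are running.

```python
import socket
from typing import Dict, Iterable

# Ports listed in the note above.
REQUIRED_PORTS = (22, 80, 8080, 8200)


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts and DNS failures
        return False


def check_ports(host: str, ports: Iterable[int] = REQUIRED_PORTS,
                timeout: float = 2.0) -> Dict[int, bool]:
    """Map each port to whether it accepted a connection."""
    return {port: port_is_open(host, port, timeout) for port in ports}
```

For example, check_ports("yourpublicIP") returns a dictionary such as {22: True, 80: False, ...}, where yourpublicIP stands for the address of your Chaos Engine host.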

Configure Datadog Integration

Enable Datadog

First, sign up with Datadog to enable the integration with the Chaos Engine.

Create your account through the Datadog sign up page. You get a 14-day free trial account, which is enough for testing.

Once your account is set up, log in to the Datadog portal, go to APIs and hover the mouse over the key to display your API key.

If you are going to ship data to Datadog, run the following command, replacing $YOUR_API_KEY with your real Datadog API key.

user@host:~/chaos-engine$ echo "DD_API_KEY=$YOUR_API_KEY" > .datadog_api_key 

Disable Datadog

If you don't need the Datadog integration, keep .datadog_api_key empty.

user@host:~/chaos-engine$ touch .datadog_api_key 

Step 6: Start the Chaos Engine host

Start the Engine by running docker-compose up.

user@host:~/chaos-engine$ docker-compose up

When you see Experiments total count: 0, your Chaos Engine instance is up and ready.

chaosengine_1   | {"@timestamp":"2019-11-28T18:07:36.491Z","@version":"1","message":"Experiments total count: 0","logger_name":"com.thales.chaos.experiment.ExperimentManager","thread_name":"chaos-10","level":"INFO","level_value":20000,"count":0,"env":"DEVELOPMENT","chaos-host":"b4bd5f0829d6@172.18.0.4"}

Once you see the above log on the Chaos Engine host, you are ready for chaos experiments. You can also look for the logs related to the Chaos Engine host in Datadog. The following is a snippet from Datadog.
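
Watching for the readiness message can also be automated. The sketch below assumes the log format shown above: a service prefix, a "|" separator, then one JSON object per line. The function names are hypothetical.

```python
import json
from typing import Iterable, Optional

READY_MESSAGE = "Experiments total count: 0"


def parse_compose_line(line: str) -> Optional[dict]:
    """Return the JSON object after the 'service | ' prefix, or None if there is none."""
    _, sep, payload = line.partition("|")
    if not sep:
        return None
    try:
        record = json.loads(payload.strip())
    except json.JSONDecodeError:
        return None  # plain-text lines from other containers are ignored
    return record if isinstance(record, dict) else None


def engine_is_ready(lines: Iterable[str]) -> bool:
    """True once any line carries the readiness message."""
    for line in lines:
        record = parse_compose_line(line)
        if record is not None and record.get("message") == READY_MESSAGE:
            return True
    return False
```

Feeding the captured docker-compose output to engine_is_ready line by line gives a simple readiness check for scripts.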

Chaos Engine Experiments

Each Experiment Module is responsible for interacting with the API endpoint of its Cloud Platform or Orchestration tool. The Experiment Modules discover nodes that can be experimented upon, ensure experiments keep a minimum blast radius, and perform the API calls necessary to create an experiment. For details, refer to the link below.

Experiment Modules

There are two modes of experiments in the Chaos Engine.

1. Automated mode

In automated mode, both the platform and the experiment are selected randomly.

2. User defined mode

In user defined mode, you can run shell and API experiments and choose both the platform and the experiment type. To run user defined experiments, you need to provide an input describing the experiments.

Note: Shell experiments require cattle resources, meaning that your target resource must be backed by some kind of auto scaling mechanism. For example, an EC2 instance must be backed by an Auto Scaling group, and an AWS EKS pod must belong to a deployment with more than one replica.

User defined experiments

The parameter for a User Defined Experiment is a single object with two variables. The platformType variable should be the name of the Platform you wish to experiment on. The experimentCriteria variable holds a list of objects, each with containerIdentifier, experimentMethods[], and optional specificContainerTargets[], to identify the aggregate container group, the type of experiments to run, and any specific targets which may not be identical (e.g., a MongoDB Primary node).

The Experiment Structure object can be sent as the body of a POST request to /experiment/build to create the experiments. The server expects a Content-Type of application/json. See the REST API for more information. The following are sample JSON payloads for user defined experiments. You may find more information about this at

Example of a user defined API experiment

{
    "platformType": "KubernetesPlatform",
    "experimentCriteria": [
        {
            "containerIdentifier": "nginx-deployment-574b87c764",
            "experimentMethods": [
                "deletePod"
            ],
            "specificContainerTargets": [ ]
        }
    ]
}

Example of a user defined Shell experiment

{
    "platformType": "KubernetesPlatform",
    "experimentCriteria": [
        {
            "containerIdentifier": "nginx-deployment-574b87c764",
            "experimentMethods": [
                "forkBomb.sh"
            ],
            "specificContainerTargets": [ ]
        }
    ]
}
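
The request described above can also be prepared outside the Swagger UI. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, the base URL with port 8080 is taken from the Swagger UI address used in this guide, and no authentication header is added on the assumption that security is disabled as in the .env example.

```python
import json
import urllib.request
from typing import Iterable, Optional


def build_experiment_payload(platform_type: str, container_identifier: str,
                             experiment_methods: Iterable[str],
                             specific_targets: Optional[Iterable[str]] = None) -> dict:
    """Assemble the experiment structure in the shape of the JSON examples."""
    return {
        "platformType": platform_type,
        "experimentCriteria": [
            {
                "containerIdentifier": container_identifier,
                "experimentMethods": list(experiment_methods),
                "specificContainerTargets": list(specific_targets or []),
            }
        ],
    }


def build_request(base_url: str, payload: dict) -> urllib.request.Request:
    """Prepare (but do not send) the POST request to /experiment/build."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/experiment/build",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


payload = build_experiment_payload(
    "KubernetesPlatform", "nginx-deployment-574b87c764", ["deletePod"])
request = build_request("http://yourpublicIP:8080", payload)
print(request.get_method(), request.full_url)
```

Passing the prepared request to urllib.request.urlopen sends it. Keeping construction separate from sending makes it easy to inspect the payload first.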

Experiment Modules

The following experiments are available in the Chaos Engine. Refer to the link below for the full list of available experiments.

Experiment Modules

  • AWS EC2 instance: Stop; Restart; Deletion of Security Group; Instance termination only in Auto Scaling Group; Shell experiments
  • Kubernetes: Delete at POD layer; Shell experiment at Container layer
  • AWS RDS: Take a snapshot; Restart; Deletion of Security Group; Random restart of cluster nodes; Failover
  • Pivotal Cloud Foundry: Restart; Rescale; Restage; Shell Experiments
  • Shell Experiments: BurnIO; CPUBurn; DNS Block; Fill Disk; Forkbomb; Ram consumer; Null route; Generator starvation; Process killer

Run Experiments

To run the experiments, make sure that docker-compose is up and running on the Chaos Engine host. You can then open the Swagger UI to run the experiments.

The following is the URL for the Swagger UI; replace yourpublicIP with the IP address of your host.

http://yourpublicIP:8080/swagger-ui.html#/

To check the available targets/platforms for the experiments, use the Swagger UI: look for the platform section and click GET /platform.

Click Execute.

Refer to the response body section. In the screenshot below you can see instance details such as name, ID and tags for the instances we created as targets for the experiments. A total of 2 instances (targets) are available for experiments.

Next, go to the Experiments section and run the experiments: POST /experiment/start.

Now start the experiment by clicking /experiment/start (Start Random Experiment): click /experiment/start -> Try it out -> Execute.

The Chaos Engine has randomly chosen the “removeSecurityGroups” experiment, which removes the security group from an EC2 instance. Refer to the screenshot below from the AWS console.

Logs from the Chaos Engine host.

chaosengine_1   | {"@timestamp":"2020-09-09T06:36:24.124Z","@version":"1","message":"Evaluated experiment: Experiment of removeSecurityGroups against chaosengtarget2 (Sabilchaos) [i-06963df97e880ef98]","logger_name":"com.thales.chaos.experiment.ExperimentManager","thread_name":"ForkJoinPool-4-worker-3","level":"INFO","level_value":20000,"experimentMethod":"removeSecurityGroups","host":"i-06963df97e880ef98","experimentid":"880f3119-a86e-4420-adbf-dee8bdfdd21f","platform":"AwsEC2Platform","experiment":{"id":"880f3119-a86e-4420-adbf-dee8bdfdd21f","experimentState":"STARTED","container":{"shellCapabilities":{},"availabilityZone":"us-east-1a","instanceId":"i-06963df97e880ef98","keyName":"Sabilchaos","name":"chaosengtarget2","publicAddress":"54.227.31.237","imageId":"ami-0761dd91277e34178","groupIdentifier":"No Grouping Identifier Found","simpleName":"chaosengtarget2 (Sabilchaos) [i-06963df97e880ef98]","routableAddress":"54.227.31.237","aggregationIdentifier":"chaosengtarget2","cattle":false,"sshcapable":false,"containerType":"AwsEC2Container","experimentStartTime":1599633339.884297000,"identity":3165379631,"knownMissingCapabilities":[]},"experimentType":"NETWORK","selfHealingMethod":{},"startTime":1599633339.884297000,"lastSelfHealingTime":null,"selfHealingCounter":0,"experimentMethodName":"removeSecurityGroups","experimentLayerName":"AwsEC2Container","wasSelfHealingRequired":null},"env":"PROD","chaos-host":"bb8c3770c057@aws:i-086601e7b0a84aa54:257969174391:us-east-1"}

Logs from Datadog.

After the experiment completes, you can see that the Chaos Engine has reverted the Security Group as part of the experiment. Below is the snippet for the same.

Note: you can configure/enable a monitoring tool to get real-time analysis of the targets.

Vault Secret modification

The Chaos Engine experiments use HashiCorp Vault to store the credentials. As you have seen, the vault-secrets.json file must be updated to hold the AWS access key and secret key used to access the AWS resources. If you would like to add new experiments or update the existing information in the vault, you can make those changes.

There are two ways to modify the vault secrets.

1. Update the vault-secrets.json file

Once you update vault-secrets.json, run the following command for the changes to take effect.

$ docker-compose build vault vault-loader

2. Update using the Vault GUI

To log in to the Vault GUI, use the following URL, replacing yourpublicIP with the IP address of the Chaos Engine host.

http://yourpublicIP:8200/ui

When you open the above URL you will be asked for credentials. Choose the “Token” method and enter the token value “00000000-0000-0000-0000-000000000000”.

Go to secret, make your changes and save them.

After modifying the above parameters, run Refresh from the Swagger UI for the changes to take effect.

Experiments on AWS Elastic Kubernetes Service (EKS)

We will perform an experiment on an AWS EKS cluster. As a prerequisite, create an AWS EKS cluster. Refer to Kubernetes configuration.

In this scenario we have created an EKS cluster with 2 nodes and 1 Nginx deployment. The following is the screenshot of the EKS cluster from AWS.

Once your EKS cluster is ready, proceed with the Nginx deployment.

kubectl apply -f https://k8s.io/examples/controllers/nginx-deployment.yaml

$ kubectl get pods
NAME                                READY   STATUS    RESTARTS   AGE
nginx-deployment-574b87c764-jlgwr   1/1     Running   0          19h
nginx-deployment-574b87c764-kbmqh   1/1     Running   0          3h18m
nginx-deployment-574b87c764-rlv6f   1/1     Running   0          19h

$ kubectl get deployment
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   3/3     3            3           19h

$ kubectl get rs
NAME                          DESIRED   CURRENT   READY   AGE
nginx-deployment-574b87c764   3         3         3       19h

Now get the secret information that the Chaos Engine needs to connect by executing the following command. You need the token and namespace values to update the secrets.

$ kubectl describe secret
Name:         chaos-engine-serviceaccount-token-r6ptr
Namespace:    default
Labels:       <none>
Annotations:  kubernetes.io/service-account.name: chaos-engine-serviceaccount
              kubernetes.io/service-account.uid: 7ec6da4a-a006-49e3-ac5e-16735be3853c

Type:  kubernetes.io/service-account-token

Data
====
ca.crt:     1025 bytes
namespace:  7 bytes
token:      eyJhbGciOiJSUzI1NiIsImtpZCI6IllIX2cyMGtaSjVsVTUzZmFHdTVRS3U2TTZCdU9Wc0dUSk9HUnByN2t1VXMifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJkZWZhdWx0Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZWNyZXQubmFtZSI6ImNoYW9zLWVuZ2luZS1zZXJ2aWNlYWNjb3VudC10b2tlbi1yNnB0ciIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50Lm5hbWUiOiJjaGFvcy1lbmdpbmUtc2VydmljZWFjY291bnQiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC51aWQiOiI3ZWM2ZGE0YS1hMDA2LTQ5ZTMtYWM1ZS0xNjczNWJlMzg1M2MiLCJzdWIiOiJzeXN0ZW06c2VydmljZWFjY291bnQ6ZGVmYXVsdDpjaGFvcy1lbmdpbmUtc2VydmljZWFjY291bnQifQ.Spj9BlS3Fq1s7SOIE4o_ySRYD6G0yMY_hUzSkLenymUPNaUDNbI3B6w_6VIaYkOUu34_nx5yOOD1RJWktXHqMmmEb_qjl5HnjsP41ugQU_LSQfIraSLbozE4ZP92Yx9Leq2bVYQN6JSv71GBOQOHRE91OSFfW2ENjna_MHHkgfw5usDZw3vF6d8lBjJA0DVbhgeZlAL9PPpaGLj-QjB6p2gg-BXXCC5dUiJsWZ77w8SwLyBx4NdKaLldPUisck-c8PVmGqWIawpHFM_C6fAxVeJv0TiyVou8Y-_7DGJzxhEDvSZ5Wu01qt0gp0mFcADYHorhkyi80CYVOTp22HuQHQ
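
Before copying the token into the vault, it can be useful to confirm which namespace and service account it belongs to. A service account token like the one above is a JWT, so its middle segment is base64url-encoded JSON. The sketch below decodes that segment without verifying the signature, which is enough for inspection but not for any trust decision. The function names are hypothetical; the claim names are the ones carried by the token shown above.

```python
import base64
import json


def decode_jwt_payload(token: str) -> dict:
    """Decode the claims of a JWT WITHOUT verifying its signature."""
    parts = token.strip().split(".")
    if len(parts) != 3:
        raise ValueError("expected three dot-separated JWT segments")
    segment = parts[1]
    padded = segment + "=" * (-len(segment) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(padded))


def service_account_info(token: str) -> dict:
    """Extract the namespace and service account name from the token claims."""
    claims = decode_jwt_payload(token)
    return {
        "namespace": claims.get("kubernetes.io/serviceaccount/namespace"),
        "service_account": claims.get(
            "kubernetes.io/serviceaccount/service-account.name"),
    }
```

Calling service_account_info with the token value from the kubectl output returns a small dictionary you can compare against the Namespace and Annotations fields printed above.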

Once you have the above information, update vault-secrets.json or enter the same information in the vault.

Make sure that the updated information takes effect. Refer to Vault Secret modification.

The kubernetes.url value is the API server endpoint shown in the AWS EKS console, and the kubernetes.token value is taken from the output of the kubectl describe secret command.

{
  "kubernetes": "",
  "kubernetes.url": "https://F14806126C9620A678A237F7C4AFE7AF.gr7.us-east-2.eks.amazonaws.com",
  "kubernetes.token": "eyJhbGciOiJSUzI1NiIsImtpZCI6IllIX2cyMGtaSjVsVTUzZmFHdTVRS3U2TTZCdU9Wc0dUSk9HUnByN2t1VXMifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJkZWZhdWx0Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZWNyZXQubmFtZSI6ImNoYW9zLWVuZ2luZS1zZXJ2aWNlYWNjb3VudC10b2tlbi1yNnB0ciIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50Lm5hbWUiOiJjaGFvcy1lbmdpbmUtc2VydmljZWFjY291bnQiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC51aWQiOiI3ZWM2ZGE0YS1hMDA2LTQ5ZTMtYWM1ZS0xNjczNWJlMzg1M2MiLCJzdWIiOiJzeXN0ZW06c2VydmljZWFjY291bnQ6ZGVmYXVsdDpjaGFvcy1lbmdpbmUtc2VydmljZWFjY291bnQifQ.Spj9BlS3Fq1s7SOIE4o_ySRYD6G0yMY_hUzSkLenymUPNaUDNbI3B6w_6VIaYkOUu34_nx5yOOD1RJWktXHqMmmEb_qjl5HnjsP41ugQU_LSQfIraSLbozE4ZP92Yx9Leq2bVYQN6JSv71GBOQOHRE91OSFfW2ENjna_MHHkgfw5usDZw3vF6d8lBjJA0DVbhgeZlAL9PPpaGLj-QjB6p2gg-BXXCC5dUiJsWZ77w8SwLyBx4NdKaLldPUisck-c8PVmGqWIawpHFM_C6fAxVeJv0TiyVou8Y-_7DGJzxhEDvSZ5Wu01qt0gp0mFcADYHorhkyi80CYVOTp22HuQHQ",
  "kubernetes.averageMillisPerExperiment": "30000"
}

Then create a file called chaos-engine-service-account.yaml for the role binding with the content below, replacing {{namespace}} with the actual namespace, and apply it.

For the EKS cluster configuration for the Chaos Engine, you may refer to the following: EKS configuration for the Chaos Engine

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: chaos-engine-role
  namespace: {{namespace}}
rules:
- apiGroups:
  - apps
  resources:
  - daemonsets
  - daemonsets/status
  - deployments
  - deployments/status
  - replicasets
  - replicasets/status
  - statefulsets
  - statefulsets/status
  verbs:
  - get
  - list
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - delete


- apiGroups:
  - ""
  resources:
  - pods
  - pods/status
  - replicationcontrollers/status
  verbs:
  - get
  - list

- apiGroups:
  - ""
  resources:
  - pods/exec
  verbs:
  - create
  - get

---

apiVersion: v1
kind: ServiceAccount
metadata:
  name: chaos-engine-serviceaccount
  namespace: {{namespace}}

---

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: chaos-engine-rolebinding
  namespace: {{namespace}}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: chaos-engine-role
subjects:
- kind: ServiceAccount
  name: chaos-engine-serviceaccount
  namespace: {{namespace}}
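
Replacing the {{namespace}} placeholder by hand in four places is easy to get wrong. As a minimal sketch, the substitution can be done with a few lines of Python. The function name is hypothetical; the namespace check follows the Kubernetes rule that namespace names are lowercase DNS labels of at most 63 characters.

```python
import re

PLACEHOLDER = "{{namespace}}"
# Lowercase alphanumerics and '-', starting and ending with an alphanumeric.
DNS_LABEL = re.compile(r"^[a-z0-9]([-a-z0-9]*[a-z0-9])?$")


def render_manifest(template: str, namespace: str) -> str:
    """Replace every {{namespace}} placeholder with the given namespace."""
    if len(namespace) > 63 or not DNS_LABEL.match(namespace):
        raise ValueError(f"not a valid namespace name: {namespace!r}")
    if PLACEHOLDER not in template:
        raise ValueError("template contains no {{namespace}} placeholder")
    return template.replace(PLACEHOLDER, namespace)


snippet = "metadata:\n  name: chaos-engine-role\n  namespace: {{namespace}}\n"
print(render_manifest(snippet, "default"))
```

Reading the template text from a file, rendering it, and saving the result as chaos-engine-service-account.yaml produces a manifest ready for kubectl.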

Apply the YAML file.

kubectl apply -f chaos-engine-service-account.yaml

Once you have updated the secrets, go ahead and run the get platform request from the Swagger UI.

Now you can initiate the experiments using the automated or user defined method. When running the experiments, you should have monitoring configured or enabled for your Kubernetes nodes so that you can observe real data.

We are going to run the experiments on the AWS EKS cluster. Once you execute the run experiment API from the Swagger UI, you will see the following response.

In the above experiment response you can see that the Chaos Engine is going to run the experiment on AWS EKS. See the highlighted section of the response, where the experiment platform is the EKS pods and the experiment name is memoryConsumer.sh.

Logs captured from Datadog for this experiment. Here are the logs from the Chaos Engine host, which show that the experiment started.

chaosengine_1   | {"@timestamp":"2020-09-09T06:57:12.614Z","@version":"1","message":"Evaluated experiment: Experiment of memoryConsumer.sh against nginx-deployment-574b87c764-7wp8h (default)","logger_name":"com.thales.chaos.experiment.ExperimentManager","thread_name":"ForkJoinPool-42-worker-3","level":"INFO","level_value":20000,"experimentMethod":"memoryConsumer.sh","host":"nginx-deployment-574b87c764-7wp8h","experimentid":"df244f9a-85f6-4bad-80b5-5787c64a43ec","platform":"KubernetesPlatform","experiment":{"id":"df244f9a-85f6-4bad-80b5-5787c64a43ec","experimentState":"STARTED","container":{"shellCapabilities":{"sleep":true,"dd":true,"awk":true,"/bin/sh":true,"grep":true,"cat":true},"uuid":"a6eb617c-39e9-4fc3-a333-e400f335f95c","podName":"nginx-deployment-574b87c764-7wp8h","namespace":"default","ownerKind":"REPLICA_SET","ownerName":"nginx-deployment-574b87c764","targetedSubcontainer":"nginx","simpleName":"nginx-deployment-574b87c764-7wp8h (default)","aggregationIdentifier":"nginx-deployment-574b87c764","cattle":true,"containerType":"KubernetesPodContainer","experimentStartTime":1599634333.020662000,"identity":3504227930,"knownMissingCapabilities":[]},"experimentType":"STATE","selfHealingMethod":{},"startTime":1599634333.020662000,"lastSelfHealingTime":null,"selfHealingCounter":0,"experimentMethodName":"memoryConsumer.sh","experimentLayerName":"KubernetesPodContainer","wasSelfHealingRequired":null},"env":"PROD","chaos-host":"bb8c3770c057@aws:i-086601e7b0a84aa54:257969174391:us-east-1"}
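
A log line like the one above is long, and only a few fields are usually of interest. The sketch below pulls out the experiment method, platform, target, state, and the cattle flag of the target container. The function name is hypothetical and the field names are taken from the log line shown above.

```python
import json


def summarize_experiment(line: str) -> dict:
    """Extract the main experiment fields from one Chaos Engine log line."""
    # Drop the 'service | ' prefix if present; otherwise treat the line as raw JSON.
    payload = line.partition("|")[2] or line
    record = json.loads(payload)
    experiment = record.get("experiment", {})
    container = experiment.get("container", {})
    return {
        "method": record.get("experimentMethod"),
        "platform": record.get("platform"),
        "target": record.get("host"),
        "state": experiment.get("experimentState"),
        "type": experiment.get("experimentType"),
        "cattle": container.get("cattle"),
    }
```

Applied to the line above, the summary would show memoryConsumer.sh running against a KubernetesPlatform pod whose cattle flag is true, which matches the requirement that shell experiments target cattle resources.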

This experiment spikes the memory usage of the container. You can check the memory utilization in your monitoring tool or on the nodes using the following command.

kubectl top nodes

Alternatively, you can SSH to the EKS nodes and run the top command to see the memory usage.

Here I can see in the Datadog logs that the experiment has finished.

Summary

Congratulations, you just ran your first Chaos Engine experiment! Feel free to run the experiment many times, varying parameters and seeing how that impacts your instance. A next step could be to add a full monitoring solution to the instance for better data (we recommend doing that with a non-production, non-vital server at this stage, should you go this route).

When you are done testing using the Amazon instance created for this article, don’t forget to terminate the Amazon instance.

Best wishes and have fun unleashing the Chaos Engine! :)

References