Google Cloud Compute Engine Experiments

SDK

The GCP Compute Engine module uses the Official GCP Compute SDK for Java from Google.

Version

All Google SDKs are included via the Google Cloud libraries-bom Maven package. The current version of the package is 10.1.0.

Configuration

| Key Name | Description | Default | Mandatory |
| --- | --- | --- | --- |
| gcp.compute | The presence of this key enables the module. | N/A | Yes |
| gcp.json-key | The JSON Key of the Service Account the module is to use. | N/A | Yes |
| gcp.project-id | Controls which GCP Project the module will experiment on. | N/A | Yes |
| gcp.compute.include-filter.<metadata-key-name> | Used for filtering the inclusion of GCP Compute Engine Instances based on the presence of a specific key/value pair of Metadata. See Filtering for more information. | N/A | No |
| gcp.compute.exclude-filter.<metadata-key-name> | Used for filtering the exclusion of GCP Compute Engine Instances based on the presence of a specific key/value pair of Metadata. See Filtering for more information. | N/A | No |
| gcp.compute.routableCidrBlocks | A comma-separated list of private CIDR Blocks that should be considered Routable for SSH Access. | N/A | No |

Credential Sharing Across GCP Modules

Enabled GCP Modules share credentials amongst each other.

Required Permissions

Each experiment below lists the specific API calls it makes. These API calls map 1-to-1 with individual IAM permissions.

If you do not want to maintain a custom role for Chaos Engine, the roles/editor role can be used instead. Be aware that this role contains many powerful permissions that the Chaos Engine does not need in order to operate.

Node Discovery

Nodes are discovered using the Compute instances.list API. Results from all zones are parsed and converted into Java objects.

Filtering

The GCP Compute platform supports both inclusive and exclusive filtering based on instance metadata key/value pairs. If any include-filters are specified, all of them must exist in the metadata of the instance. Similarly, if any exclude-filters are specified, none of them may exist in the metadata of the instance.

The filter values are case-sensitive.
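The filtering rule above can be sketched in Java as follows. This is a minimal illustration, not the module's actual implementation: the class name MetadataFilter and the method matches are hypothetical, and instance metadata is assumed to be available as a simple map of key to value.

```java
import java.util.Map;

// Hypothetical helper; the name and signature are assumptions made for illustration.
final class MetadataFilter {
    private MetadataFilter() {}

    /**
     * Returns true when the metadata contains every include pair
     * and none of the exclude pairs. String.equals keeps the
     * comparison case-sensitive.
     */
    static boolean matches(Map<String, String> metadata,
                           Map<String, String> includeFilter,
                           Map<String, String> excludeFilter) {
        boolean allIncluded = includeFilter.entrySet().stream()
                .allMatch(e -> e.getValue().equals(metadata.get(e.getKey())));
        boolean noneExcluded = excludeFilter.entrySet().stream()
                .noneMatch(e -> e.getValue().equals(metadata.get(e.getKey())));
        return allIncluded && noneExcluded;
    }
}
```

With empty include and exclude maps, every instance matches, which corresponds to running with no filters configured.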

Self Awareness

The GCP Compute platform uses the Google Cloud Instance Metadata Server to discover its own Google Cloud Resource ID. Instances are evaluated against that resource ID and removed from the pool of potential experiments.

SSH Experiment Support

API: Compute instances.setMetadata, Compute instances.get, Compute zoneOperations.get

A Google Compute instance is considered SSH Accessible if it either has a Public NAT on nic0, or if the private address on nic0 is a member of the CIDR Blocks in the gcp.compute.routableCidrBlocks configuration parameter.
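As a rough sketch of the accessibility rule above, the check could look like the following. The class CidrCheck and its methods are hypothetical names, only IPv4 is handled, and the routable blocks are assumed to arrive as the raw comma-separated string from gcp.compute.routableCidrBlocks.

```java
// Hypothetical helper; names are assumptions and only IPv4 is supported.
final class CidrCheck {
    private CidrCheck() {}

    /** Parses a dotted-quad IPv4 address into a 32-bit integer. */
    static int toInt(String ipv4) {
        String[] parts = ipv4.trim().split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("Not an IPv4 address: " + ipv4);
        }
        int value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("Bad octet in: " + ipv4);
            }
            value = (value << 8) | octet;
        }
        return value;
    }

    /** Returns true when the address falls inside a block such as 10.0.0.0/8. */
    static boolean inBlock(String address, String cidr) {
        String[] parts = cidr.trim().split("/");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Not a CIDR block: " + cidr);
        }
        int prefix = Integer.parseInt(parts[1]);
        if (prefix < 0 || prefix > 32) {
            throw new IllegalArgumentException("Bad prefix length in: " + cidr);
        }
        // Shifting an int by 32 is a no-op in Java, so /0 is handled explicitly.
        int mask = (prefix == 0) ? 0 : (-1 << (32 - prefix));
        return (toInt(address) & mask) == (toInt(parts[0]) & mask);
    }

    /** Public NAT on nic0, or a private nic0 address inside a routable block. */
    static boolean isSshAccessible(boolean hasPublicNat,
                                   String privateAddress,
                                   String routableCidrBlocks) {
        if (hasPublicNat) {
            return true;
        }
        if (routableCidrBlocks == null || routableCidrBlocks.trim().isEmpty()) {
            return false;
        }
        for (String cidr : routableCidrBlocks.split(",")) {
            if (inBlock(privateAddress, cidr)) {
                return true;
            }
        }
        return false;
    }
}
```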

When connecting to an instance, Chaos Engine creates a new SSH key and appends the public key to the ssh-keys metadata field of the instance. The private key is never transmitted outside of Chaos Engine. The Compute instances.get API retrieves the existing metadata, the ssh-keys field is altered, and the result is written back with the Compute instances.setMetadata API. The previous metadata fingerprint is sent with the update for consistency purposes, allowing one retry.

The Compute zoneOperations.get API is polled until the setMetadata operation is complete. A verification is made and then an SSH Connection is initiated.
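The read, alter, and write-back cycle described above might be sketched like this. The MetadataStore interface is purely hypothetical and stands in for the instances.get and instances.setMetadata calls; it is not part of the GCP SDK. The sketch also assumes that ssh-keys holds one "username:public-key" entry per line.

```java
// Hypothetical sketch; none of these names come from the GCP SDK.
final class SshKeyAppender {
    private SshKeyAppender() {}

    /** Stand-in for the instance metadata calls (an assumption). */
    interface MetadataStore {
        /** Current metadata fingerprint, as returned by a get call. */
        String fingerprint();

        /** Current value of the ssh-keys field, or null when unset. */
        String sshKeys();

        /** Returns false when the fingerprint is stale and the write is rejected. */
        boolean setSshKeys(String value, String fingerprint);
    }

    /** Appends one "username:public-key" line to the existing value. */
    static String appendEntry(String existing, String username, String publicKey) {
        String entry = username + ":" + publicKey.trim();
        if (existing == null || existing.trim().isEmpty()) {
            return entry;
        }
        return existing.endsWith("\n") ? existing + entry : existing + "\n" + entry;
    }

    /** Tries the update, re-reading the metadata for a single retry on rejection. */
    static boolean addKey(MetadataStore store, String username, String publicKey) {
        for (int attempt = 0; attempt < 2; attempt++) {
            String fingerprint = store.fingerprint();
            String updated = appendEntry(store.sshKeys(), username, publicKey);
            if (store.setSshKeys(updated, fingerprint)) {
                return true;
            }
        }
        return false;
    }
}
```

Re-reading the fingerprint before the retry means a concurrent metadata change by another writer is preserved rather than overwritten.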

Experiment Methods

Simulate Maintenance Event

Google Cloud Compute regularly performs Maintenance Events on its physical hosts, for operations such as kernel upgrades or hardware maintenance. When it performs these tasks, it takes action on all Virtual Machines on that host. A Compute Engine VM may be live-migrated to another host, or it may be terminated and recreated (requires configuration). This experiment validates that no unforeseen problems occur as a result of these maintenance events, which may happen at any time.

Mechanism

API: Compute instances.simulateMaintenanceEvent

The simulateMaintenanceEvent API is called against the Instance UUID. This operation performs the real action of either live-migrating or replacing the VM.

Health Check

API: Compute zoneOperations.get

The operation status is retrieved and checked. The experiment is considered finished when the operation reports Progress >= 100.
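A minimal sketch of checking an operation against the Progress >= 100 condition is shown below. The class OperationPoller is a hypothetical name, and the IntSupplier stands in for a zoneOperations.get call that returns the progress field. The bounded loop and fixed delay are assumptions made for illustration, not a description of how Chaos Engine schedules its health checks.

```java
import java.util.function.IntSupplier;

// Hypothetical helper; the supplier stands in for zoneOperations.get.
final class OperationPoller {
    private OperationPoller() {}

    /**
     * Polls the progress value up to maxPolls times, waiting delayMillis
     * between polls. Returns true once progress reaches 100 or more.
     */
    static boolean awaitCompletion(IntSupplier progress, int maxPolls, long delayMillis) {
        for (int poll = 0; poll < maxPolls; poll++) {
            if (progress.getAsInt() >= 100) {
                return true;
            }
            try {
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt for the caller and stop polling.
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }
}
```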

Self Healing

Because this is an entirely cloud-managed operation, Self Healing is not possible. Once the operation has been started, it cannot be stopped.

Stop Instance

Instances that are not part of instance groups can be stopped. A VM can be configured to be automatically restarted if it is in an unexpected stopped state, but this takes time to recognize and accomplish.

Mechanism

API: Compute instances.stop

The stop API is called against the instance.

Health Check

API: Compute zoneOperations.get, Compute instances.get

The stop operation is polled until it completes. The instance is then fetched with the get API and its status field is checked. If the status is not "RUNNING", the experiment is still in progress. If the get call returns an HTTP 404, the instance no longer exists and the experiment is considered failed.
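The decision described above can be summarized in a small sketch. The class InstanceHealthCheck and the State enum are hypothetical names. An empty Optional is used here to represent the HTTP 404 case, which is an assumption about how the caller would surface a missing instance.

```java
import java.util.Optional;

// Hypothetical helper; names and the Optional convention are assumptions.
final class InstanceHealthCheck {
    private InstanceHealthCheck() {}

    enum State { IN_PROGRESS, FINISHED, FAILED }

    /**
     * @param operationDone  whether the polled operation has completed
     * @param instanceStatus status from instances.get, empty when the call returned 404
     */
    static State evaluate(boolean operationDone, Optional<String> instanceStatus) {
        if (!operationDone) {
            return State.IN_PROGRESS;
        }
        if (!instanceStatus.isPresent()) {
            // The instance no longer exists.
            return State.FAILED;
        }
        return "RUNNING".equals(instanceStatus.get()) ? State.FINISHED : State.IN_PROGRESS;
    }
}
```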

Self Healing

API: Compute instances.start

The start API is called to self-heal the instance after the experiment duration.

Reset Instance

Instances that are not part of instance groups can be reset. The VM is restarted, and temporarily unavailable. If the startup sequence is extensive, it may result in full application stack issues.

Mechanism

API: Compute instances.reset

The reset API is called against the instance.

Health Check

API: Compute zoneOperations.get, Compute instances.get

The reset operation is polled until it completes. The instance is then fetched with the get API and its status field is checked. If the status is not "RUNNING", the experiment is still in progress. If the get call returns an HTTP 404, the instance no longer exists and the experiment is considered failed.

Self Healing

API: Compute instances.start

The start API is called to self-heal the instance after the experiment duration.

Recreate Instance in Instance Group

Recreating an instance that is part of an instance group replaces and reinitializes a VM. This operation is similar to how Google Cloud will heal an instance that is failing health checks. This experiment may find errors in how an instance group behaves when it is below capacity by one instance, or issues with rolling out an old image after an update in opportunistic mode has started.

Mechanism

API: Compute instanceGroupManagers.recreateInstances

The recreateInstances API is called against the managed instance group, passing the specific instance as a parameter.

Health Check

API: Compute zoneOperations.get, Compute instanceGroups.get, Compute instanceGroupManagers.get, Compute regionInstanceGroups.get, Compute regionInstanceGroupManagers.get

The operation status is retrieved and checked. If the operation is complete, the instance group size and target size are also retrieved via the (region)InstanceGroups and associated Managers APIs. If the target size and actual size are equal, the instance group manager has properly resolved the capacity.

Self Healing

Because this is an entirely cloud-managed operation, Self Healing is not possible. Once the operation has been started, it cannot be stopped.

Finalization

After the experiment is finished, Chaos Engine will perform a Compute instances.get operation against the specific instance name it experimented upon. While most other metadata will remain the same, the new version of the instance will have a new unique identifier that needs to be updated locally for future experiments.

Remove Firewall Tags from Instance

Removing firewall tags from an instance will block the flow of traffic into the instance. This can simulate a network availability issue for the one specific instance. This experiment is only applicable to instances that are not part of instance groups.

Mechanism

API: Compute instances.get, Compute instances.setTags

The latest tags and their fingerprint are fetched using the Compute instances.get API. The fingerprint is used along with an empty set of tags in the Compute instances.setTags API.

Health Check

API: Compute zoneOperations.get, Compute instances.get

The operation status from the original setTags operation is checked. If the operation is completed, the instance data is retrieved and the contents of the tags are compared. If the original tags that were fetched during the experiment startup match the current tags, the experiment is considered complete. Tag order is irrelevant in this comparison.
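An order-independent tag comparison such as the one described above can be sketched with sets. The class TagComparison is a hypothetical name, and the tags are assumed to be available as plain string collections.

```java
import java.util.Collection;
import java.util.HashSet;

// Hypothetical helper; the name is an assumption made for illustration.
final class TagComparison {
    private TagComparison() {}

    /** Returns true when both collections hold the same tags, in any order. */
    static boolean sameTags(Collection<String> original, Collection<String> current) {
        return new HashSet<>(original).equals(new HashSet<>(current));
    }
}
```

Converting to sets also ignores duplicate entries, which is acceptable under the assumption that a tag list does not rely on repeated values.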

Self Healing

API: Compute instances.get, Compute instances.setTags

The new tag fingerprint is retrieved using Compute instances.get, and the original tags from setup are pushed back using Compute instances.setTags.