At Google Cloud, we strive to bring Site Reliability Engineering (SRE) culture to our customers not only through training on organizational best practices, but also with the tools you need to run successful cloud services. Part and parcel of that is comprehensive observability tooling—logging, monitoring, tracing, profiling and debugging—which can help you troubleshoot production issues faster, increase release velocity and improve service reliability.
We often hear that implementing observability is hard, especially for complex distributed applications that are implemented in different programming languages, deployed in a variety of environments, that have different operational costs, and many other factors. As a result, when migrating and modernizing workloads onto Google Cloud, observability is often an afterthought.
Nevertheless, being able to debug the system and gain insights into the system’s behavior is important for running reliable production systems. Customers want to learn how to instrument services for observability and implement SRE best practices using tools Google Cloud has to offer, but without risking production environments. With Cloud Operations Sandbox, you can learn in practice how to kickstart your observability journey and answer the question, “Will it work for my use-case?”
Cloud Operations Sandbox is an open-source tool that helps you learn SRE practices from Google and apply them on cloud services using Google Cloud’s operations suite (formerly Stackdriver). Cloud Operations Sandbox has everything you need to get started in one click:
Demo service – an application built using microservices architecture on modern, cloud-native stack (a modified fork of a Online Boutique microservices demo app)
One-click deployment – automated script that deploys and configures the service to Google Cloud, including:
Service Monitoring configuration
Tracing with OpenTelemetry
Cloud Profiling, Logging, Error Reporting, Debugging and more
Load generator – a component that produces synthetic traffic on the demo service
SRE recipes – pre-built tasks that manufacture intentional errors in the demo app so you can use Cloud Operations tools to find the root cause of problems like you would in production
An interactive walkthrough to get started with Cloud Operations
Launching the Cloud Operations Sandbox is as easy as can be. Simply:
Go to cloud-ops-sandbox.dev
Click on the “Open in Google Cloud Shell” button.
This creates a new Google Cloud project. Within that project, a Terraform script creates a Google Kubernetes Engine (GKE) cluster and deploys a sample application to it. The microservices that make up the demo app are pre-instrumented with logging, monitoring, tracing, debugging and profiling as appropriate for each microservices language runtime. As such, sending traffic to the demo app generates telemetry that can be useful for diagnosing the cloud service’s operation. In order to generate production-like traffic to the demo app, an automated script deploys a synthetic load generator in a different geo-location than the demo app.
It also adds and automatically configures uptime checks, service monitoring (SLOs and SLIs), log-based metrics, alerting policies and more.
At the end of the provisioning script you’ll get a few URLs of the newly created project:
You can follow the user guide to learn about the entire Cloud Operations suite of tools, including tracking microservices interactions in Cloud Trace (thanks to the OpenTelemetry instrumentation of the demo app) and see how to apply the learnings to your scenario.
Finally, to remove the Sandbox once you’re finished using it, you can run
Following SRE principles is a proven method for running highly reliable applications in the cloud. We hope that the Cloud Operations Sandbox gives you the understanding and confidence you need to jumpstart your SRE practice.