What we learnt by migrating from CircleCI to Buildkite

Why did we roll out a new CI system?

Choosing a new CI platform

  • The CI configuration should be trackable via source control. For example, with CircleCI we checked in a huge YAML file in our source code repositories.
  • The new CI system should allow us to get rid of YAML if possible. Why? And if not YAML, then what? (Wait till you read about it later in this blog post.)
  • We were mainly looking at self-hosted and semi-self-hosted CI platforms, because they seemed to align better with our cost requirements and give us more control.
  • Ability to run CI jobs on various platforms. We have a variety of workloads which get scheduled on targets such as Linux x86 and arm64 hosts, Docker containers, and macOS machines.
  • Autoscaling build nodes. We didn’t want our build nodes to be always running; instead, we wanted to bring them up and down based on demand.
  • Various triggering strategies, such as triggering a build for a commit on a branch or pull request (with pattern matching), triggering a build manually, and auto-cancelling triggered builds when newer commits land on the pull request.
  • Reasonable pricing, good support, and an active community.

Buildkite

First thing we did

One (unwritten) rule

resource "aws_iam_role" "build_node_role" {
  name = "BuildNodeRole"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action = "sts:AssumeRole"
      Effect = "Allow"
      Principal = {
        Service = "ec2.amazonaws.com"
      }
    }]
  })

  managed_policy_arns = [
    aws_iam_policy.build_agent_setup_secret_read.arn,
    aws_iam_policy.sample_secrets_read.arn,
    aws_iam_policy.ci_secrets_read.arn,
    aws_iam_policy.nomad_client_secrets_read.arn,
    aws_iam_policy.nomad_service_discovery.arn,
    aws_iam_policy.build_node_metadata_read.arn,
    aws_iam_policy.build_node_autoscale_helper.arn,
    aws_iam_policy.build_node_artifact_store_access.arn,
    aws_iam_policy.dockerhub_public_repo_read_token.arn,
    aws_iam_policy.hasura_bot_github_token_read.arn
  ]
}

CI Trigger labels

  1. We automatically shadow every pull request we receive in our graphql-engine OSS repository to the monorepo, and it is not safe to run CI on all of them by default (you can understand why by reading this blog post from GitHub).
  2. We saw a 30% drop in the number of builds when we explicitly asked people to add a label to their pull requests once they were ready for CI. So this was also a cost optimization.
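
A label gate of this kind can be sketched in a few lines of Go. This is only an illustration: the label name `ci: ready` and the function `shouldRunCI` are hypothetical, and the post does not describe how the actual check is implemented.

```go
package main

import "fmt"

// ciTriggerLabel is a hypothetical label name; the real one is not stated in the post.
const ciTriggerLabel = "ci: ready"

// shouldRunCI reports whether a pull request carries the trigger label.
// Pull requests without the label are skipped, which protects CI from
// untrusted changes and avoids paying for builds nobody asked for.
func shouldRunCI(labels []string) bool {
	for _, l := range labels {
		if l == ciTriggerLabel {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(shouldRunCI([]string{"docs", "ci: ready"}))
	fmt.Println(shouldRunCI([]string{"docs"}))
}
```

The check is deliberately an allow-list: a pull request runs CI only when someone with permission to label it has opted in.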

Dynamic pipelines

agents:
  queue: ops-node
steps:
  - label: ":pipeline: generate pipeline"
    key: "pipeline-gen"
    plugins:
      - hasura/smooth-secrets#v1.2.1:
          # sets up some secrets here
      - hasura/smooth-checkout#v2.0.0:
          config:
            - url: git@github.com:hasura/graphql-engine-mono.git
    command: ".buildkite/scripts/buildkite-upload.sh"

#!/bin/bash
set -eo pipefail

export BUILDKITE_PIPELINE_NO_INTERPOLATION=true
export PATH="$PATH:/usr/local/go/bin"

echo "--- :Buildkite: build information"
echo "BUILDKITE_BUILD_ID = $BUILDKITE_BUILD_ID"

echo "--- :shell: linting CI shell scripts"
shellcheck .buildkite/scripts/{.,**}/*.sh -P .buildkite/scripts

echo "--- :pipeline: generating pipeline"
cd .buildkite/pipeline-gen
go run main.go

echo "--- :pipeline_upload: uploading pipeline.json"
/usr/bin/buildkite-agent pipeline upload pipeline.json

Why Go?

steps:
  - command: echo "Checks out repo at given ref"
    plugins:
      - hasura/smooth-checkout#v1.1.0:
          clone_url: https://github.com/<username>/<reponame>
          ref: <ref>

steps:
  - command: echo "Checks out repo at given ref"
    plugins:
      - hasura/smooth-checkout#v2.0.0:
          config:
            - url: git@github.com:<username>/<reponame>.git
              ref: <ref>

package plugin

type SmoothCheckout struct {
	CloneURL     *string `json:"clone_url,omitempty"`
	Ref          *string `json:"ref,omitempty"`
	SkipCheckout *bool   `json:"skip_checkout,omitempty"`
}

func (SmoothCheckout) Name() string {
	return "hasura/smooth-checkout"
}

func (SmoothCheckout) Version() string {
	return "v1.1.0"
}

package plugin

type SmoothCheckoutConfig struct {
	URL string  `json:"url"`
	Ref *string `json:"ref,omitempty"`
}

type SmoothCheckout struct {
	Config       []SmoothCheckoutConfig `json:"config,omitempty"`
	SkipCheckout *bool                  `json:"skip_checkout,omitempty"`
}

func (SmoothCheckout) Name() string {
	return "hasura/smooth-checkout"
}

func (SmoothCheckout) Version() string {
	return "v2.0.0"
}

func makeAllStepsToBeRetryable(p *pipeline) {
	for idx, s := range p.Steps {
		cmdStep, ok := s.(step.Command)
		if ok {
			if cmdStep.Retry == nil {
				cmdStep.Retry = &step.CommandRetry{}
			}
			if cmdStep.Retry.Manual == nil {
				cmdStep.Retry.Manual = &step.ManualRetry{
					PermitOnPassed: util.Bool(true),
				}
			}
			p.Steps[idx] = cmdStep
		}
	}
}
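
One way a typed plugin could be turned into the `name#version` form that Buildkite expects is sketched below. The struct definitions repeat the v2.0.0 plugin shown above so the example is self-contained; the `Plugin` interface and the `marshalPlugin` helper are assumptions made for illustration, not code from the actual generator.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Plugin is a hypothetical interface matching the Name/Version methods above.
type Plugin interface {
	Name() string
	Version() string
}

type SmoothCheckoutConfig struct {
	URL string  `json:"url"`
	Ref *string `json:"ref,omitempty"`
}

type SmoothCheckout struct {
	Config       []SmoothCheckoutConfig `json:"config,omitempty"`
	SkipCheckout *bool                  `json:"skip_checkout,omitempty"`
}

func (SmoothCheckout) Name() string    { return "hasura/smooth-checkout" }
func (SmoothCheckout) Version() string { return "v2.0.0" }

// marshalPlugin renders a plugin as {"<name>#<version>": {...options...}},
// which is the shape of one entry in a step's "plugins" list.
func marshalPlugin(p Plugin) ([]byte, error) {
	key := p.Name() + "#" + p.Version()
	return json.Marshal(map[string]Plugin{key: p})
}

func main() {
	out, err := marshalPlugin(SmoothCheckout{
		Config: []SmoothCheckoutConfig{{URL: "git@github.com:example/repo.git"}},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

With this shape, a plugin upgrade that changes its options (such as the move from `clone_url` to `config`) becomes a compile-time error at every call site instead of a pipeline that fails at run time.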

Agent Queues

Autoscaling build nodes

  • Get the number of “waiting” jobs from the Buildkite API for a given agent queue.
  • Get the desired node count of the AWS autoscaling group in which build nodes for that agent queue are spun up.
  • If “waiting jobs count” is greater than “desired node count of AWS autoscaling group”, set a higher desired count, so that enough new nodes come up to handle the waiting jobs.
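
The decision in the last step can be sketched as a small pure function. This is an illustration under stated assumptions: it assumes one job per node, and the `maxNodes` cap is a hypothetical safety limit that the post does not mention. Fetching the two inputs from the Buildkite API and the autoscaling group is left out.

```go
package main

import "fmt"

// nextDesiredCount returns the desired node count to set on the autoscaling group.
// It only ever scales out; scaling in is assumed to be handled elsewhere.
// maxNodes is a hypothetical upper bound, not something described in the post.
func nextDesiredCount(waitingJobs, currentDesired, maxNodes int) int {
	// Assumption: each node runs one job at a time, so aim for one node per waiting job.
	target := waitingJobs
	if target > maxNodes {
		target = maxNodes
	}
	// Never lower the count here, even if the cap is below the current value.
	if target <= currentDesired {
		return currentDesired
	}
	return target
}

func main() {
	fmt.Println(nextDesiredCount(5, 2, 10))  // more waiting jobs than nodes
	fmt.Println(nextDesiredCount(1, 3, 10))  // enough nodes already
	fmt.Println(nextDesiredCount(50, 2, 10)) // capped by maxNodes
}
```

Keeping the decision separate from the API calls makes it easy to test and to run on a schedule for each agent queue.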

Artifacts

Secrets

Caching

  • We run most of the builds in Docker containers, so we definitely want to cache the Docker images used for running these CI jobs.
  • Any software programs (like command-line tools) that are going to be used during the build.
  • Software libraries installed by package managers like npm, cabal, etc.

Monitoring and debugging

Future

Hasura
