Continuous Testing

Last week I had the pleasure of being on a panel about Continuous Testing put on by Electric Cloud. There’s video of the discussion if you’re interested, here.

Additionally, I posted a blog post on the subject on the ClassDojo Engineering blog here.

For posterity, I’ll x-post here as well:

Continuous Testing at ClassDojo

We have thousands of tests and regularly deploy to production multiple times per day. This article is about all the crazy things we do to make that possible.

How we test our API

Our web API runs on node.js, but a lot of what we do should be applicable to other platforms.

On our API alone, we have ~2000 tests.

We have some wacky automated testing strategies:

  • We actually integration test against a database.

  • We actually do api tests via http and bring up a new server for each test and tear it down after.

  • We mock extremely minimally, because there’s little to gain performance-wise in our case, and we want to ensure that integrations work.

  • When we do more unit-type tests, it’s for the sake of convenience of testing a complicated detail of some component, and not for performance.

All ~2000 tests run in under 3 minutes.

It’s also common for us to run the entire suite dozens of times per day.

With that said, I usually do TDD when a defect is found, because I want to make sure that the missing test actually exposes the bug before I kill the bug. It’s much easier to do TDD at that point when the structure of the code is pretty unlikely to need significant changes.

We aim for ~100% coverage: We find holes in test coverage with the istanbul code coverage tool, and we try to close those holes. We’ve got higher than 90% coverage across code written in the last 1.5 years. Ideally we go for 100% coverage, but practically we fall a bit short of that. There are some very edge-case error-scenarios that we don’t bother testing because testing them is extremely time-consuming, and they’re very unlikely to occur. This is a trade-off that we talked about as a team and decided to accept.

We work at higher levels of abstraction: We keep our tests as DRY as possible by extracting a bunch of test helpers that we write along the way, including our api testing library verity.

Speeding up Builds

It’s critical that builds are really fast, because they’re the bottleneck in the feedback cycle of our development and we frequently run a dozen builds per day. Faster builds mean faster development.

We use a linter: We use jshint before testing, because node.js is a dynamic language. This finds common basic errors quickly before we bother with running the test suite.

We cache package repository contents locally: Loading packages for a build from a remote package repository was identified as a bottleneck. You could set up your own npm repository, or just use a cache like npm_lazy. We just use npm_lazy because it’s simple and works. We recently created weezer to cache compiled packages better as well for an ultra-fast npm install when dependencies have not changed.

We measured the real bottlenecks in our tests:

  1. Loading fixtures into the database is our primary bottleneck, so we load fixtures only once for all read-only tests of a given class. It’s probably not a huge surprise to people that this is our bottleneck given that we don’t mock-out database interaction, but we consider this an acceptable bottleneck given that the tests still run extremely fast.

  2. We used to use one of Amazon EC2’s C3 compute cluster instances for jenkins, but switched to a much beefier dedicated hosting server when my laptop could inexplicably run builds consistently faster than EC2 by 5 minutes.

We fail fast: We run full integration tests first so that the build fails faster. Developers can then run the entire suite locally if they want to.

We notify fast: Faster notification of failures/defects leads to diminished impact. We want developers to be notified ASAP of a failure so:

  • We bail on the build at the first error, rather than running the entire build and reporting all errors.
  • We have large video displays showing the test failure.
  • We have a gong sound effect when it happens.

We don’t parallelize tests: We experimented with test suite parallelization, but it’s really complicated on local machines to stop contention on shared resources (the database, network ports) because our tests use the database and network. After getting that to work, we found very little performance improvement, so we just reverted it for simplicity.

Testing Continues after Deployment

We run post-deployment tests: We run basic smoke tests post-deployment as well to verify that the deployment went as expected. Some are manually executed by humans via a service called Rainforest. Any of those can and should be automated in the future though.

We use logging/metrics/alerts: We consider logging and metrics to be part of on-going verification of the production system, so it should also be considered as part of our “test” infrastructure.

Our Testing Process

We do not have a testing department or role. I personally feel that that role is detrimental to developer ownership of testing, and makes continuous deployment extremely difficult/impossible because of how long manual testing takes. I also personally feel that QA personnel are much better utilized elsewhere (doing actual Quality Assurance, and not testing).

Continuous Testing requires Continuous Integration. Continuous Integration doesn’t work with Feature Branches. Now we all work on master (more on that here!).

We no longer use our staging environment. With the way that we test via http, integrated with a database, there isn’t really anything else that a staging server is useful for, testing-wise. The api is deployed straight to a small percentage ~5% of production users once it is built. We can continue testing there, or just watch logs/metrics and decide whether or not to deploy to the remaining ~95% or to revert, both with just the click of a button.

We deploy our frontends against the production api in non-user facing environments, so they can be manually tested against the most currently deployed api.

When we find a defect, we hold blameless post-mortems. We look for ways to eliminate that entire class of defect without taking measures like “more manual testing”, “more process”, or “trying harder”. Ideally solutions are automatable, including more automated tests. When trying to write a test for a defect we try as hard as possible to write the test first, so that we can be sure that the test correctly exposes the defect. In that way, we test the test.

Lessons learned:

  • Our testing practices might not be the best practices for another team. Teams should decide together what their own practices should be, and continually refine them based on real-world results.

  • Find your actual bottleneck and concentrate on that instead of just doing what everyone else does.

  • Hardware solutions are probably cheaper than software solutions.

  • Humans are inconsistent and less comprehensive than automated tests, but more importantly they’re too slow to make CD possible.

Remaining gaps:

  • We’re not that great at client/ui testing (mobile and web), mostly due to lack of effort in that area, but we also don’t have a non-human method of validating styling etc. We’ll be investigating ways to automate screenshots and to look for diffs, so a human can validate styling changes at a glance. We need to continue to press for frontend coverage either way.
  • We don’t test our operational/deployment code well enough. There’s no reason that it can’t be tested as well.
  • We have a really hard time with thresholds and anomaly detection and alerts on production metrics such that we basically have to watch a metrics display all day to see if things are going wrong.

The vast majority of our defects/outages come from these three gaps.