Our release process in detail

Designed to mitigate known problems

This article is also available in French.

In a previous article I briefly introduced our release process, putting a lot of emphasis on how we used it as a visual management tool.

In this article we’ll dive into the process itself, highlighting interesting elements.

What are the steps of the process?

To help Google index the content of the image, and to help those who don’t read French, here is an English transcription of the various steps of the workflow:

  1. Plan the release: define the scope of the release
  2. Estimate when we expect to push a release candidate to the staging environment, ready for verification by partners
  3. Communicate this estimated date of availability in the staging environment to the partners
  4. Work on individual tasks/User Stories and complete them
  5. Possibly discover during development that we must change the scope of the release
  6. If we have updated the scope of the release, re-estimate the date of availability in the staging environment and communicate it again to the partners (= go back to step 3)
  7. All developments are done! The scope of the release is complete
  8. Evaluate the risk of pushing the release to production
  9. Define the risk mitigation strategy
  10. Pin the versions of all the software dependencies
  11. Deploy the release candidate to the staging environment
  12. Check that the staging environment is healthy and that it contains the expected release candidate
  13. Go ahead and execute the risk mitigation strategy (the drawing is some dark humour about the word “execution”)
  14. When we’re done with our own non-regression testing, tell the partners that they can move on and run their own checks on the staging environment
  15. If bugs are found (whether by the team or by partners), fix them! But that means we are doing development again, so we’re back to step 3: we must keep the partners posted and go through all the risk mitigation and push-to-staging steps again
  16. Before moving on with the actual push to production, make sure we know how to roll back and that the rollback process/tool actually works
  17. Yeah! Everything is ready for pushing to prod
  18. Proceed with a partial push to prod
  19. Complete the release by pushing to prod for 100% of customers

As you can see, this is a high-level people process. Under the hood the developers also follow strict guidelines about how they handle branches, tags… But that’s not the point here.

Communicate and synchronize, as much as necessary

One recurring element of this process is the focus on communication, on making sure we are in sync with our partners. This had indeed been a real pain point in the past, so the process is designed to keep our communication crystal clear.

Don’t forget to communicate

We don’t like it but we have to, so thanks for the reminder.

Give them the information they need to make decisions

  • Our partners will need to book some people’s time to check our release in the staging environment, so things will go smoother if they know in advance when that will be needed.
  • What should they check? If we don’t share the changelog and the risk analysis, they will check everything.

A change of plan requires an update to make sure everybody’s on the same page

  • We’ve found a bug! Let them know there will be a delay, or that the bug is known so that they don’t test it again or waste time reporting it.
  • The availability of the release candidate on the staging environment will be delayed. Since they book some people’s time in advance, we want them to have this information so that they can update their own plans accordingly.
  • “It’s not available yet?” should never be heard. Our partners should know when it will be available. When the plan changes, we must share the new plan.

Follow the testing strategy order

I have explained in the previous section how important synchronisation with our partners was.

Basically, it’s all about defining who tests what. To move forward, we had to define everybody’s testing scope and apply these tests in the proper order. Some would call that defining a testing strategy.

The point is that it doesn’t make sense to send a release candidate to a partner for validation if we haven’t finished our own checks. We have seen this over and over: we are confident in our work (and sometimes in a hurry to be done with it), so we throw it to the partners before we have finished running our own non-regression tests. Virtually every time, one of the following two things happens:

  • We find issues that need to be fixed, but our partner has already wasted time validating the release candidate and will have to start all over once we have fixed the issues and deployed a new release candidate
  • Our partner finds bugs and we realize that we would have found them ourselves if we had finished running our own tests before asking the partners to start validating the release candidate

In both cases, it will degrade our relationship with our partners, as it sends the signal that we don’t value their time. One direct consequence is that it will be harder to book their time to validate our release candidates in a timely fashion. Which will in turn create a bottleneck in our release pipeline, thus slowing us down.

Trying to go faster by rushing and not following the process will lead to a slow-down. You simply cannot get ahead of yourself.

Smart risk mitigation

We try to avoid running a full regression testing session as much as possible.

So we gather the exhaustive list of things that have changed, commonly known as the changelog, and from there we try to guess what is at risk.

The goal is not to check that the new code works as expected; that has already been tested extensively by several people. Instead we want to check on the potential impacts of the new code. We call this a risk analysis.

From this risk analysis we devise the following (a small sketch follows the list):

  • The set of test scenarios that we shall run
  • Which devices/OS/browsers shall be checked
  • Whether we should proceed with an all-at-once release strategy or instead do one or several partial releases on a reduced scope to incrementally probe the changes in production
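
To make this concrete, here is a purely illustrative sketch in Python; the mapping, area names, and scenario names are invented for the example, and the article does not describe the team’s actual artifacts. The idea is simply that a risk analysis can be captured as a mapping from changed areas to the scenarios and devices they put at risk:

```python
# Purely hypothetical mapping: which regression scenarios and devices
# each changed area (taken from the changelog) puts at risk.
RISK_MAP = {
    "payment": {"scenarios": {"checkout", "refund"}, "devices": {"desktop", "ios"}},
    "search": {"scenarios": {"search", "autocomplete"}, "devices": {"desktop", "android"}},
}

def plan_tests(changed_areas):
    """Derive the test plan for a release from its changelog."""
    scenarios, devices = set(), set()
    for area in changed_areas:
        risk = RISK_MAP.get(area)
        if risk is None:
            # Unknown impact: fall back to a full regression session.
            return {"scenarios": {"full regression"}, "devices": {"all"}}
        scenarios |= risk["scenarios"]
        devices |= risk["devices"]
    return {"scenarios": scenarios, "devices": devices}

print(plan_tests(["payment"]))  # test only what the change puts at risk
```

The fallback to a full regression session mirrors the point above: an unknown impact is precisely what the risk analysis exists to avoid.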

Double-check the things that are known to fail

We have directly integrated into the release process all the known pitfalls of our current tooling. That is not to say that we would never fix those issues, but we knew they would not be fixed right away, so we had better learn to live with them for now.

Pin the versions of the dependencies

… Because so far, nothing prevented those dependencies from getting a version bump between the moment the release candidate was validated in the staging environment and the moment the release was pushed to production. Meaning that what we pushed to production was not what we had validated.

Scary thoughts.

And it actually happened with internal dependencies, mixing up production and experimental branches/versions of some specific components. Yeah, when you discover that, it feels like the walls are falling apart.

So, until we find better ways to handle the versions of the dependencies, we just make sure we pin them by hand.
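
The article does not say which technology stack or package manager the team used, so the following is only a sketch of the idea: assuming a Python project with a requirements.txt, a small pre-release check can refuse to proceed while any dependency is left unpinned.

```python
import re
import sys

# A dependency line counts as pinned only if it uses an exact version,
# e.g. "requests==2.18.4"; ranges such as ">=2.0" are rejected.
PINNED = re.compile(r"^[A-Za-z0-9._-]+==\S+$")

def unpinned_dependencies(path="requirements.txt"):
    bad = []
    with open(path) as f:
        for raw in f:
            line = raw.split("#")[0].strip()  # ignore comments and blanks
            if line and not PINNED.match(line):
                bad.append(line)
    return bad

if __name__ == "__main__":
    bad = unpinned_dependencies()
    if bad:
        print("Refusing to release, unpinned dependencies found:")
        print("\n".join(f"  {dep}" for dep in bad))
        sys.exit(1)
    print("All dependencies pinned: staging and production will run the same code.")
```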

Check the staging environment

… Because once in a while (and sometimes, many times in a row) the push-to-staging scripts just go wild and don’t work. And also because there are rough edges here and there, and it happens that things don’t go as expected.

As the staging environment is managed by an external “ops” team, fixing this issue is only partially in our hands. We can’t fix it right away. We have to live with it for now.

So, please make sure that everything is OK on the staging environment and that it actually contains the expected release candidate before proceeding with any testing.

Wasting time testing the previous version, and raising bugs because the new features don’t work, has actually happened to the team in the past.

Again, just do this check before saying that the staging environment is ready, period.
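
As a minimal sketch of such a check (the /version endpoint and URL are invented for the example; the team’s actual tooling is not described in the article): ask the staging environment which build it is running, and refuse to start testing if it is not the expected release candidate.

```python
import json
import sys
import urllib.request

# Invented endpoint: we assume the application exposes the version it runs.
STAGING_VERSION_URL = "https://staging.example.com/version"

def staging_runs(expected: str) -> bool:
    with urllib.request.urlopen(STAGING_VERSION_URL, timeout=10) as resp:
        deployed = json.load(resp)["version"]
    if deployed != expected:
        print(f"Staging runs {deployed}, expected {expected}: do NOT test yet.")
        return False
    print(f"Staging runs the expected release candidate {deployed}.")
    return True

if __name__ == "__main__":
    # Usage: python check_staging.py 2.3.0-rc1
    sys.exit(0 if staging_runs(sys.argv[1]) else 1)
```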

Make sure we can rollback if needed

… Because so far we had no actual rollback procedure in place (instead we built and deployed the previous version again), meaning that cancelling a new release took ages. Which meant there was a big penalty for breaking production.

On the other hand, if you are able to roll back in seconds, you can try bolder changes. Which is a very good thing, because on most occasions you’ll learn at least an order of magnitude more in production than you’ve learned before. Delaying a push to production “to make sure nothing is broken” is just wasted time when you are a web product: you’re missing opportunities while the cost of being wrong is very small.

Generally speaking, this step remained important even once we had implemented a rollback procedure: we still had to make sure we were able to roll back after changes to the deployment process or the build tooling.
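
The article does not detail the rollback procedure the team eventually built. One classic pattern that rolls back in seconds, sketched here purely as an assumption, keeps the last few builds side by side on disk and re-points a `current` symlink:

```python
import os
from pathlib import Path

# Assumed layout: every build is kept in its own timestamped directory and
# a `current` symlink points at the live one.
RELEASES_DIR = Path("/srv/app/releases")  # e.g. 2017-10-16T10-00, 2017-10-17T09-30
CURRENT_LINK = Path("/srv/app/current")

def rollback():
    releases = sorted(p for p in RELEASES_DIR.iterdir() if p.is_dir())
    live = CURRENT_LINK.resolve()
    older = [p for p in releases if p < live]
    if not older:
        raise RuntimeError("No previous release to roll back to.")
    previous = older[-1]
    # Re-point the symlink atomically: build a new link, then swap it in.
    tmp = CURRENT_LINK.with_suffix(".tmp")
    tmp.unlink(missing_ok=True)
    tmp.symlink_to(previous)
    os.replace(tmp, CURRENT_LINK)
    print(f"Rolled back: {live.name} -> {previous.name}")
```

Because only the symlink changes, rolling forward again is just as cheap, which is what makes bolder changes affordable.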

Do these important things, as many times as necessary

The team must display extreme rigor; without it, the team won’t be able to reach any level of excellence.

This can be seen in the process: both a scope change and a bug fix lead back to the beginning. Whatever happens, you have to go back and redo it all over again.

In particular, when you fix a bug you have to challenge the risk mitigation strategy again: we have changed more code, so are there any new impacts, and how do we handle them?

And again, make sure the versions of the dependencies are OK, and every time you deploy to the staging environment, make sure the staging environment is OK.

What does the process look like now?

As the team has matured a lot, parts of this process have become obsolete.

  • The focus on communication and synchronization with partners is becoming less and less of a concern, as the coupling with these partners has been drastically reduced.
  • Planning also became irrelevant as the team gradually moved to Kanban with a Continuous Release strategy: each User Story is pushed to production individually as soon as possible, instead of sets of User Stories being grouped into the same release. We still need to synchronize the several teams that share the same release pipeline; this is done by having all these teams share the same Kanban board for the release part.
  • Tooling has made it impossible to have different versions of the dependencies between the release candidate in the staging environment and the push to production.

On the other hand…

  • Right now risk mitigation is still at the heart of how we work, even though more and more tests are automated. Maybe someday we’ll just run the complete automated test suite instead of targeting what’s at risk? Likewise, device compatibility testing should become less and less necessary once test suites are automatically run against the set of targets that we officially support. But we’re not there yet.
  • The scripts to deploy to the staging environment are getting better and better but are still not considered foolproof.
  • We will always have to check that the rollback procedure works when changes are made to the deploy/build processes.
  • Partial push to production is obviously the way to go, but ideally we would do it in a much more refined way than what we have done so far; the ideal vision is seamless A/B testing (a small routing sketch follows this list).
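
Purely as an illustration of a partial push (none of this comes from the article): a common technique hashes each customer ID into a stable bucket, so that a configurable percentage of customers sees the new release, and each customer’s experience stays consistent while the rollout widens.

```python
import hashlib

def serves_new_release(customer_id: str, rollout_percent: int) -> bool:
    """Deterministically route `rollout_percent`% of customers to the new release."""
    digest = hashlib.sha256(customer_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return bucket < rollout_percent

# Widening the rollout only means raising the threshold: 1% -> 10% -> 100%.
for percent in (1, 10, 100):
    served = sum(serves_new_release(f"customer-{i}", percent) for i in range(10_000))
    print(f"{percent:>3}% target -> {served / 100:.1f}% actually served")
```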

Did you like this article?

Follow me to be notified of the other articles!

And don’t forget to clap 👏 and share the article! Don’t forget that it is your love 💓 that makes me put my heart and soul into writing.

Thank you 😄
