Long-lived environments

Jun 12, 2021

In the last ten years I’ve always worked in companies where there was more than one long-lived environment.

Why have multiple environments?

In my opinion, as companies started suffering from more and more bugs, they thought it was a good idea to add inspection steps to avoid them. At first this seems intelligent: we have more opportunities to detect errors.

But the problem is not detecting errors, it is how we do it: manually. Long-lived environments (apart from production) only exist because we want to verify functionality manually, and verifying something manually is slow.

To explain how I reached this conclusion, I will tell a story from one of my previous jobs.

The story

It’s a story full of mistakes, but we really learned a lot.
I was working in a company that realized one of its products did not have enough quality to be extended. It took several months to roll it out in another country. As the company was planning to expand to a lot of countries with that product, they decided to create a new one from scratch (perhaps this was the first mistake).
As always when devs create things from scratch, we thought that this time we would do it better, but as we didn’t know how to do it better we made the same mistakes. So our new fancy microservices architecture (it was a distributed monolith in the backend, and the frontend was a monolith) was full of bugs because we didn’t have tests. There were a lot of YAGNIs in the code just because the previous system had other requirements, and we used it as the source of what to do.

So at a certain point, when the project was delayed, we realized we had to change, and we made some hard decisions:

Cover the new code with tests, and help people understand why and how to test.
Use gitflow as our branching model (another big mistake).
As we had gitflow, create what we called the “Integration Environment”.
Start doing pre-merge code reviews (before merging to master). It was crazy because the code review happened after the story had been tested in isolation, so the changes made during the review forced QA to test it again once it was merged into dev.

I will focus on why we created the integration environment and what it was. The idea was to validate each story in an isolated environment so QA people could test it before it was merged into dev. After that, QA had to test it manually again once it was merged into dev, in order to create a release and deploy to production.

So we built this environment on Docker Swarm. It could create an isolated environment for each story branch, named US-XXXX, and these environments were killed when the story was merged into dev (dev was also an environment in our “Integration Environment”). As we were working with a microservices architecture, the system deployed into the same environment every microservice that had a branch with the same name (this happened automatically when a branch was created). The new environment also came with a web application to monitor its logs and status.
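
To make the branch-matching idea concrete, here is a minimal sketch in Python. It is not the code we had. The service names, the story number US-1234 and the fallback to dev for services without a story branch are all assumptions made for illustration.

```python
from typing import Dict, Set

# Assumption: services without a matching story branch run their dev branch.
DEFAULT_BRANCH = "dev"


def resolve_environment(
    story_branch: str, branches_by_service: Dict[str, Set[str]]
) -> Dict[str, str]:
    """Pick, for each microservice, the branch to deploy in the story's environment."""
    plan: Dict[str, str] = {}
    for service, branches in branches_by_service.items():
        # Use the story branch when the service has one with the same name,
        # otherwise fall back to the shared default branch.
        plan[service] = story_branch if story_branch in branches else DEFAULT_BRANCH
    return plan


if __name__ == "__main__":
    # Hypothetical repositories and the branches that exist in each of them.
    repos = {
        "orders": {"dev", "US-1234"},
        "billing": {"dev"},
        "frontend": {"dev", "US-1234", "US-2000"},
    }
    print(resolve_environment("US-1234", repos))
```

In this sketch the environment for US-1234 would run the story branch of orders and frontend, and the dev branch of billing.
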
Everyone in the company thought the environment was great, and that with these things in place we would be able to create a product with more quality, but we started suffering some new problems with this approach.

We were deploying to production every two weeks, when the sprint finished, because we thought “this is how to work in Scrum”. As we didn’t have enough QAs (the bottleneck) to validate the stories created by the team, the end of the sprint was always a crazy time. It was full of stress: people had been developing user stories for several days, and then we had crazy conflicts trying to merge them. The next sprint was always full of tasks to stabilize the previous sprint’s release.

We thought it was a good idea to introduce quality into our product through more inspection steps (mapped to more environments), and then this model faced reality. We didn’t have enough QAs to validate user stories, and we had a lot of conflicts because we had long-lived branches. QAs had to do crazy stuff to test some features because the environments did not have enough data (it was not production).

Everything in software architecture is a trade-off.

First Law of Software Architecture

Notice that we accepted some trade-offs with that way of working, but we were not able to pay the bill. We were not able to manage the complexity that long-lived branches introduced inside the team.

In this story, the Integration Environment only existed because we wanted to do gitflow and test things before merging them into dev. In fact, we were trying to solve a quality problem through a process focused on manual validation of the work done.

Put complexity in the code

The integration environment had a lot of complexity: there was a lot of code in place, and we needed 5 machines to create a Docker Swarm cluster. We needed to maintain the environment, and we had to learn a lot about Crane, Docker and other technologies to make it work, just because we didn’t realize this could be managed inside the application. There was even a team, “The Platform” team (my team), in charge of the environment. Another mistake: we created a silo for managing an environment used mostly by devs.

Why not have these capabilities inside the application and avoid all this complexity managed by teams other than the development team?

We realized that all our problems with merges existed because of long-lived branches, that our integration environment existed because of long-lived branches, that our lack of QAs existed because of long-lived branches.

So why not simplify everything and put things in the code? It’s true that the code will be more complicated, but code is easier to change than teams, processes and roles.

We started thinking about moving to trunk-based development (CI/CD):

We created a pipeline to deploy to only one environment when things were merged to master.
We put in place a CI server to validate our merges and releases.
We encouraged people to make small, tested commits and push directly to master with high frequency.
We gave devs the responsibility to deploy to production whenever they wanted.
Devs learned to use techniques like feature toggles or dark launches to be able to push unfinished work to master frequently.
Tests were written by devs, and QAs started working on giving people tools to test better.
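
As an illustration of the last two techniques, here is a minimal sketch of a feature toggle combined with a dark launch, written in Python. Everything in it is hypothetical: the toggle names, the in-memory `TOGGLES` store and the shipping cost functions are invented for the example and are not what we had in that project.

```python
import logging
from typing import Dict

logger = logging.getLogger("dark_launch")

# Hypothetical in-memory toggle store. A real system might read these values
# from configuration or from a toggle service instead.
TOGGLES: Dict[str, bool] = {"new_shipping": False, "new_shipping_dark": True}


def is_enabled(name: str) -> bool:
    """Unknown toggles are treated as off, so unfinished code stays hidden."""
    return TOGGLES.get(name, False)


def old_shipping_cost(items: int) -> int:
    """Current behaviour: flat rate, in cents."""
    return 500


def new_shipping_cost(items: int) -> int:
    """Unfinished work already merged to master: base rate plus a fee per item."""
    return 300 + 50 * items


def shipping_cost(items: int) -> int:
    # Feature toggle: users only see the new path once it is switched on.
    if is_enabled("new_shipping"):
        return new_shipping_cost(items)

    result = old_shipping_cost(items)

    # Dark launch: run the new path in production without exposing its result,
    # and only log how it differs from the current behaviour.
    if is_enabled("new_shipping_dark"):
        try:
            shadow = new_shipping_cost(items)
            if shadow != result:
                logger.warning("shipping mismatch: old=%s new=%s", result, shadow)
        except Exception:
            # A failure in the dark path must never break the real request.
            logger.exception("dark launch of new_shipping failed")

    return result
```

With both paths living in master, switching `new_shipping` on is a configuration change rather than a merge, which is what makes small, frequent commits to master safe.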

So our problems disappeared: no more stress, no more crazy merges, no more QAs full of manual work. It is true people needed to learn a lot, but I think that’s just cheaper than bleeding every sprint. Why not solve problems by following better engineering practices rather than through processes?

In fact, we realized that our product is not only full of requirements from product, but also needs to contain quality inside, so it needs to support that quality. Quality introduces requirements in your product.

So is having multiple environments bad? I would not say that, but I think it’s a smell. You cannot count complexity just in the code; complexity also comes from your platform, your culture, your architecture. Sometimes simplifying those things and moving them inside the dev team’s codebase is just easier and cheaper for the organization.

The only environment like production is production. Go to production faster, and create fewer environments (even just production?).

El Substack de Javier López Fernández