We caught some time with Devoteam Head of DevOps Graham Zabel to talk about changes he’s seeing in DevOps, the challenges of transformation at scale and Devoteam’s recent work with TUI Digital.
In the large organisations you work with, what have been the big changes you’ve seen with DevOps over the last few years and how would you characterize where we are now?
It sounds obvious, but the move to the cloud and infrastructure-as-code are big changes; changes that most large companies are only beginning to come to grips with. There are pockets of excellence all around, but making this transformation at scale brings different challenges, such as skills shortages and the tension between autonomy and control.
Another important new change is the move to integrated pipelines. We are moving to a world where CI/CD pipelines become commodity items running on cloud platforms. You check in code and the rest of the CI/CD process just works, ending with the deployment of the newly built code into a Kubernetes cluster.
Cloud and GitOps pipelines are entirely integrated. AWS, Azure and GCP all have integrated CI/CD pipelines. GitLab CI has end-to-end ‘GitOps’ integration. So why build your own pipeline? The days of multiple tools from multiple vendors that require multiple integrations with multiple plugins are coming to an end. And hence a lot of the work that was done by traditional DevOps engineers is also coming to an end.
In that respect the industry is maturing. The next step is to become more mature with metrics and measuring. Business value is the most important metric to measure, and possibly the hardest to capture and understand. If you can measure increased income, or customers, or share price, or sales, or clicks, great. That will be an obvious way to prove the value of what you are delivering. But what if you are a platform team building an internal service, say IP Address Management or Database Admin? How do you prove your value? Do you count the number of tickets you resolve, the number of Jiras you complete? The ability to measure the business value of what you are producing will help you understand where to spend your money and resources. We are still learning how to do that.
Another challenge that we are only beginning to understand is the role of the SRE. What does an SRE do? What are the roles and responsibilities? Is this something new, or a new name for something we’ve always done? Do we hire new staff, retrain existing staff? How do we train them? What happens to traditional Ops? Who takes the pager this weekend?
What are the hurdles to scaling up DevOps in these enterprise-level organisations and what are your suggested approaches to overcoming them?
Finding the resources to help achieve these transformations at scale isn’t easy. Those with the skills are in high demand. The education system is struggling to keep up with the speed of change and to train people in these new skills. Expensive coding bootcamps for software development retraining are 10 times over-subscribed.
Those who have been in IT for a long time are worried about whether their skills will be relevant in the cloud-native world. There are a lot of new things to learn and change is happening fast. Many traditional processes, such as providing DNS names or backing up and restoring databases, are easily automated in a cloud-native world, so those specialist skills will not be needed nearly as much. This is an example of the Pareto Principle, where 80% of tasks are mundane and will be automated, and humans will work on the 20% that require specialist knowledge. The specialists can then focus on work that is not easily automated, such as tuning complex databases.
One of the ways we at Devoteam try to overcome these hurdles is by re-training our internal staff. If people are interested in DevOps and the Cloud, we encourage that and try to get them learning the cloud basics quickly. Focusing on getting certifications (and keeping them up to date) is important.
We need to get better at measuring and monitoring. The ability to measure is so important. You need to be able to measure the value of delivering your product and the velocity at which you are delivering it. If you go slow, your competitors will outrun you. In order to go faster, you need to understand where your bottlenecks are. DORA’s four key DevOps metrics help you measure velocity and understand where there are issues and where velocity can be increased. In fact, this is a key role for the SRE: to increase the velocity of development.
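To make the four key metrics concrete (deployment frequency, lead time for changes, change failure rate and time to restore service), here is a minimal Python sketch. The `Deployment` record, its field names and the sample data are assumptions made for illustration, not a format used by DORA or any particular tool; real pipelines would pull these timestamps from their CI/CD and incident systems.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    # Hypothetical record shape, assumed for this sketch.
    committed_at: datetime                   # when the change was committed
    deployed_at: datetime                    # when it reached production
    failed: bool = False                     # did it cause a production failure?
    restored_at: Optional[datetime] = None   # when service was restored, if it failed


def dora_metrics(deployments: list[Deployment], period_days: int) -> dict:
    """Compute the four key metrics over one reporting period."""
    if not deployments:
        raise ValueError("need at least one deployment")
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    restore_times = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deployments_per_day": len(deployments) / period_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }


# Invented sample data: four deployments in one week, one of which failed.
t0 = datetime(2024, 1, 1, 9, 0)
sample = [
    Deployment(t0, t0 + timedelta(hours=4)),
    Deployment(t0 + timedelta(days=1), t0 + timedelta(days=1, hours=10),
               failed=True, restored_at=t0 + timedelta(days=1, hours=11)),
    Deployment(t0 + timedelta(days=2), t0 + timedelta(days=2, hours=6)),
    Deployment(t0 + timedelta(days=3), t0 + timedelta(days=3, hours=8)),
]
metrics = dora_metrics(sample, period_days=7)
print(metrics)
```

Medians are used rather than means so that one unusually slow change does not dominate the picture; that is a design choice for this sketch, not a requirement of the metrics themselves.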
Measuring the business value of what you are delivering will be harder. It is important to figure out how to do this. Measuring usage is a good start and sometimes a good proxy for business value. How many people are using my system or platform? How many hits to my APIs? How many downloads of my app? How many transactions processed? But the business may want more than this. How much does this cost? How much am I saving?
Measuring at scale, across a portfolio of products say, presumably requires some sort of consistency. Or does it? Does every product report metrics to a central metrics collector? Do we insist that products provide a ‘metrics’ endpoint for querying? Do we publish metrics as events? Our advice is to start small. Start with a few DevOps-mature product teams. Get them to show how business value metrics can be collected and analysed. Find low-cost solutions to metrics collection that work and that can scale. But at the very beginning, Excel spreadsheets and graphs are good enough. Being able to analyse the data is most important. Proving that the correct data gets captured must happen first. Automating the collection of that data can follow, once its worth is understood.
The historical trends of these metrics matter too, so the data needs to be stored over time. How should this data be captured and stored? These questions are less important for one product, or a small company, but become very important at scale, with products and teams fighting for scarce resources, both people and money.
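As one possible low-cost starting point in the spirit of "spreadsheets are good enough", the sketch below appends dated usage samples to a CSV store that Excel can open, then reads back the history for one metric and computes its change over the period. The column layout, the product names (`booking-api`, `mobile-app`) and the metric names are hypothetical, and an in-memory stream stands in for a real file.

```python
import csv
import io
from datetime import date

# Assumed column layout for this sketch: one row per sample.
FIELDS = ["day", "product", "metric", "value"]


def append_sample(stream, day: date, product: str, metric: str, value: float) -> None:
    """Append one dated sample as a CSV row."""
    csv.writer(stream).writerow([day.isoformat(), product, metric, value])


def load_history(stream, product: str, metric: str) -> list:
    """Return (day, value) pairs for one product metric, oldest first."""
    rows = csv.DictReader(stream, fieldnames=FIELDS)
    return sorted(
        (date.fromisoformat(r["day"]), float(r["value"]))
        for r in rows
        if r["product"] == product and r["metric"] == metric
    )


def change_over_period(history: list) -> float:
    """Relative change between the oldest and newest sample."""
    if len(history) < 2 or history[0][1] == 0:
        raise ValueError("need two samples and a non-zero starting value")
    first, last = history[0][1], history[-1][1]
    return (last - first) / first


# In-memory stand-in for a file; with a real file, open it with newline="".
store = io.StringIO()
append_sample(store, date(2024, 1, 1), "booking-api", "api_hits", 1000)
append_sample(store, date(2024, 2, 1), "booking-api", "api_hits", 1200)
append_sample(store, date(2024, 2, 1), "mobile-app", "downloads", 300)

store.seek(0)
history = load_history(store, "booking-api", "api_hits")
growth = change_over_period(history)
print(f"api_hits changed by {growth:.0%}")
```

Once teams agree that these are the right numbers to capture, the same row format could be fed by an automated collector instead of being entered by hand.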
If as a large organisation you haven’t started your transformation journey to Agile and DevOps, where is a good place to begin?
Scaling is hard to control, so large enterprises wrap lots of controls around processes when they scale. Mass migration to the cloud requires controls. The controls become bottlenecks. We slow down when we are trying to speed up.
Start small and learn. Learn how to measure the value and velocity of what you are delivering. Learn how to map your value stream. Discover the working states, and the waiting states. When you get good at this at small scale, you are then in a much better state to begin scaling. You have a working example that can be demonstrated to the wider organisation. You have begun to build expertise in cloud native development and transforming teams and products.
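A mapped value stream can be summarised with a simple ratio, often called flow efficiency: time spent in working states divided by total elapsed time. The sketch below assumes an invented set of stages and durations purely for illustration; the stage names and hours are not taken from any real team.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    hours: float
    waiting: bool  # True for a waiting state, False for a working state


def flow_efficiency(stages: list) -> float:
    """Share of total elapsed time spent in working states."""
    total = sum(s.hours for s in stages)
    working = sum(s.hours for s in stages if not s.waiting)
    return working / total


def longest_wait(stages: list):
    """The waiting state that consumes the most time, or None if there is none."""
    return max((s for s in stages if s.waiting), key=lambda s: s.hours, default=None)


# Hypothetical value stream for one change, in hours.
stream = [
    Stage("develop", 16, waiting=False),
    Stage("awaiting review", 24, waiting=True),
    Stage("review", 2, waiting=False),
    Stage("awaiting release approval", 116, waiting=True),
    Stage("deploy", 2, waiting=False),
]

efficiency = flow_efficiency(stream)
bottleneck = longest_wait(stream)
print(f"flow efficiency: {efficiency:.1%}, longest wait: {bottleneck.name}")
```

With these made-up numbers only 20 of 160 hours are working time, which points attention at the longest waiting state rather than at the development stage.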
Devoteam have recently worked with TUI Digital on their transformation. What were the biggest obstacles to change at TUI and what has been achieved so far?
One of the biggest changes was breaking down the Dev and Ops silos and defining the SRE role. We moved from traditional Dev and Ops departments to a place where we bring these groups together, helped by the SRE role and a “you build it, you run it” philosophy. Thankfully these obstacles were not too big because we had senior buy-in. The Head of Operations saw the need for this change and was convinced it was the right thing to do, even though he would lose much of his ‘Ops’ team as they transitioned into SRE roles and became aligned to products rather than to him. This journey is by no means finished though. We only tried this with one product. We need to demonstrate success, showing that this increases velocity and business value, before we can roll out these changes at scale.
Another obstacle to transforming is getting the non-functional stakeholders on board. These are what we call the Control Tribes: Audit, Compliance, Security, Architecture, Operations, Accessibility, etc. If you increase your development velocity by 200%, but it still takes three weeks to get Security’s approval to release, then you still haven’t transformed. You may be a victim of local optimizations that don’t show the results you expect. It’s important that the Control Tribes are aligned, that they understand what their requirements are and that these requirements are visible. In other words, the non-functional requirements should be in the same backlog as all the other work. These requirements should be visible to all, and prioritized along with new features. We had great results working with Security at TUI, getting Security to express their requirements as Jiras and getting these Jiras onto the backlogs of the product teams. We now need to do the same with the other control tribes.
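The local-optimization point can be illustrated with some rough arithmetic. In the sketch below, the three-week approval wait comes from the example above, while the six weeks of development time is an assumed figure chosen only to make the sums easy. A 200% increase in velocity is read here as three times the original throughput.

```python
def lead_time_weeks(dev_weeks: float, approval_weeks: float,
                    velocity_increase_pct: float = 0.0) -> float:
    """End-to-end lead time when only the development stage speeds up."""
    speedup = 1 + velocity_increase_pct / 100   # a 200% increase means 3x throughput
    return dev_weeks / speedup + approval_weeks


# Assumed: 6 weeks of development. From the text: 3 weeks waiting for approval.
before = lead_time_weeks(dev_weeks=6, approval_weeks=3)
after = lead_time_weeks(dev_weeks=6, approval_weeks=3, velocity_increase_pct=200)
overall_speedup = before / after

print(f"lead time: {before:.0f} weeks before, {after:.0f} weeks after")
print(f"end-to-end speedup: {overall_speedup:.1f}x, not 3x")
```

Under these assumptions development becomes three times faster, yet delivery as a whole improves by less than a factor of two, because the approval wait is untouched.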
Finally, creating Communities of Practice is challenging but fundamental. We need to find the people in the organisation that are passionate about DevOps and bring them together. They will be leaders in the transformation. And they will need to communicate, communicate, communicate the plans and strategies of the transformation.
Was TUI Digital typical of the organisations you work with? If not, are there some commonalities you can share around recipes for success and pitfalls to avoid?
I think, in common with many companies, TUI are learning how to go fast, but safely. Everyone wants to deploy faster, but security controls are getting stricter, so security becomes a bottleneck. How can we make these processes lean, yet safe? TUI have achieved a fairly simple solution by representing all security control issues in Jira, and making sure these Jiras are on the same backlog as the features. Make work visible, both functional and non-functional work, and use common boards and tools.
Another common issue is defining the SRE role, as mentioned above: what is this role and how do we fill it? Many organisations are trying to figure out how to implement the SRE role and who should fill it. At TUI this started with Human Resources and defining an SRE role description. It also meant coming up with the roles and responsibilities of the SRE and how they compare with what current Operations staff are doing. What changes? What remains the same? Getting a clear picture of this role will help you find the right resources for it, who in turn will get you prepared for the cloud native world.