A Path to Platform Engineering - Beginner’s Guide
With the recently published Certified Cloud Native Platform Engineering Associate (CNPA), there is now an idea and direction for how to become a platform engineer. But there is more to know...
I often discuss the learning path to becoming a platform engineer. At conferences, after talks or at book signings, people tell me where they are in their career and ask what’s next and how to become a professional platform engineer.
A “perfect” platform engineer needs to combine 3 qualities:
40% on being an expert in the tools and technologies that play together to form a platform
40% on managing developer and operations demands, understanding their needs and applying that feedback to your platform product
20% on whatever is relevant to your organization
The last 20% can sometimes be even more important. Consider working at a financial institution: you might focus on a heavily isolated and secured environment. In contrast, if you work for a game company, latency and availability are your key drivers. Any of those aspects will drastically influence how you are going to build platforms.
Tools & Tech
I think a good guide to the tools and technologies that you should master is the platformengineering.org landscape. However, it provides tons of options that you might not need. So I will give you a very opinionated view on how I would do it if I had to start from zero again.
The order of the following segments is also the order in which I would learn them. The recommended certifications are just checkmarks, but they might help with your future employment.
Resource Plane
I would start with a cloud provider, ideally AWS or Azure; personally, I would go with AWS. The benefit is that you can get started easily, there are plenty of free materials (e.g. FreeCodeCamp) around it, and once the cloud concept is understood, you can apply it to any other provider too.
For a platform it will become very important to have a deep and fundamental understanding of networking, storage, and computing services, especially for Kubernetes.
In addition, you should spend a good amount of time learning and understanding container technology. You will meet it everywhere: in the resource plane and in CI/CD, and it is important to understand for security and observability.
But what about VMs and Bare-metal?
Honestly, those are harder to do and to learn. Generally, I’m not a huge fan of VMs; to me, they don’t make sense to have. There is at least one critical CVE for hypervisors every year, and I believe thanks to closed source (ya ya except OpenStack…) there are many 0-days we don’t even know. Yet, there are many cases where containers are also not the right thing to do.
If you have the budget and time to go down to the metal, you should do it; it is worth it. But if not, don’t hunt for it.
So, somewhere between the resource plane and the integration & delivery plane, we also have to place Infrastructure as Code (IaC). Technically, it lives in the Integration & Delivery Plane, practically it handles the resources…
My take: go with Terraform (or OpenTofu). It is the most widely used and can be applied to any provider. Some cloud provider-specific implementations might come with some benefits, but they are often less widely understood.
Qualification track:
AWS Certified Solutions Architect Associate
AWS Certified SysOps Admin Associate
AWS Certified DevOps Engineer Professional
Certified Kubernetes Administrator (CKA)
Terraform Associate
(optional, would make you stand out a little more) Terraform Professional
Unfortunately, the AWS certifications qualify you only a little for deep platform knowledge. Therefore, you absolutely should spend time on the AWS EKS Workshop.
Integration & Delivery Plane
In Platform Engineering, we might have overcomplicated the automation part. It is the core of every platform and the most relevant part to learn and understand.
For the basics, start with GitHub and GitHub Actions; they are widely used in enterprises. Microsoft provides a good learning path.
I wouldn’t worry too much about container registries at that point. You can partially cover it by digging into AWS ECR as part of your resource plane journey. But there are also more powerful solutions like Nexus and Harbor. But I wouldn’t waste time here. More important are the concepts of GitOps.
For GitOps I’m slightly opinionated: use ArgoCD. There is also Flux, which, thanks to good marketing, continuously pops up and is simpler than Argo, but I didn’t find it that often in the wild. ArgoCD has many getting-started guides. But handle it with care. To get started, ArgoCD is enough. Later in your career, Argo Rollouts, Events and Workflows might become interesting, too. What matters is that you also understand the GitOps concept in depth.
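To make "the concept" a bit more tangible: at its core, a GitOps tool keeps comparing what Git declares with what is actually running and then acts on the difference. The following is only a minimal sketch of that idea in plain Python. The names (Action, reconcile) and the example data are made up for illustration; this is not how ArgoCD or Flux are implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    kind: str  # "create", "update" or "delete"
    name: str  # name of the resource the action applies to


def reconcile(desired: dict[str, dict], live: dict[str, dict]) -> list[Action]:
    """Compare desired state (from Git) with live state and list the needed actions."""
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in live:
            actions.append(Action("create", name))
        elif live[name] != spec:
            actions.append(Action("update", name))  # drift: live differs from Git
    for name in sorted(live):
        if name not in desired:
            actions.append(Action("delete", name))  # pruning: no longer declared in Git
    return actions


# Hypothetical example data: what Git declares vs. what runs in the cluster.
desired_state = {
    "web": {"image": "web:1.4", "replicas": 3},
    "api": {"image": "api:2.0", "replicas": 2},
}
live_state = {
    "web": {"image": "web:1.4", "replicas": 5},        # someone scaled it by hand
    "worker": {"image": "worker:0.9", "replicas": 1},  # removed from Git
}

plan = reconcile(desired_state, live_state)
for action in plan:
    print(action.kind, action.name)
```

Real tools run this comparison in a loop and also deal with ordering, health checks and sync policies, which is exactly the depth you need to understand later on.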
I think neither Platform Orchestrators nor Infrastructure Control Planes are very much needed at the beginning of a Platform journey. BUT, if I could spend money on one, I would opt for an Infrastructure Control Plane. Most companies don’t use one and have pure chaos, drift and outdated infra because they don’t want to spend money on it.
Qualification track:
GitHub Actions Certification
Certified Argo Project Associate
(just if you want to burn the money) Certified GitOps Associate
At this point, I’m always unsure whether to dig into a topic that is not shown in the reference: what you should add to your CI/CD and GitOps pipelines.
What am I talking about?
Security scans e.g. before/after container builds
SBOMs
or infrastructure cost estimations
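As a rough idea of what such a pipeline step does, here is a minimal sketch of a security gate: it takes scan findings and decides whether the build may continue. The finding format, the severity names and the gate function are assumptions for illustration and do not match the output of any specific scanner.

```python
# Assumed severity scale; real scanners define their own levels.
SEVERITY_RANK = {"low": 1, "medium": 2, "high": 3, "critical": 4}


def gate(findings: list[dict], fail_on: str = "high") -> tuple[bool, list[dict]]:
    """Return (passed, blocking_findings) for a list of scan findings."""
    threshold = SEVERITY_RANK[fail_on]
    blocking = [f for f in findings if SEVERITY_RANK[f["severity"]] >= threshold]
    return (not blocking, blocking)


# Hypothetical findings, e.g. parsed from a scanner's JSON report.
findings = [
    {"id": "EXAMPLE-0001", "package": "libfoo", "severity": "medium"},
    {"id": "EXAMPLE-0002", "package": "libbar", "severity": "critical"},
]

passed, blocking = gate(findings)
print("gate passed:", passed)
for finding in blocking:
    print("blocking:", finding["id"], finding["package"], finding["severity"])
```

In a real pipeline the step would exit with a non-zero status when the gate fails, so that the CI system stops the build. The same pattern works for a cost estimation: compare the estimate against a budget and block or warn.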
Monitoring & Logging Plane (Observability)
The observability stack can be a quick one to set up, but you can also get lost in it. The de facto standard today is to apply OpenTelemetry (OTel) and use it with an open source stack like Grafana & Prometheus.
Grafana Labs provides a couple of hands-on workshops where you can get started with some ideas of what to do. Observability is easy to start with, but in platforms at scale you will have to master all of its complexities.
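If you have never looked at what Prometheus actually scrapes, it helps to see it once. The sketch below renders a counter in the Prometheus text exposition format using only the standard library. The metric name and labels are invented, label values are not escaped, and in practice you would use a client library or the OpenTelemetry SDK instead of formatting this by hand.

```python
def render_counter(
    name: str,
    help_text: str,
    samples: list[tuple[dict[str, str], float]],
) -> str:
    """Render one counter metric in the Prometheus text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        if labels:
            # Sort labels so the output is stable; values are not escaped here.
            label_str = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric name and label values, only for illustration.
text = render_counter(
    "platform_deployments_total",
    "Number of deployments handled by the platform.",
    [
        ({"team": "payments", "result": "success"}, 42),
        ({"team": "payments", "result": "failed"}, 3),
    ],
)
print(text, end="")
```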
Qualification track:
OpenTelemetry Certified Associate (OTCA)
Prometheus Certified Associate (PCA)
Security Plane
As already said, security shows up in many places. How do you configure your resources correctly and securely? Where do you apply security scans? How do you manage access? Those elements are scattered across and integrated into every domain.
Let’s start with an important one: secrets and certificate management. Although technically these are two things, we see them play together frequently. To handle secrets, we have to look at vault or key management solutions. AWS KMS is something you might come across on your cloud journey. In addition, HashiCorp Vault has become widely used, although adoption might have dropped due to license changes. A pure open source alternative, OpenBao, is in development.
At this point, we will hit a problem of the reference architecture. It shows the foundation of the platform, but it doesn’t show the inner guts of a platform. What do I have to install on Kubernetes?
Policy engines and management like Kyverno or OPA work best from within the platform. Network encryption is ensured by the resources you use and, from within the platform, via mTLS configured on the container network interface. Last, we talked about security scans in CI/CD; best practice would be to go further and bring them into the platform, for example through an admission controller that checks workloads at deployment, together with continuous scanning of the running system.
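To give a feeling for what a policy check at admission time looks like, here is a small sketch in plain Python. It only mimics the idea: a pod definition comes in, rules are evaluated, and violations lead to a rejection. The function name, the rules and the registry name are assumptions for illustration; Kyverno and OPA express such policies in their own formats.

```python
def validate_pod(pod: dict, allowed_registries: tuple[str, ...]) -> list[str]:
    """Return a list of policy violations; an empty list means the pod is admitted."""
    violations = []
    for container in pod["spec"]["containers"]:
        image = container["image"]
        # Rule 1 (assumed): images must come from an approved registry.
        if not image.startswith(allowed_registries):
            violations.append(
                f"{container['name']}: image {image} is not from an allowed registry"
            )
        # Rule 2 (assumed): privileged containers are rejected.
        if container.get("securityContext", {}).get("privileged", False):
            violations.append(
                f"{container['name']}: privileged containers are not allowed"
            )
    return violations


# Hypothetical pod definition, reduced to the fields the rules look at.
pod = {
    "metadata": {"name": "demo"},
    "spec": {
        "containers": [
            {"name": "app", "image": "registry.example.com/team/app:1.0"},
            {
                "name": "debug",
                "image": "docker.io/library/busybox:latest",
                "securityContext": {"privileged": True},
            },
        ]
    },
}

violations = validate_pod(pod, allowed_registries=("registry.example.com/",))
for violation in violations:
    print("denied:", violation)
```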
And we haven’t even touched the basics of Role-Based Access Control (RBAC).
In short, security is everywhere, in many layers, and you, as a soon-to-be Platform Engineer, have to know it too!
Qualification track:
Vault Associate
(optional, would make you stand out a little more) Vault Professional
Certified Kubernetes Security Specialist (CKS)
Maybe a CompTIA Security+ (very generic) or a cloud provider-specific security certification (very specific) comes in handy, too
Developer Control Plane
The last category is the Developer Control Plane. Many people who look at Platform Engineering and Internal Development Platforms (IDP) understand this as a Developer Portal, similar to Backstage.
The most successful platforms I have met don’t use Backstage, and many teams also hate the solution itself. In my opinion, there are many misunderstandings. Backstage is very good at showing you, in a single place, what you already have and what you can do. But what you don’t have, you can’t show off.
If you have followed the track up to this point, you should have mastered version control one way or another.
I usually see at this point in companies that there are two paths taken:
Providing unified development environments
Starting to work with a Developer Portal
Both ways are a good choice, but they massively depend on the corporate problems you want to solve. (That’s why the next chapter is so important).
Qualification track:
Certified Kubernetes Application Developer (CKAD)
Certified Backstage Associate (CBA)
(The missing child) - The Capability Plane
For me, there is one part missing in the reference and learning path, and that’s what is on the platform. I call it the Capability Plane. Many things that we implement for the platform are around and under the platform, and they have to be made available in the platform, too. They are reflected as features or capabilities of the platform provided to the platform user:
As you can see, we again find many pieces that we talked about before in security, networking or observability.
There are some aspects that are important to dig deeper into:
Scaling and scheduling are core features of Kubernetes, but they need to be fine-tuned to your environments. On a cloud provider, Karpenter comes in handy for optimizing the infrastructure. An application-aware or serverless-aware framework might be useful, too. And in times of massive data and AI, a batch system like Volcano is also very much needed.
All of those also need the right integration of resources, and especially good management and access control.
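To make the scaling part a bit more concrete, here is a sketch of the replica calculation that the Kubernetes documentation describes for the Horizontal Pod Autoscaler: the desired replica count is the current count multiplied by the ratio of current to target metric, rounded up, and no scaling happens while the ratio stays within a tolerance (0.1 by default). The function name is my own, and min/max replica bounds and stabilization windows are left out on purpose.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    tolerance: float = 0.1,
) -> int:
    """Replica count following the documented HPA formula (simplified)."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to the target: do not scale
    return math.ceil(current_replicas * ratio)


# Hypothetical numbers: average CPU per pod in millicores against a 100m target.
print(desired_replicas(4, current_metric=200, target_metric=100))  # load doubled
print(desired_replicas(4, current_metric=50, target_metric=100))   # load halved
print(desired_replicas(4, current_metric=105, target_metric=100))  # within tolerance
```

Fine-tuning then means choosing the right metric, target and bounds for your workloads, which is where the environment-specific work begins.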
For me personally, the capability plane is what brings the value to the user, because a platform consists of two layers:
the infrastructure, automation and integrations that are forming the platform
the services and capabilities that make the platform useful for the end user
People, Culture and Processes
A shameless plug at this point for my book Platform Engineering for Architects - Crafting modern platforms as a product. We aimed to write a book that is timeless, focusing on every element of platform engineering that will ensure its success, but without sticking to a specific technology.
Therefore, here is a rundown of the important parts:
A platform's purpose - don’t build a platform because it’s cool. A platform should fulfill a purpose, helping your organization become better in its engineering journey and solving problems that your developers, operations teams, DevOps, etc. are facing.
Defining principles will help you stay on track and not start with “Randomeering”
Before you aim for an aircraft carrier’s worth of complexity, start small and simple with the thinnest viable platform (TVP). Develop the one feature that will help one of your end user groups a lot.
You have to fight Conway's Law. Break it, change it, make it work for you.
Target Developer Experience and the reduction of Cognitive Load. Today, we often hear “shift left”. I think we shifted too much left, throwing more and more on the plates of devs. And it is not only on their plates: other users are also facing a drastic increase in complexity. For example, in the last 10 years I haven’t met a corporate security team that was up to date with the technology we were working with. As a platform engineer, you have to help them.
Security is everywhere. Many security issues are created by people and processes. It’s often not the technology’s fault.
Learn to manage technical debt. Open source and cloud native are moving fast, and there is no way that you can make a technical decision without causing technical debt.
You will need to learn how to optimize your platform usage, shrinking costs and increasing utilization, for example, of expensive hardware.
Build golden (opinionated) paths with escape routes. Your standardization layers, first the Kubernetes API and second the cloud API, will still keep the user on a controllable field.
Besides all those aspects, as a platform engineer you have to learn to work together with people, get their feedback, be open to their contributions and show them how to be successful. A good success factor is inner open communities. Be welcoming to their contributions and frequently give introductions and updates on what is new.
As a platform engineer, you, your management, and your executives need to learn to understand the value of product thinking. No matter how good a platform is, it dies if it is treated like a project and loses budget, people and priority after the project is done. Internal Development Platforms deliver a lot of value to your organization. But you have to make this transparent. For this, measure different metrics, from DORA and SPACE to DevX, before the introduction of an IDP. Keep measuring during the introduction and continuously afterwards. This will show the continuous value of the system, of your team and your work.
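Measuring does not have to start with a big tool. As a minimal sketch, three of the DORA metrics (deployment frequency, lead time for changes and change failure rate) can be calculated from a simple list of deployment records. The record layout, the function name and the numbers are made up for illustration, and time to restore service is left out.

```python
from datetime import datetime
from statistics import median

# Hypothetical deployment records: (commit time, deploy time, caused a failure?)
deployments = [
    (datetime(2024, 5, 1, 9, 0), datetime(2024, 5, 1, 15, 0), False),
    (datetime(2024, 5, 2, 10, 0), datetime(2024, 5, 3, 10, 0), True),
    (datetime(2024, 5, 6, 8, 0), datetime(2024, 5, 6, 12, 0), False),
    (datetime(2024, 5, 7, 9, 0), datetime(2024, 5, 7, 11, 0), False),
]


def dora_summary(records: list[tuple[datetime, datetime, bool]], period_days: int) -> dict:
    """Summarize deployment frequency, lead time and change failure rate."""
    lead_times = [deployed - committed for committed, deployed, _ in records]
    failures = sum(1 for _, _, failed in records if failed)
    return {
        "deployments_per_week": len(records) * 7 / period_days,
        "median_lead_time_hours": median(lt.total_seconds() for lt in lead_times) / 3600,
        "change_failure_rate": failures / len(records),
    }


summary = dora_summary(deployments, period_days=14)
for metric, value in summary.items():
    print(metric, value)
```

Run the same calculation before, during and after the IDP introduction, and you have a simple baseline to compare against.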
A Platform Engineer is many things: it is not a new role, it is a new level
I’m always very proud of my people and team. Those are the best people you can find. But why?
A Platform Engineer is not a new role; it is an entirely new level.
To become a Platform Engineer, you need to master the roles of a Cloud Engineer, a DevOps Engineer, a Requirements Engineer and, depending on the direction of your platform, also of an SRE, a Security Engineer and, for sure in the end, a Solutions Architect.
To be a professional Platform Engineer means you are at least 80% better in every one of those domains than the rest of us.
And yes, that’s maybe a bold statement. But if you don’t understand this statement, you might not have met a professional platform engineer yet.
See you out there, cheers!