If you have worked with Microsoft Azure before, you may have managed infrastructure in one way or another. In Azure, just as in any cloud platform, infrastructure can be created and altered quickly and easily. Using one or more of the Azure portal, PowerShell cmdlets, RESTful HTTP calls, SDKs, or ARM templates, you can create servers or PaaS and SaaS services in minutes or even seconds. This is in contrast to how infrastructure was managed in the past, or often still is on-premises.
The unique proposition of the cloud has transformed the way we create and operate software in the last decade. In particular, the way we manage infrastructure that runs applications has changed. Creating cloud infrastructure on demand and discarding it hours or days later has become a common approach, especially for test environments.
Two characteristics of the cloud, in particular, have accelerated this change:
Elasticity is a characteristic of cloud computing. In the context of the cloud, elasticity is the capability to quickly add or remove resources from your infrastructure. Unlike traditional server deployments, clouds allow you to pay for infrastructure by the hour, minute, or even second, which allows for flexibility and encourages different approaches to provisioning infrastructure.
Self-service is a second characteristic. All of the major cloud vendors provide their users with graphical user interfaces (GUIs), command-line interfaces (CLIs), and APIs that they can use to create, update, and remove instances of the services they make available. Nowadays, all cloud providers use an API-first strategy, and one outcome of this is that every operation is also available through their management APIs, not just through the user interface or other tools.
The combination of these characteristics causes us to treat cloud infrastructure differently than traditional on-premises infrastructure. Spinning up complete configurations spanning tens of services can now be done in a matter of minutes. You can do this either using the major cloud providers’ portals or by using scripts in your deployment pipelines.
However, using portals or CLIs to do this has downsides: it is challenging to manage your cloud infrastructure reliably over time. Examples include changes being incompletely tracked and developers needing to access production environments with personal accounts. For this reason, another approach to managing infrastructure has become the go-to option for most teams: Infrastructure as Code (IaC).
In this chapter, you’ll learn more about managing cloud infrastructure in general and about the benefits of using IaC over manual and scripted approaches. Then we’ll look at the Azure Resource Manager (ARM), the service that you interact with to manage your infrastructure in Azure, and at a few other tools for managing Azure infrastructure.
Infrastructure as Code (IaC) is a modern approach for managing infrastructure. Instead of creating and configuring infrastructure manually, using graphical interfaces, all infrastructure is described in configuration files that are then used to create the infrastructure automatically. For Azure, IaC is written in Azure Resource Manager (ARM) templates or Bicep files, which are submitted to ARM for processing.
When we talk about infrastructure in the context of Azure, we are referring to all Azure resources that you can use as part of your solution architecture. An obvious example would be a virtual machine or storage account, but infrastructure in the context of IaC also includes service bus messaging queues, dashboards, app services, and any other deployable Azure resource.
Before we dive into the background of ARM and the benefits of IaC, let’s look at an example. Figure 1.1 shows a small snippet of an ARM template and how it can be used to create Azure resources, like an Azure storage account.
ARM templates (at the left of figure 1.1) are formal descriptions of what infrastructure needs to exist and how it is configured. These templates are then applied to an Azure environment, creating the infrastructure described (at the right of figure 1.1). If a resource with the specified name and type already exists, its configuration is updated instead.
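To give a rough idea of what such a template contains, the following Python sketch builds a minimal template for a single storage account and prints it as JSON. The account name, location, SKU, and kind are illustrative assumptions and are not taken from figure 1.1; the helper function is hypothetical and only serves to produce the JSON document.

```python
import json


def storage_account_template(name: str, location: str) -> dict:
    """Return a minimal ARM template that describes one storage account."""
    return {
        "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
        "contentVersion": "1.0.0.0",
        "resources": [
            {
                "type": "Microsoft.Storage/storageAccounts",
                "apiVersion": "2021-04-01",
                "name": name,  # assumed example name
                "location": location,  # assumed example region
                "sku": {"name": "Standard_LRS"},
                "kind": "StorageV2",
            }
        ],
    }


template = storage_account_template("examplestorage001", "westeurope")
print(json.dumps(template, indent=2))
```

The printed JSON is the kind of document that is submitted to ARM; the type and name of each resource determine whether it is created or updated.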
As you have already seen, the characteristics of the public cloud encourage the use of IaC, but that’s not the only reason for using IaC. Two other drivers are the DevOps culture and the desire to prevent configuration drift. The next sections discuss these two topics in detail.
DevOps is a cultural movement that aims at breaking down the traditional barriers between development and operations teams. In a traditional organization with operations and development teams, the two types of teams have clear responsibilities.
Development or application teams are mainly responsible for implementing new requirements. They are concerned with introducing as many changes as possible, as that is how they implement new user requirements.
Operations teams are responsible for managing the infrastructure and often any applications deployed to it. The operations team is mainly concerned with keeping things running, which, in general, works best when there are as few changes as possible.
Figure 1.2 shows what this looks like. Here you see a dedicated operations team that manages infrastructure and other runtime components. A separate development team writes updates and hands them over to operations for deployment. If such an update requires a change in the infrastructure, this has to be requested beforehand, often well in advance. These infrastructure changes have to be coordinated between teams and are often slow to complete.
In many organizations, the opposing goals of these teams or even of complete departments lead to unhealthy situations, such as these:
Operations teams can become resistant to change because changes introduce new risks into the environments they have to manage. The introduction of validation procedures, quality controls, approvals, or any other type of gatekeeping behavior limits the flow of change.
Development teams push changes of insufficient quality because they receive praise for the amount of change they create. At the same time, operations teams are impacted by any downtime that results from bugs or issues in the software that is released.
Of course, this causes problems for the organization as a whole, which is served best by controlled, well-coordinated changes that implement new requirements while the existing infrastructure and applications keep running smoothly.
The DevOps movement advocates that developers and operators should work together in a single team toward this shared goal: the continuous delivery of high-quality software that creates business value. The subgoals of stability and change should be committed to by this single team that combines operations and development expertise. While doing this, a DevOps team often adopts development practices to perform operational duties.
In practice, this means that a new, now-combined, DevOps team is responsible for creating their own infrastructure (see figure 1.3). Often this also means that IT professionals start to apply development techniques to their day-to-day work. They move away from applying and verifying changes manually through the user interface and adopt advanced text editors, automated installation scripts, and IaC. IaC allows developers and operators to work together to describe and configure any infrastructure needed by their application deployment. Together they can promote the infrastructure changes and the application artifacts to a test environment, and after verification to a production environment.
Figure 1.3 A DevOps team is aligned with the applications and infrastructure it is responsible for managing.
Next to self-service APIs and DevOps, another driver for the adoption of IaC is the prevention of a problem called configuration drift. Configuration drift is a phenomenon observed on infrastructure that is managed manually. It doesn’t matter if it is managed through the command line or a graphical interface—configuration drift can happen in both cases.
Configuration drift refers to differences that develop over time on either of two dimensions:
A difference between environments, such as between test and production
A difference within an environment, such as between two virtual machines that should be configured exactly the same and host two instances of the same application
To see how this configuration drift can occur, imagine an infrastructure configuration with identical virtual machines (VMs) in both a test and a production environment. The two environments should be configured in precisely the same way, because the test environment is also used for load and stress testing.
Figure 1.4 illustrates two types of configuration drift. First, there is an unintended difference between the test and the production environment, as the production VMs have more memory allocated than the test VMs. Second, there is a drift within the production environment, as one of the VMs has four cores instead of the desired two.
Configuration drift is often the result of an unexpected, incomplete, or incorrectly executed change. When a change is required to the configuration of any infrastructure component, that change must be applied to each instance of the infrastructure, one by one. But other things can happen as well:
A change is made to the development, test, acceptance, and production environments, after which an issue with the change is found at night: a bug. The change can easily be reverted, so it is reverted in the production environment. There is a lot of user feedback to deal with the next day, and reverting the change in the other environments is forgotten.
During a major outage, all non-production environments go down and have to be restored manually. Accidentally, a more recent build of the operating system is used on the virtual machines, making them behave differently than the virtual machines in production.
Differences between environments like these can cause future problems. For example, test results from the test environment will no longer be representative of how a particular change will affect the production environment. Given enough time, configuration drift will affect any environment and result in unpredictable behavior. IaC can help remediate configuration drift by re-applying the infrastructure specification regularly. Because all settings are stored in source control and applied automatically, all changes are detected and corrected automatically.
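The detection step can be sketched in a few lines of Python. This is a simplified illustration and not how ARM works internally: the machine names, the settings, and the desired values are assumptions loosely modeled on the example in figure 1.4.

```python
from typing import Dict, List

Config = Dict[str, int]


def find_drift(desired: Config, actual: Dict[str, Config]) -> List[str]:
    """List every setting on every machine that differs from the desired state."""
    findings = []
    for machine, config in sorted(actual.items()):
        for setting, wanted in sorted(desired.items()):
            found = config.get(setting)
            if found != wanted:
                findings.append(f"{machine}: {setting} is {found}, expected {wanted}")
    return findings


# Assumed desired state: every VM has two cores and 8 GB of memory.
desired = {"cores": 2, "memory_gb": 8}
actual = {
    "test-vm-1": {"cores": 2, "memory_gb": 8},
    "test-vm-2": {"cores": 2, "memory_gb": 8},
    "prod-vm-1": {"cores": 2, "memory_gb": 16},
    "prod-vm-2": {"cores": 4, "memory_gb": 16},
}

for finding in find_drift(desired, actual):
    print(finding)
```

Both kinds of drift show up in the output: the production machines differ from the test machines in memory, and one production machine differs from its sibling in cores.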
We’ve mentioned three main drivers for using IaC—namely the cloud, DevOps, and the prevention of configuration drift—but there are still other benefits of IaC. Let’s take some time to explore the benefits IaC offers over managing infrastructure manually or through scripts.
Once a team moves to IaC, often because of one of the drivers we’ve already discussed, they will also start observing other benefits. As with many developments in our field, this change will not only help to overcome existing problems but will also inspire new ways of working.
These are three common benefits:
IaC allows for automation, which saves time if you are often creating new environments.
IaC allows for a declarative approach, which allows you to focus on the desired state and not on how to get there.
IaC provides a human-readable format, which allows developers to reason about the state of the infrastructure.
The next three subsections discuss these benefits in turn.
As you may have guessed by now, IaC is applied using tools, and tools imply automation. This delivers two additional benefits, besides saving time: guaranteed outcomes and environment reproducibility.
Automatically creating and configuring environments not only saves time, it also provides guaranteed outcomes. When creating and configuring an Azure virtual machine manually, there are tens if not hundreds of configuration options that have to be checked. In practice, this is very error-prone work, and mistakes are very likely to happen. Asking five different people to create an Azure VM with 4 CPU cores and 8 GB of memory, running Windows Server 2019 Datacenter, will most likely result in five virtual machines that are all configured differently.
With IaC, this is not the case. After you write the desired infrastructure in a code file, the same file can be applied repeatedly, and the IaC tools guarantee that the outcome is the same every time. Verifying configuration or testing outcomes is no longer necessary when working with IaC. It not only saves a lot of time; it also improves quality.
Once an IaC file is written, the cost of creating the described infrastructure is almost zero. It is just a matter of starting the tool, and the required infrastructure resources are created and available a few minutes later. This unlocks all kinds of new approaches to testing, deploying, and running infrastructure.
Just the ability to automatically remove development and test environments at 6 P.M. and re-create them automatically at 7 A.M. on working days can save organizations anywhere between 30% and 60% of their infrastructure costs, compared to keeping infrastructure running 24/7.
Also, if you have ever been responsible for test infrastructure, you’ll know how hard it is to keep test infrastructure usable. Every test failure can pollute the infrastructure and trigger false test failures in the future, due to the inconsistent state of the previous run. Just imagine the possibility of creating new infrastructure, in a guaranteed state, before each test run starts. And all of this at no additional cost. The reduction in false test failures will save a lot of time, money, and negative energy spent by you and your team.
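The savings from the 6 P.M. to 7 A.M. schedule mentioned earlier can be checked with some quick arithmetic. The sketch below assumes five working days per week and resources that are billed purely per running hour, which is a simplification.

```python
HOURS_PER_WEEK = 7 * 24


def weekly_running_hours(start_hour: int, stop_hour: int, working_days: int = 5) -> int:
    """Hours per week an environment runs if it only exists during working hours."""
    return (stop_hour - start_hour) * working_days


def hours_avoided_fraction(start_hour: int, stop_hour: int, working_days: int = 5) -> float:
    """Fraction of always-on running hours that the schedule avoids."""
    return 1 - weekly_running_hours(start_hour, stop_hour, working_days) / HOURS_PER_WEEK


running = weekly_running_hours(7, 18)  # created at 7 A.M., removed at 6 P.M.
saved = hours_avoided_fraction(7, 18)
print(f"{running} of {HOURS_PER_WEEK} hours running, {saved:.0%} of hours avoided")
```

Under these assumptions the development and test environments run for 55 of 168 hours, so roughly two-thirds of their running hours are avoided. The overall saving is lower than that, because production keeps running and some resources, such as storage, are not billed per running hour.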
IaC can be written in two different styles: declarative and imperative. With the declarative style, the source files describe the desired state of the infrastructure. The execution engine is then responsible for comparing the desired state with the actual state, determining the differences, and identifying and executing a series of commands to make the actual state correspond to the desired state.
This approach is similar to Structured Query Language (SQL). You can use SQL to describe which records should or should not be in your result, rather than having to specify the commands to execute. The database engine is then responsible for determining which commands should be executed to reach that desired result.
With the imperative style, you do not describe the intended end result but instead describe the series of commands, steps, or program code to execute.
Note The term Infrastructure as Code is also used for approaches where scripts are stored in source control. While this is a correct use of the term, most IaC approaches, including ARM templates, use a declarative approach.
The first benefit of a declarative approach is that it enhances both the ease of writing and the ease of reading. Writing in a declarative style is easier, because the writer does not have to worry about how the infrastructure is created. They just need to describe what is needed in the end, and the tool translates this into the how. This applies both to when infrastructure is created the first time and when infrastructure configuration is updated. In an imperative approach, this would result in a lot of if-then-else coding; in a declarative approach, if-then-else is not necessary. As an example, see these declarative statements:
There should be a car
The car should be green
The car should have four wheels
Compare that with these imperative statements:
If there is no car
    Create a car
If the car is not green
    Make the car green
While the car has more than four wheels
    Remove a wheel from the car
While the car has fewer than four wheels
    Add a wheel to the car
As this example shows, the declarative style improves the ease of writing and also enhances reading ease, as it focuses solely on the desired state.
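To make the role of the execution engine more concrete, here is a toy version for the car example, written in Python. It takes the declared desired state and the actual state, and derives the imperative steps itself. The function name and the state representation are made up for this illustration.

```python
from typing import Any, Dict, List, Optional


def plan(actual: Optional[Dict[str, Any]], desired: Dict[str, Any]) -> List[str]:
    """Derive the steps that turn the actual state into the desired state."""
    steps: List[str] = []
    if actual is None:
        steps.append("create a car")
        actual = {"color": None, "wheels": 0}
    if actual["color"] != desired["color"]:
        steps.append(f"make the car {desired['color']}")
    difference = desired["wheels"] - actual["wheels"]
    if difference > 0:
        steps.extend(["add a wheel"] * difference)
    elif difference < 0:
        steps.extend(["remove a wheel"] * -difference)
    return steps


# The declaration stays the same, whatever the starting point is.
desired_car = {"color": "green", "wheels": 4}

print(plan(None, desired_car))
print(plan({"color": "red", "wheels": 5}, desired_car))
print(plan({"color": "green", "wheels": 4}, desired_car))
```

The same three-line declaration works whether there is no car, a wrong car, or a correct car already. All of the if-then-else logic lives in the engine, which is written once, instead of in every infrastructure definition.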
The second benefit of a declarative approach is that the execution engine can be optimized without impacting the IaC declarations. In the similar case of SQL statements, SQL database engines have seen significant changes and optimizations over the last few decades, but most SQL statements written years ago still execute without any changes.
The third benefit of IaC is that it leverages human-readable formats. Some IaC tools use JSON or YAML, and others use a custom domain-specific language (DSL) or an existing programming language. Azure Resource Manager templates use JSON, which stands for JavaScript Object Notation. This human-readable format provides us with a version-controllable, auditable, and reviewable definition of application infrastructure. On top of ARM templates, an easier, more approachable DSL called Bicep has been introduced.
Human-readable, non-binary files can be stored in a source control system, just like source code for an application. Source control systems provide users with a centralized, single source for the latest version of a file, along with a full history of all changes. Gone are the days when you had to record all the infrastructure changes manually to go back and find out which changes were made when, by whom, and why. With source control, you automatically have the complete change history readily available. Another consequence of this is that if there is ever the need to roll back a change, the previous configuration can quickly be restored.
IaC files are readable and all changes are recorded in source control, which makes them instantly auditable by security reviewers, external auditors, and any other party interested in the changes you are making. Source control provides a full audit log of all the changes made and by whom.
Source control systems allow you to automatically enforce standards before any change is made final. This can include automated formatting checks, automated build checks, or even enforced peer reviews—this functionality is built into most source control systems.
Now that you know about the extra benefits you can get with IaC, let’s turn to the Azure Resource Manager. Azure Resource Manager is Azure’s service for working with IaC.
We’ve discussed the drivers and benefits for IaC, so it’s now time to dive a bit deeper into the IaC system for Azure. The first thing to understand here is that all Azure infrastructure management is done using the Azure Resource Manager (ARM). ARM is a RESTful HTTP API that you can call to list, create, update, and delete all resources in your Azure subscriptions. If you interact with Azure through the portal, the CLI, or Azure PowerShell, you are also using ARM under the hood.
ARM is the basis for the Azure IaC capabilities provided via ARM templates. ARM is the execution engine for IaC. But before we dive into ARM templates, it is important to know what the control plane and data plane are, how they differ, and what you can and can’t do with ARM templates.
Each interaction you have with Azure is either a control plane or a data plane operation. Simply put, you use the control plane to manage resources in your subscription, and you use the data plane to employ the capabilities exposed by your instances of specific resource types. In Azure, there is a single, unified control plane: the Azure Resource Manager.
To make the difference between the control plane and data plane clearer, here are a few examples:
You create an Azure SQL database through the control plane. Once it’s created, you use the data plane to connect to it and perform SQL queries.
You create a Linux virtual machine through the control plane. Then you use the data plane to interact with it over the SSH protocol.
Requests sent to the control plane are all sent to the Azure Resource Manager URL; for the global cloud, that is https://management.azure.com. From this URL, it is possible to build complete URLs that identify any Azure resource. For example, the following pseudo URL points to a virtual machine:
GET https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroupName}/providers/Microsoft.Compute/virtualMachines/{virtualMachineName}?api-version=2021-04-01
Suppose you are logged into the Azure portal and you copy this URL into your browser with valid values for subscriptionId, resourceGroupName, and virtualMachineName. The response would be a JSON description of the virtual machine. If you study the response in detail and compare it to an ARM template for virtual machines, you’ll quickly notice that they are the same (with only a few default properties omitted).
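The way such a URL is put together can be sketched in Python. The helper function and the placeholder values below are hypothetical; only the URL layout follows the pseudo URL shown earlier.

```python
from urllib.parse import quote

ARM_ENDPOINT = "https://management.azure.com"  # control plane URL for the global cloud


def virtual_machine_url(
    subscription_id: str,
    resource_group: str,
    vm_name: str,
    api_version: str = "2021-04-01",
) -> str:
    """Build the control plane URL that identifies one virtual machine."""
    path = "/".join(
        [
            "subscriptions",
            quote(subscription_id, safe=""),
            "resourceGroups",
            quote(resource_group, safe=""),
            "providers",
            "Microsoft.Compute",
            "virtualMachines",
            quote(vm_name, safe=""),
        ]
    )
    return f"{ARM_ENDPOINT}/{path}?api-version={api_version}"


# Placeholder values, not real resources.
print(virtual_machine_url("00000000-0000-0000-0000-000000000000", "rg-example", "vm-example"))
```

When this URL is called from code rather than from a signed-in session, the request also has to carry an Authorization header with a bearer token. Acquiring that token is left out of this sketch.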
Interactions with a resource on the data plane always happen on an endpoint specific to that resource. This means that data plane operations are not limited to REST but could use HTTPS, FTP, or any other protocol. Interactions with the control plane happen through the ARM APIs or through ARM templates.
The ARM APIs can be used to manage infrastructure in an imperative style, using provisioning scripts. If you prefer a declarative style, ARM templates are available.
ARM templates are written in JSON or Bicep and are used for any of the following purposes:
A resource group template is used to deploy one or more resources into an Azure resource group.
Subscription templates are used to deploy resource groups, policies, and authorizations to an Azure subscription.
Management group templates are used to deploy subscriptions, nested management groups, policies, and authorizations into a management group.
Tenant-level templates are used to deploy nested management groups, policies, and authorizations into the Azure Active Directory.
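In the JSON form, the scope a template targets is visible in its $schema property. The Python sketch below lists the schema URLs for the four scopes as we know them at the time of writing; treat the exact version dates as an assumption and check the current documentation before relying on them. The scope keys and the helper function are our own naming.

```python
SCHEMA_BASE = "https://schema.management.azure.com/schemas"

# Schema URL per deployment scope (version dates may change over time).
SCOPE_SCHEMAS = {
    "resourceGroup": f"{SCHEMA_BASE}/2019-04-01/deploymentTemplate.json#",
    "subscription": f"{SCHEMA_BASE}/2018-05-01/subscriptionDeploymentTemplate.json#",
    "managementGroup": f"{SCHEMA_BASE}/2019-08-01/managementGroupDeploymentTemplate.json#",
    "tenant": f"{SCHEMA_BASE}/2019-08-01/tenantDeploymentTemplate.json#",
}


def empty_template(scope: str) -> dict:
    """Return an empty template skeleton for the given deployment scope."""
    if scope not in SCOPE_SCHEMAS:
        raise ValueError(f"unknown deployment scope: {scope}")
    return {
        "$schema": SCOPE_SCHEMAS[scope],
        "contentVersion": "1.0.0.0",
        "resources": [],
    }


for scope in SCOPE_SCHEMAS:
    print(scope, "->", empty_template(scope)["$schema"])
```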
If you already have a basic understanding of the Azure hierarchy, the preceding list will show that you can completely manage Azure using ARM templates. If you don’t understand all the terms mentioned here, don’t worry—all these concepts will be explained in more detail in chapter 3.
While ARM templates are compelling and they allow you to manage all of Azure, an often-heard complaint is that they can be challenging to write and pretty verbose to read. To provide a solution to this, Microsoft recently launched project Bicep.
ARM templates are written as JSON files, but one of the disadvantages of JSON is that it can become quite lengthy when expressing complex structures. This lengthiness can make files difficult to maintain or read. Another downside of JSON is that there is no out-of-the-box support for control structures like loops or conditions. While ARM provides workarounds for this, ARM templates do take a while to master.
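One of those workarounds is the copy element, which asks ARM to create a resource several times. The Python sketch below prints such a resource definition as JSON and also shows the names the loop would produce. The name prefix, the copy name, and the count are assumptions chosen for this example.

```python
import json


def storage_accounts_with_copy(count: int) -> dict:
    """Return one resource definition that ARM expands into `count` storage accounts."""
    return {
        "type": "Microsoft.Storage/storageAccounts",
        "apiVersion": "2021-04-01",
        # copyIndex() is evaluated by ARM and starts at zero.
        "name": "[concat('examplestorage', copyIndex())]",
        "location": "[resourceGroup().location]",
        "sku": {"name": "Standard_LRS"},
        "kind": "StorageV2",
        "copy": {"name": "storageCopy", "count": count},
    }


def expanded_names(count: int) -> list:
    """Mimic locally which names the copy loop produces."""
    return [f"examplestorage{index}" for index in range(count)]


resource = storage_accounts_with_copy(3)
print(json.dumps(resource, indent=2))
print(expanded_names(3))
```

The loop is expressed as data, with the iteration index only available through a template function inside a string. This is part of why templates take a while to master.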
To provide a solution to these problems, Microsoft has introduced a new domain-specific language (DSL) as an alternative way to write ARM templates. This DSL is called Bicep, a play on the name ARM. Chapter 6 discusses Bicep in depth.
Before the existence of the Azure Resource Manager, another system was available for managing resources within Azure: Azure Service Management (ASM). ASM is no longer in use, but it is good to know of its existence and how it differs from ARM. Even if you only use that knowledge to detect and discard outdated online content, it is worth it.
Microsoft introduced Azure Service Management (ASM) as part of the Azure cloud (then still named Windows Azure) around 2009. ASM was the first HTTP interface provided for managing Azure resources. Before that, while Azure was still in preview, the management of resources was only possible using a web interface now called the classic portal. Looking back, ASM was the first iteration of an interface for managing Azure resources.
ASM has no built-in support for IaC and is rarely ever used in production nowadays. Still, it is good to know what ASM is and to stay away from anything related to it. While the names Azure Resource Manager and Azure Service Management may look similar at first sight, they are nothing alike.
Azure Resource Manager and its ARM templates are the built-in approach for managing infrastructure within Azure. But there are also other tools available for IaC, both on Azure and in other public clouds. The next section describes some of them to help you build a broader understanding of IaC.
ARM templates are just one of many IaC approaches available. This section will explore a few other well-known tools to help you understand which tools are available and which one makes sense in which situations.
Note Our focus here is on tools that can be used for IaC in cloud environments. There are other tools available for managing state within virtual machines, such as PowerShell DSC, Puppet, Chef, and Ansible. We won’t be discussing those here.
When considering IaC tools for the cloud, one characteristic is the most important: is the tool single-cloud or multi-cloud? When you’re working in only one cloud, you can consider using the IaC tool specifically intended for that cloud. For Azure, you can use ARM templates; for Amazon Web Services (AWS), you can use CloudFormation; and for Google Cloud Platform, there is Google Deployment Manager. Alternatively, there are multi-cloud options like Terraform or Pulumi. While these tools allow you to manage multiple environments or clouds from a single IaC script, it is also possible to use them when you’re only working with Azure.
We’ll look at all these tools in the next few sections.
CloudFormation is an AWS service for managing infrastructure. Each deployment of a group of resources is called a stack. A stack is a persistent grouping of resources within a single AWS account and region; to deploy across multiple regions or accounts, CloudFormation offers StackSets. When you redeploy a template to the same stack, all existing resources in the stack are updated. CloudFormation also deletes resources that are part of the stack but no longer part of the template. Overall, CloudFormation templates are very comparable to ARM templates when it comes to their layout and capabilities.
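The redeployment behavior of a stack can be pictured as a comparison of two sets of resources. The Python sketch below is a deliberately simplified model, not CloudFormation's actual algorithm, and the resource names are made up.

```python
from typing import Dict, Set


def stack_changes(stack: Set[str], template: Set[str]) -> Dict[str, Set[str]]:
    """Classify resources by what a redeployment of the template does to them."""
    return {
        "create": template - stack,  # in the template, not yet in the stack
        "update": template & stack,  # in both
        "delete": stack - template,  # in the stack, no longer in the template
    }


changes = stack_changes(
    stack={"bucket", "queue", "table"},
    template={"bucket", "queue", "function"},
)
for action in ("create", "update", "delete"):
    print(action, sorted(changes[action]))
```

In this model, the table is removed because it has disappeared from the template, while the function is added.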
Google Deployment Manager is the built-in approach to IaC for the Google Cloud Platform (GCP). To deploy a simple set of resources, YAML is used in a very similar layout and style to CloudFormation or ARM templates. However, the Deployment Manager’s YAML configuration is more limited, as it does not allow for parameters, variables, and outputs, like CloudFormation and ARM templates do.
For more advanced features, Deployment Manager allows you to write reusable templates using Python (preferred) or Jinja2. When using Python, the Python language’s full power can be used, including loops, conditionals, and external libraries, to build and return an array of resources. Note that doing so removes the declarative nature from templates. These templates are then imported into the YAML files and deployed from there.
HashiCorp has developed an IaC tool called Terraform. Terraform is open source and is based upon a split between the DSL used for declaring resources and the so-called providers that specify which resources are available for use. The DSL used by Terraform is called HashiCorp Configuration Language (HCL), which defines its structure, syntax, and semantics.
Terraform providers are available for all major cloud providers and other target platforms and tools, including VMware, Azure DevOps, GitLab, Grafana, and many more. Another thing that differs between ARM templates and Terraform is that Terraform uses a state file.
For Azure, there is a Terraform provider developed by Microsoft. This Terraform provider is almost as feature-complete as ARM templates, but it can sometimes still lag in functionality. The reason for this is straightforward: ARM templates use built-in functionality, while functionality needs to be explicitly added to the Terraform provider.
Pulumi differs from most other IaC tools in that it doesn’t use YAML, JSON, or a DSL, but actual program code for managing IaC. Pulumi has language support for Node.js, Python, .NET Core, and Go. Cloud-wise, there is support for Azure, AWS, and GCP. Using one of the supported languages, a model is constructed that represents the desired infrastructure stack. The outcome of running the program code is this model, the declaration, which is then handed to the Pulumi engine for execution.
One of the significant advantages of using an existing programming language for defining infrastructure is that all of the tools and technologies surrounding that programming language are also available for your infrastructure definition. The most prominent example of this is the ability to run unit tests against the infrastructure definition.
Besides supporting all Azure resources, including Azure policies, Pulumi also has a built-in policy engine. This engine allows the use of a single policy engine for more than one cloud. The advantage of this is that you have a single entry point for all policy evaluations. The disadvantage is that the policies are only executed during deployment and not continuously in a deployed environment. Azure Policy, which is the topic of chapter 12, does allow for this continuous evaluation.
When you are consistently working across more than one cloud, you have to choose between using two or more cloud-specific solutions or a single multi-cloud IaC solution.
Cloud-specific solutions often have deeper integration with the underlying platform and provide unique benefits that multi-cloud solutions might not. The downside of using more than one solution is the increased number of tools. On the other hand, multi-cloud solutions can offer specific options that cloud-specific options do not. As an example, look at the policy engine that Pulumi offers. In the end, it is up to you to weigh both alternatives’ pros and cons and make the best decision for your context.
Almost everyone who works with Microsoft Azure has been managing cloud resources one way or another. Typical management of cloud resources includes creating, configuring, and removing resources in Azure. Examples of resources are virtual machines, App Service plans, and storage accounts.
Manually managing cloud infrastructure at scale is tedious, repetitive work that can introduce errors. The elasticity and self-service characteristics of public clouds, the DevOps culture, and the prevention of configuration drift are three drivers toward IaC.
The benefits of IaC for Azure are automation, its declarative approach, and its human-readable nature. These characteristics provide you with repeatable and guaranteed outcomes, ease of understanding, and an auditable and reviewable history of infrastructure changes.
Azure Resource Manager (ARM) is the API or application used for managing Azure resources since 2015. ARM templates are written in JSON, which can be lengthy to write and read. Bicep, a new DSL, has been introduced by Microsoft to overcome this.