1 Infrastructure as Code

This chapter covers

If you have worked with Microsoft Azure before, you may have managed infrastructure in one way or another. In Azure, just as in any cloud platform, infrastructure can be created and altered quickly and easily. Using one or more of the Azure portal, PowerShell cmdlets, RESTful HTTP calls, SDKs, or ARM templates, you can create servers or PaaS and SaaS services in minutes or even seconds. This is in contrast to how infrastructure was managed in the past, and how it often still is managed on-premises.

The unique proposition of the cloud has transformed the way we create and operate software in the last decade. In particular, the way we manage infrastructure that runs applications has changed. Creating cloud infrastructure on demand and discarding it hours or days later has become a common approach, especially for test environments.

Two characteristics of the cloud, in particular, have accelerated this change:

Elasticity is the first characteristic. In the context of the cloud, elasticity is the capability to quickly add resources to or remove them from your infrastructure. Unlike traditional server deployments, clouds allow you to pay for infrastructure by the hour, minute, or even second, which allows for flexibility and encourages different approaches to provisioning infrastructure.

Self-service is the second characteristic. All of the major cloud vendors provide their users with graphical user interfaces (GUIs), command-line interfaces (CLIs), and APIs that they can use to create, update, and remove instances of the services they make available. Nowadays, the major cloud providers follow an API-first strategy; one outcome of this is that every operation is available through their management APIs, not just through the user interface or other tools.

The combination of these characteristics causes us to treat cloud infrastructure differently than traditional on-premises infrastructure. Spinning up complete configurations spanning tens of services can now be done in a matter of minutes. You can do this either using the major cloud providers’ portals or by using scripts in your deployment pipelines.

However, using portals or CLIs for this has a downside: it is challenging to manage your cloud infrastructure reliably over time. Typical problems include changes that are not completely tracked, developers needing to access production environments with personal accounts, and many more. For this reason, another approach to managing infrastructure has become the go-to option for most teams: Infrastructure as Code (IaC).

In this chapter, you’ll learn more about managing cloud infrastructure in general and about the benefits of using IaC over manual and scripted approaches. Then we’ll look at the Azure Resource Manager (ARM), the service that you interact with to manage your infrastructure in Azure, and at a few other tools for managing Azure infrastructure.

1.1 Working with infrastructure

Infrastructure as Code (IaC) is a modern approach for managing infrastructure. Instead of creating and configuring infrastructure manually through graphical interfaces, you describe all infrastructure in configuration files, which are then used to create the infrastructure automatically. For Azure, IaC is written in Azure Resource Manager (ARM) templates or Bicep files, which are submitted to ARM for processing.

When we talk about infrastructure in the context of Azure, we are referring to all Azure resources that you can use as part of your solution architecture. An obvious example would be a virtual machine or storage account, but infrastructure in the context of IaC also includes Service Bus queues, dashboards, app services, and any other deployable Azure resource.

Before we dive into the background of ARM and the benefits of IaC, let’s look at an example. Figure 1.1 shows a small snippet of an ARM template and how it can be used to create Azure resources, like an Azure storage account.

Figure 1.1 From ARM template to Azure infrastructure

ARM templates (at the left of figure 1.1) are formal descriptions of what infrastructure needs to exist and how it is configured. These templates are then applied to an Azure environment, creating the infrastructure described (at the right of figure 1.1). If a resource with the specified name and type already exists, its configuration is updated instead.
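Because figure 1.1 itself is not reproduced here, the following sketch shows what a minimal ARM template for a storage account can look like. It is written in Python only so that it can be generated and checked programmatically; the printed JSON is the template itself. The account name, location, and API version are illustrative assumptions, not values taken from the figure.

```python
import json

# A minimal ARM template, built as a Python dict and printed as JSON.
# The storage account name and location are hypothetical placeholders.
template = {
    "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
    "contentVersion": "1.0.0.0",
    "resources": [
        {
            "type": "Microsoft.Storage/storageAccounts",
            "apiVersion": "2021-04-01",
            "name": "examplestorage001",  # hypothetical; must be globally unique
            "location": "westeurope",
            "sku": {"name": "Standard_LRS"},
            "kind": "StorageV2",
        }
    ],
}

print(json.dumps(template, indent=2))
```

Saved to a file such as azuredeploy.json (a hypothetical name), a template like this could be deployed to an existing resource group with `az deployment group create --resource-group <name> --template-file azuredeploy.json`. Deploying it a second time would update the same storage account rather than create a new one, because the name and type match.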

As you have already seen, the characteristics of the public cloud encourage the use of IaC, but that’s not the only reason for using IaC. Two other drivers are the DevOps culture and the desire to prevent configuration drift. The next sections discuss these two topics in detail.

1.1.1 DevOps

DevOps is a cultural movement that aims at breaking down the traditional barriers between development and operations teams. In a traditional organization with operations and development teams, the two types of teams have clear responsibilities.

Figure 1.2 shows what this looks like. Here you see a dedicated operations team that manages infrastructure and other runtime components. A separate development team writes updates and hands them over to operations for deployment. If such an update requires a change in the infrastructure, this has to be requested beforehand, often well in advance. These infrastructure changes have to be coordinated between teams and are often slow to complete.

Figure 1.2 Development and operations teams coordinating on infrastructure changes

In many organizations, the opposing goals of these teams or even of complete departments lead to unhealthy situations, such as these:

Of course, this causes problems for the organization as a whole, which is served best by controlled, well-coordinated changes that implement new requirements while the existing infrastructure and applications keep running smoothly.

The DevOps movement advocates that developers and operators should work together in a single team toward this shared goal: the continuous delivery of high-quality software that creates business value. The subgoals of stability and change should be committed to by this single team that combines operations and development expertise. While doing this, a DevOps team often adopts development practices to perform operational duties.

In practice, this means that the new, combined DevOps team is responsible for creating its own infrastructure (see figure 1.3). Often this also means that IT professionals start to apply development techniques to their day-to-day work. They move away from working in the user interface and applying and verifying changes manually, and adopt advanced text editors, automated installation scripts, and IaC. IaC allows developers and operators to work together to describe and configure any infrastructure needed by their application deployment. Together they can promote the infrastructure changes and the application artifacts to a test environment and, after verification, to a production environment.

Figure 1.3 A DevOps team is aligned with the applications and infrastructure it is responsible for managing.

1.1.2 Preventing configuration drift

Besides self-service APIs and DevOps, another driver for the adoption of IaC is the prevention of a problem called configuration drift. Configuration drift is a phenomenon observed on infrastructure that is managed manually. It doesn’t matter whether it is managed through the command line or a graphical interface; configuration drift can happen in both cases.

Configuration drift refers to differences that develop over time on either of two dimensions:

To see how configuration drift can occur, imagine two environments, one for testing and one for production, each consisting of identically configured virtual machines (VMs). The two environments should be configured in precisely the same way, because the test environment is also used for load and stress testing.

Figure 1.4 illustrates two types of configuration drift. First, there is an unintended difference between the test and the production environment, as the production VMs have more memory allocated than the test VMs. Second, there is a drift within the production environment, as one of the VMs has four cores instead of the desired two.

Figure 1.4 Two types of configuration drift: between environments and within an environment

Configuration drift is often the result of an unexpected, incomplete, or incorrectly executed change. When a change is required to the configuration of any infrastructure component, that change must be applied to each instance of the infrastructure, one by one. But other things can happen as well:

Differences between environments like these can cause future problems. For example, test results from the test environment will no longer be representative of how a particular change will affect the production environment. Given enough time, configuration drift will affect any environment and result in unpredictable behavior. IaC can help remediate configuration drift by re-applying the infrastructure specification regularly. Because all settings are stored in source control and applied automatically, any deviation from the settings described in the specification is corrected the next time the specification is applied.
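The two kinds of drift from figure 1.4, and the effect of re-applying a specification, can be illustrated with a small Python sketch. The core and memory values are hypothetical (the figure’s exact memory sizes are not given in the text), and the `reapply` function only simulates what an IaC tool does when it enforces the desired state.

```python
# Desired configuration for every VM in both environments (hypothetical values).
DESIRED = {"cores": 2, "memory_gb": 8}

# Observed configuration per environment. The numbers are made up; they only
# reproduce the two kinds of drift described for figure 1.4.
actual = {
    "test": {
        "vm1": {"cores": 2, "memory_gb": 8},
        "vm2": {"cores": 2, "memory_gb": 8},
    },
    "production": {
        "vm1": {"cores": 2, "memory_gb": 16},  # drift between environments
        "vm2": {"cores": 4, "memory_gb": 16},  # plus drift within production
    },
}


def find_drift(desired, environments):
    """Return (environment, vm, setting, expected, found) for every mismatch."""
    drift = []
    for env_name, vms in environments.items():
        for vm_name, settings in vms.items():
            for key, expected in desired.items():
                found = settings.get(key)
                if found != expected:
                    drift.append((env_name, vm_name, key, expected, found))
    return drift


def reapply(desired, environments):
    """Simulate re-applying the specification: reset every VM to the desired state."""
    for vms in environments.values():
        for settings in vms.values():
            settings.update(desired)


for item in find_drift(DESIRED, actual):
    print("drift:", item)

reapply(DESIRED, actual)
print("after re-applying:", find_drift(DESIRED, actual))  # -> []
```

Because both environments are compared against the same specification rather than against each other, the sketch catches drift between environments and drift within one environment in a single pass.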

We’ve mentioned three main drivers for using IaC—namely the cloud, DevOps, and the prevention of configuration drift—but there are still other benefits of IaC. Let’s take some time to explore the benefits IaC offers over managing infrastructure manually or through scripts.

1.2 The benefits of Infrastructure as Code

Once a team moves to IaC, often because of one of the drivers we’ve already discussed, they will also start observing other benefits. As with many developments in our field, this change will not only help to overcome existing problems but will also inspire new ways of working.

These are three common benefits:

The next three subsections discuss these benefits in turn.

1.2.1 IaC allows for automation

As you may have guessed by now, IaC is applied using tools, and tools imply automation. This delivers two additional benefits, besides saving time: guaranteed outcomes and environment reproducibility.

Guaranteed outcomes

Automatically creating and configuring environments not only saves time, it also provides guaranteed outcomes. When creating and configuring an Azure virtual machine manually, there are tens if not hundreds of configuration options that have to be checked. In practice, this is very error-prone work, and mistakes are very likely to happen. Asking five different people to create an Azure VM with 4 CPU cores and 8 GB of memory, running Windows Server 2019 Datacenter, will most likely result in five virtual machines that are all configured differently.

With IaC, this is not the case. After you describe the desired infrastructure in a code file, the same file can be applied repeatedly, and the IaC tool ensures that the outcome is the same every time. Manually verifying the configuration after every deployment is no longer necessary. This not only saves a lot of time, it also improves quality.
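The idea of guaranteed, repeatable outcomes can be illustrated with a toy model of a declarative apply step. This is a simplified Python sketch, not how ARM is implemented: resources are keyed by a (type, name) pair, the VM key is hypothetical, and the function only creates or updates resources. Applying the same desired state twice shows that the second run finds nothing to change.

```python
from copy import deepcopy


def apply(desired, environment):
    """Toy declarative apply: create missing resources and update differing ones.

    Resources are keyed by (type, name), similar to how ARM decides whether a
    resource already exists. Returns the actions that were taken.
    """
    actions = []
    for key, props in desired.items():
        if key not in environment:
            actions.append(("create", key))
        elif environment[key] != props:
            actions.append(("update", key))
        else:
            continue  # already in the desired state: nothing to do
        environment[key] = deepcopy(props)
    return actions


VM = ("Microsoft.Compute/virtualMachines", "vm1")  # hypothetical resource
desired = {VM: {"cores": 4, "memory_gb": 8}}

env = {}
print(apply(desired, env))  # first run: [('create', ('Microsoft.Compute/virtualMachines', 'vm1'))]
print(apply(desired, env))  # second run: []
```

This property, where applying the same definition again leaves an already correct environment untouched, is usually called idempotency.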

Environment reproducibility

Once an IaC file is written, the cost of creating the described infrastructure is almost zero. It is just a matter of starting the tool, and the required infrastructure resources are created and available a few minutes later. This unlocks all kinds of new approaches to testing, deploying, and running infrastructure.

Just the ability to automatically remove development and test environments at 6 P.M. and re-create them automatically at 7 A.M. on working days can save organizations anywhere between 30% and 60% of their infrastructure costs, compared to keeping infrastructure running 24/7.

Also, if you have ever been responsible for test infrastructure, you’ll know how hard it is to keep test infrastructure usable. Every test failure can pollute the infrastructure and trigger false test failures in the future, due to the inconsistent state of the previous run. Just imagine the possibility of creating new infrastructure, in a guaranteed state, before each test run starts. And all of this at no additional cost. The reduction in false test failures will save a lot of time, money, and negative energy spent by you and your team.

1.2.2 IaC allows for a declarative approach

IaC can be written in two different styles: declarative and imperative. With the declarative style, the source files describe the desired state of the infrastructure. The execution engine is then responsible for comparing the desired state with the actual state, determining the differences, and identifying and executing a series of commands to make the actual state correspond to the desired state.

This approach is similar to Structured Query Language (SQL). You can use SQL to describe which records should or should not be in your result, rather than having to specify the commands to execute. The database engine is then responsible for determining which commands should be executed to reach that desired result.

With the imperative style, you do not describe the intended end result but instead describe the series of commands, steps, or program code to execute.

Note The term Infrastructure as Code is also used for approaches where scripts are stored in source control. While this is a correct use of the term, most IaC approaches, including ARM templates, use a declarative approach.

The first benefit of a declarative approach is that it enhances both the ease of writing and the ease of reading. Writing in a declarative style is easier, because the writer does not have to worry about how the infrastructure is created. They just need to describe what is needed in the end, and the tool translates this into the how. This applies both to when infrastructure is created the first time and when infrastructure configuration is updated. In an imperative approach, this would result in a lot of if-then-else coding; in a declarative approach, if-then-else is not necessary. As an example, see these declarative statements:

There should be a car
The car should be green
The car should have four wheels

Compare that with these imperative statements:

If there is no car
        Create a car
If the car is not green
        Make the car green
While the car has more than four wheels
        Remove a wheel from the car
While the car has fewer than four wheels
        Add a wheel to the car

As this example shows, the declarative style improves the ease of writing and also enhances reading ease, as it focuses solely on the desired state.
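To make the split between the what and the how concrete, here is a Python sketch of a toy execution engine for the car example. The caller only supplies the declarative description; the `plan` function derives the imperative steps. Representing a car as a dict, and assuming a newly created car has no color and no wheels, are choices made for this illustration only.

```python
def plan(desired, actual):
    """Turn the declarative car description into imperative steps.

    `actual` is None when there is no car yet. The caller states *what* it
    wants; this toy engine works out *how* to get there.
    """
    steps = []
    if actual is None:
        steps.append("create a car")
        actual = {"color": None, "wheels": 0}  # assumption: a new car is bare
    if actual["color"] != desired["color"]:
        steps.append(f"make the car {desired['color']}")
    missing = desired["wheels"] - actual["wheels"]
    if missing > 0:
        steps += ["add a wheel"] * missing
    elif missing < 0:
        steps += ["remove a wheel"] * -missing
    return steps


desired_car = {"color": "green", "wheels": 4}  # the declarative statements

print(plan(desired_car, None))
print(plan(desired_car, {"color": "red", "wheels": 5}))    # ['make the car green', 'remove a wheel']
print(plan(desired_car, {"color": "green", "wheels": 4}))  # []
```

All of the if-then-else logic lives in the engine, so the declaration stays three lines long no matter what state the car is in.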

The second benefit of a declarative approach is that the execution engine can be optimized without impacting the IaC declarations. In the similar case of SQL statements, SQL database engines have seen significant changes and optimizations over the last few decades, but most SQL statements written years ago still execute without any changes.

1.2.3 IaC provides a human-readable format

The third benefit of IaC is that it uses human-readable formats. Some IaC tools use JSON or YAML, and others use a custom domain-specific language (DSL) or an existing programming language. Azure Resource Manager templates use JSON, which stands for JavaScript Object Notation. This human-readable format provides us with a version-controllable, auditable, and reviewable definition of application infrastructure. On top of ARM templates, Microsoft has introduced Bicep, an easier and more approachable DSL that is compiled into ARM templates.

Version controllable

Human-readable, non-binary files can be stored in a source control system, just like source code for an application. Source control systems provide users with a centralized, single source for the latest version of a file, along with a full history of all changes. Gone are the days when you had to record all the infrastructure changes manually to go back and find out which changes were made when, by whom, and why. With source control, you automatically have the complete change history readily available. Another consequence of this is that if there is ever the need to roll back a change, the previous configuration can quickly be restored.

Auditable

IaC files are readable and all changes are recorded in source control, which makes them instantly auditable by security reviewers, external auditors, and any other party interested in the changes you are making. Source control provides a full audit log of all the changes made and by whom.

Reviewable

Source control systems allow you to automatically enforce standards before any change is made final. This can include automated formatting checks, automated build checks, or even enforced peer reviews—this functionality is built into most source control systems.

Now that you know about the extra benefits you can get with IaC, let’s turn to the Azure Resource Manager. Azure Resource Manager is Azure’s service for working with IaC.

1.3 The Azure Resource Manager

We’ve discussed the drivers and benefits for IaC, so it’s now time to dive a bit deeper into the IaC system for Azure. The first thing to understand here is that all Azure infrastructure management is done using the Azure Resource Manager (ARM). ARM is a RESTful HTTP API that you can call to list, create, update, and delete all resources in your Azure subscriptions. If you interact with Azure through the portal, the CLI, or Azure PowerShell, you are also using ARM under the hood.

ARM is the basis for the Azure IaC capabilities provided via ARM templates. ARM is the execution engine for IaC. But before we dive into ARM templates, it is important to know what the control plane and data plane are, how they differ, and what you can and can’t do with ARM templates.

1.3.1 Control plane and data plane

Each interaction you have with Azure is either a control plane or a data plane operation. Simply put, you use the control plane to manage resources in your subscription, and you use the data plane to employ the capabilities exposed by your instances of specific resource types. In Azure, there is a single, unified control plane: the Azure Resource Manager.

To make the difference between the control plane and data plane clearer, here are a few examples:

Requests sent to the control plane are all sent to the Azure Resource Manager URL; for the global cloud, that is https://management.azure.com. From this URL, it is possible to build complete URLs that identify any Azure resource. For example, the following pseudo URL points to a virtual machine:

GET https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/
     {resourceGroupName}/providers/Microsoft.Compute/virtualMachines/
     {virtualMachineName}?api-version=2021-04-01

Suppose you send an authenticated GET request to this URL with valid values for subscriptionId, resourceGroupName, and virtualMachineName. The response would be a JSON description of the virtual machine. If you study the response in detail and compare it to the definition of a virtual machine in an ARM template, you’ll quickly notice that they are largely the same (with only a few default properties omitted).
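The pseudo URL above follows a fixed pattern, so it can be assembled from its parts. The following Python sketch builds it for a virtual machine; the function name and the placeholder values are hypothetical, and the default API version is the one shown in the pseudo URL.

```python
from urllib.parse import quote

ARM_ENDPOINT = "https://management.azure.com"  # global Azure cloud


def vm_resource_url(subscription_id, resource_group, vm_name, api_version="2021-04-01"):
    """Build the control plane URL that identifies a single virtual machine."""
    path = (
        f"/subscriptions/{quote(subscription_id)}"
        f"/resourceGroups/{quote(resource_group)}"
        f"/providers/Microsoft.Compute/virtualMachines/{quote(vm_name)}"
    )
    return f"{ARM_ENDPOINT}{path}?api-version={api_version}"


# Placeholder values; substitute your own subscription, group, and VM name.
print(vm_resource_url("00000000-0000-0000-0000-000000000000", "rg-demo", "vm-demo"))
```

The URL alone does not get you a response: ARM expects an access token in the Authorization header. When you are signed in with `az login`, the Azure CLI command `az rest --method get --url "<url>"` adds that token for you.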

Interactions with a resource on the data plane always happen on an endpoint specific to that resource. This means that data plane operations are not limited to REST calls over HTTPS but can also use other protocols, such as FTP or AMQP. Interactions with the control plane happen through the ARM APIs or through ARM templates.
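A quick way to see the difference is to look at where a request is sent. The following Python sketch uses a deliberately simplified rule: anything addressed to the global ARM host is a control plane call, and everything else is treated as data plane traffic. The storage account name mystorage is hypothetical, and sovereign clouds use other management hosts that this sketch ignores.

```python
from urllib.parse import urlparse

# Control plane host for the global Azure cloud; sovereign clouds use other hosts.
CONTROL_PLANE_HOST = "management.azure.com"


def plane(url):
    """Classify a request URL as a control plane or data plane call."""
    return "control" if urlparse(url).hostname == CONTROL_PLANE_HOST else "data"


# Managing the (hypothetical) storage account "mystorage" goes through ARM ...
print(plane("https://management.azure.com/subscriptions/<sub>/resourceGroups/<rg>"
            "/providers/Microsoft.Storage/storageAccounts/mystorage?api-version=2021-04-01"))
# ... while reading one of its blobs goes to the account's own endpoint.
print(plane("https://mystorage.blob.core.windows.net/container/file.txt"))
```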

1.3.2 ARM templates

The ARM APIs can be used to manage infrastructure in an imperative style, using provisioning scripts. If you prefer a declarative style, ARM templates are available.

ARM templates are written in JSON or Bicep and are used for any of the following purposes:

If you already have a basic understanding of the Azure hierarchy, the preceding list will show that you can completely manage Azure using ARM templates. If you don’t understand all the terms mentioned here, don’t worry—all these concepts will be explained in more detail in chapter 3.

While ARM templates are compelling and they allow you to manage all of Azure, an often-heard complaint is that they can be challenging to write and pretty verbose to read. To provide a solution to this, Microsoft recently launched project Bicep.

1.3.3 The Bicep language

ARM templates are written as JSON files, but one of the disadvantages of JSON is that it can become quite lengthy when expressing complex structures. This lengthiness can make files difficult to maintain or read. Another downside of JSON is that there is no out-of-the-box support for control structures like loops or conditions. While ARM provides workarounds for this, ARM templates do take a while to master.

To provide a solution to these problems, Microsoft has introduced a new domain-specific language (DSL) as an alternative way to write ARM templates. This DSL is called Bicep, a play on the name ARM. Chapter 6 discusses Bicep in depth.

1.3.4 Azure Service Management (ASM is not ARM)

Before the existence of the Azure Resource Manager, another system was available for managing resources within Azure: Azure Service Management (ASM). ASM is now a legacy system, but it is good to know of its existence and how it differs from ARM. Even if you only use that knowledge to detect and discard outdated online content, it is worth it.

Microsoft introduced Azure Service Management (ASM) as part of the Azure cloud (then still named Windows Azure) around 2009. ASM was the first HTTP interface provided for managing Azure resources. Before that, while Azure was still in preview, the management of resources was only possible using a web interface now called the classic portal. Looking back, ASM was the first iteration of an interface for managing Azure resources.

ASM has no built-in support for IaC and is rarely used in production nowadays. Still, it is good to know what ASM is and to stay away from anything related to it. While the names Azure Resource Manager and Azure Service Management may look similar at first sight, they are nothing alike.

Drawbacks of Azure Service Management

The lack of support for IaC was not the only reason Microsoft replaced ASM. Other drawbacks include the lack of grouping options for resources, no options for managing authorizations at the individual resource level, the lack of a fine-grained permission set, and many more.

Azure Resource Manager and its ARM templates are the built-in approach for managing infrastructure within Azure. But there are also other tools available for IaC, both on Azure and in other public clouds. The next section describes some of them to help you build a broader understanding of IaC.

1.4 Other tools

ARM templates are just one of many IaC approaches available. This section will explore a few other well-known tools to help you understand which tools are available and which one makes sense in which situations.

Note Our focus here is on tools that can be used for IaC in cloud environments. There are other tools available for managing state within virtual machines, such as PowerShell DSC, Puppet, Chef, and Ansible. We won’t be discussing those here.

When considering IaC tools for the cloud, one characteristic is the most important: is the tool single-cloud or multi-cloud? When you’re working in only one cloud, you can consider using the IaC tool specifically intended for that cloud. For Azure, you can use ARM templates; for Amazon Web Services (AWS), you can use CloudFormation; and for Google Cloud Platform, there is Google Cloud Deployment Manager. Alternatively, there are multi-cloud options like Terraform or Pulumi. While these tools allow you to manage multiple environments or clouds from a single IaC configuration, it is also possible to use them when you’re only working with Azure.

Multi-cloud or single-cloud

There is much debate around the topic of multi-cloud strategies. There are both pros and cons for working with only a single cloud provider or working with more than one vendor. This discussion is out of scope for this book, but when weighing your options and determining your strategy, you should consider your IaC options.

We’ll look at all these tools in the next few sections.

1.4.1 AWS CloudFormation

CloudFormation is an AWS service for managing infrastructure. Each deployment of a group of resources is called a stack. A stack is a persistent grouping of resources within a single AWS account and region; to deploy the same template across multiple accounts or regions, CloudFormation provides StackSets. When you redeploy a template to the same stack, the existing resources in the stack are updated to match the template. CloudFormation also deletes resources that are part of the stack but no longer part of the template. Overall, CloudFormation templates are very comparable to ARM templates when it comes to their layout and capabilities.
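The update-and-delete behavior of a stack can be modeled in a few lines of Python. This is a toy model of the semantics described above, not of how CloudFormation is implemented; the logical resource names and properties are hypothetical.

```python
def deploy_stack(stack, template):
    """Toy model of redeploying a template to an existing stack.

    `stack` and `template` map logical resource names to properties.
    Resources in the template are created or updated; resources that exist
    in the stack but are no longer in the template are deleted.
    """
    actions = []
    for name in sorted(stack.keys() - template.keys()):
        actions.append(("delete", name))
        del stack[name]
    for name, props in template.items():
        if name not in stack:
            actions.append(("create", name))
        elif stack[name] != props:
            actions.append(("update", name))
        stack[name] = props
    return actions


stack = {"Bucket": {"versioning": False}, "Queue": {}}
print(deploy_stack(stack, {"Bucket": {"versioning": True}}))
# [('delete', 'Queue'), ('update', 'Bucket')]
```

ARM behaves differently by default: in incremental mode, which is the default deployment mode, resources in the resource group that are not in the template are left untouched. Only complete mode deletes them, which comes closest to the stack behavior modeled here.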

1.4.2 Google Cloud Deployment Manager

Google Deployment Manager is the built-in approach to IaC for the Google Cloud Platform (GCP). To deploy a simple set of resources, YAML is used in a very similar layout and style to CloudFormation or ARM templates. However, the Deployment Manager’s YAML configuration is more limited, as it does not allow for parameters, variables, and outputs, like CloudFormation and ARM templates do.

For more advanced features, Deployment Manager allows you to write reusable templates using Python (preferred) or Jinja2. These templates are imported into the YAML configuration files and deployed from there. When using Python, the full power of the language can be used, including loops, conditionals, and external libraries, to build and return an array of resources. Note that doing so removes the declarative nature from templates.
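A Deployment Manager Python template is a module with a `GenerateConfig(context)` function that returns a dictionary containing a list of resources. The sketch below uses an ordinary loop to produce several buckets. The `bucketCount` property, the bucket location, and the naming scheme are hypothetical, and `FakeContext` is only a local stand-in for the context object that Deployment Manager passes in at deployment time, so the function can be run outside Google Cloud.

```python
def GenerateConfig(context):
    """Deployment Manager Python template: build resources with a plain loop.

    The 'bucketCount' property is a hypothetical example parameter.
    """
    resources = []
    for i in range(context.properties["bucketCount"]):
        resources.append({
            "name": f"{context.env['deployment']}-bucket-{i}",
            "type": "storage.v1.bucket",
            "properties": {"location": "EU"},
        })
    return {"resources": resources}


class FakeContext:
    """Local stand-in for the context object Deployment Manager provides."""

    def __init__(self, properties, env):
        self.properties = properties
        self.env = env


print(GenerateConfig(FakeContext({"bucketCount": 2}, {"deployment": "demo"})))
```

The loop runs when the template is expanded; what Deployment Manager finally deploys is the static list of resources the function returns.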

1.4.3 Terraform

HashiCorp has developed an IaC tool called Terraform. Terraform was released as open source (in 2023, HashiCorp moved it to the Business Source License) and is based upon a split between the DSL used for declaring resources and the so-called providers that specify which resources are available for use. The DSL used by Terraform is called HashiCorp Configuration Language (HCL), which defines the structure, syntax, and semantics of Terraform configuration files.

Terraform providers are available for all major cloud providers and other target platforms and tools, including VMware, Azure DevOps, GitLab, Grafana, and many more. Another thing that differs between ARM templates and Terraform is that Terraform uses a state file.

A state file or cache

What ARM, CloudFormation, and Deployment Manager have in common is that they operate on the difference between the desired state (the template) and the actual state of the resources. The changes they make to the cloud environment are determined by comparing these two.

Another group of IaC tools operates on the difference between the desired state and a state file. A state file is a file or cache that captures what the tool believes the cloud environment’s state is after the previous deployments. The changes it makes to the cloud environment are determined by comparing these two.

IaC tools use a state file to quickly decide which changes should be made without querying the complete actual state from the cloud environment. The risk of this approach is that there might be mismatches between the state file and the actual state, resulting in an incorrect execution. To counter this, tools that use a state file often allow for updating the state file from the actual state.
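The risk of a stale state file, and the effect of refreshing it, can be shown with a small Python sketch. It is a toy model, not how any particular tool implements state handling: it only detects changed or missing resources, and the VM name is hypothetical (the two VM sizes are regular Azure sizes used as example values).

```python
def plan_from_state(desired, state):
    """Plan changes by diffing the desired config against the state file only."""
    changes = {}
    for name, props in desired.items():
        if state.get(name) != props:
            changes[name] = props
    return changes


def refresh(state, actual):
    """Overwrite the cached state with what really exists in the cloud."""
    state.clear()
    state.update(actual)


desired = {"vm1": {"size": "Standard_B2s"}}
state = {"vm1": {"size": "Standard_B2s"}}    # what the tool believes exists
actual = {"vm1": {"size": "Standard_B4ms"}}  # someone resized the VM by hand

print(plan_from_state(desired, state))  # {} : the manual change goes unnoticed
refresh(state, actual)
print(plan_from_state(desired, state))  # {'vm1': {'size': 'Standard_B2s'}}
```

Planning against the state file is fast because it never queries the cloud, but only the refreshed state reveals the manual resize.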

For Azure, there is a Terraform provider developed by Microsoft. This Terraform provider is almost as feature-complete as ARM templates, but it can sometimes still lag in functionality. The reason for this is straightforward: ARM templates use built-in functionality, while functionality needs to be explicitly added to the Terraform provider.

1.4.4 Pulumi

Pulumi differs from most other IaC tools in that it doesn’t use YAML, JSON, or a DSL, but actual program code for defining infrastructure. Pulumi has language support for Node.js, Python, .NET Core, and Go. Cloud-wise, there is support for Azure, AWS, and GCP. Using one of the supported languages, you construct a model that represents the desired infrastructure stack. This model, the declaration that your program produces, is the input that the Pulumi engine then executes.

One of the significant advantages of using an existing programming language for defining infrastructure is that all of the tools and technologies surrounding that programming language are also available for your infrastructure definition. The most prominent example of this is the ability to run unit tests against the infrastructure definition.
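The following Python sketch shows what unit testing an infrastructure definition can look like with the standard `unittest` module. It deliberately does not use the Pulumi SDK: `build_storage_accounts` is a hypothetical function that returns plain dicts, standing in for code that would create Pulumi resources in a real program. The rules being tested (geo-redundant storage in production, HTTPS everywhere) are also only examples.

```python
import unittest


def build_storage_accounts(environment):
    """Hypothetical infrastructure definition returning the desired storage accounts.

    A real Pulumi program would create Pulumi resource objects here; plain
    dicts are used so that the sketch runs without any SDK installed.
    """
    production = environment == "production"
    count = 3 if production else 1
    sku = "Standard_GRS" if production else "Standard_LRS"
    return [
        {"name": f"st{environment}{i}", "sku": sku, "https_only": True}
        for i in range(count)
    ]


class StorageDefinitionTests(unittest.TestCase):
    def test_production_is_geo_redundant(self):
        for account in build_storage_accounts("production"):
            self.assertEqual(account["sku"], "Standard_GRS")

    def test_every_account_enforces_https(self):
        for env in ("test", "production"):
            for account in build_storage_accounts(env):
                self.assertTrue(account["https_only"])


if __name__ == "__main__":
    suite = unittest.TestLoader().loadTestsFromTestCase(StorageDefinitionTests)
    unittest.TextTestRunner(verbosity=2).run(suite)
```

Because the definition is ordinary code, these tests run in milliseconds, without deploying anything.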

Besides supporting all Azure resources, including Azure policies, Pulumi also has a built-in policy engine. This engine allows the use of a single policy engine for more than one cloud. The advantage of this is that you have a single entry point for all policy evaluations. The disadvantage is that the policies are only executed during deployment and not continuously in a deployed environment. Azure Policy, which is the topic of chapter 12, does allow for this continuous evaluation.
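The idea of a deployment-time policy engine can be sketched in Python as a set of rules that run against the planned resources before anything is deployed. This is not Pulumi’s actual policy API; the two rules, the allowed locations, and the resource fields are hypothetical.

```python
# Hypothetical policy rules: each returns an error message or None.
def require_https(resource):
    if resource.get("type") == "Microsoft.Storage/storageAccounts" and not resource.get("https_only"):
        return f"{resource['name']}: storage accounts must enforce HTTPS"
    return None


def allowed_locations(resource, allowed=("westeurope", "northeurope")):
    if resource.get("location") not in allowed:
        return f"{resource['name']}: location {resource.get('location')!r} is not allowed"
    return None


POLICIES = [require_https, allowed_locations]


def check_before_deploy(resources):
    """Evaluate every policy against every planned resource."""
    violations = []
    for resource in resources:
        for policy in POLICIES:
            message = policy(resource)
            if message is not None:
                violations.append(message)
    return violations


planned = [
    {"name": "st1", "type": "Microsoft.Storage/storageAccounts", "location": "westeurope", "https_only": True},
    {"name": "st2", "type": "Microsoft.Storage/storageAccounts", "location": "eastus", "https_only": False},
]

violations = check_before_deploy(planned)
if violations:
    print("deployment blocked:")
    for violation in violations:
        print(" -", violation)
```

The check only runs when it is invoked as part of a deployment. If st1 were later switched to HTTP in the portal, nothing here would notice; that gap is what continuous evaluation by Azure Policy closes.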

1.4.5 Choosing between cloud-specific and multi-cloud solutions

When you are consistently working across more than one cloud, you have to choose between using two or more cloud-specific solutions or a single multi-cloud IaC solution.

Cloud-specific solutions often have deeper integration with the underlying platform and provide unique benefits that multi-cloud solutions might not. The downside of using more than one solution is the increased number of tools. On the other hand, multi-cloud solutions can offer specific options that cloud-specific solutions do not. As an example, look at the policy engine that Pulumi offers. In the end, it is up to you to weigh both alternatives’ pros and cons and make the best decision for your context.

Summary