Olaf Zimmermann and Mirko Stocker
Co-authors of MAP, DPR, IRC

What is a Cloud-Native Application Anyway? From Analysis to Synthesis

Aug 20, 2021 (Updated: Jan 5, 2022)
Reading time: 10 minutes

Content Outline

Kickoff: Twelve Definitions Distilled
- IDEAL Decomposition, State and Systems Management
Iteration 1: SUPER-IDEAL (Makes Cloud-Native Real)
- SUPER Properties Complementing IDEAL
Iteration 2: From Ten Properties to Seven Traits
Big Picture and Application Examples
Retrospective: Summary and Conclusions
- Acknowledgment
- Notes

In a previous post, we collected twelve definitions of cloud-native. In this post, we derive yet another one from those (and our own experience). Two, actually (“How standards proliferate”).

Kickoff: Twelve Definitions Distilled

IDEAL Decomposition, State and Systems Management

The IDEAL application properties, first postulated by Christoph Fehling and his co-authors of “Cloud Computing Patterns” around 2013, are:

Isolated State, Distribution, Elasticity, Automated Management, Loose Coupling.

Please refer to “What is a Cloud-Native Application Anyway? 12 Definitions Distilled” for explanations.

Iteration 1: SUPER-IDEAL (Makes Cloud-Native Real)

An IDEAL design can hardly be wrong… but maybe there are more desired and defining properties? Let us now add complementary, recurring messages from later definitions.

SUPER Properties Complementing IDEAL

Five more properties also make an application feel at home in the cloud:

Secure and therefore protected
Updatable, including disposable
Polyglot as far as technology is concerned (platform-independent and multi-protocol)
Explicit in terms of its contract and dependencies
Resilient and robust, including observable and monitored¹

Observability and monitoring can be seen to be covered by the A in IDEAL as well.²

Establishing principles and/or calling for certain properties is somewhat easier than satisfying them. Here are some initial thoughts and pointers on how to achieve the five SUPER properties:

Secure and protected: “Secure platform” and “secure by default” are covered in Part 4 of a blog post series by Kyle Brown and Kim Clark; Part 3 has “zero trust”. The OWASP Top 10 lists, including the one on API Security, identify related threats and suggest countermeasures; one of them is described as a “standard awareness document for developers and web application security”. There is much more to say; cloud application and infrastructure security are fields in their own right.
Updatable including disposable (so retirement as an extreme form of update): Several of the sources listed in the previous post discuss this property, for instance IX. Disposability in the Twelve-Factor App and Part 1 of “What Does Cloud Native Really Mean?”. It is important to be instantly responsive to management commands such as start, stop, re-initialize, log on/off.
Polyglot: The design of a Cloud-Native Application (CNA) should work consistently when deploying to multiple providers or a Hybrid Cloud. Switching from one provider to another should not be complicated by the opportunistic use of proprietary APIs (for instance, directly calling and not wrapping them). Patterns in general are your friend when designing platform-independently; additional advice on types of vendor lock-in (and how to avoid it) is available here.
Explicit contract and dependencies: Part 1 of “What Does Cloud Native Really Mean?” motivates this trait. The software architecture and API design and management communities also have a lot to say about it. For instance, elaborate API descriptions including technical service contracts, both human- and machine-readable, should be in place to promote interoperability and to improve the developer experience. A clearly communicated API evolution path including a life cycle management strategy should exist. Package managers and related tools that crawl the Web (or internal repositories) to find required libraries should validate and report what they find (for instance, w.r.t. open source licenses) before installing anything, rather than installing it silently.
Resilient and robust: “Release It!” by Michael Nygard shares related advice (which partially predates the cloud age, but still is very valid). “Site Reliability Engineering (SRE)” has an important role to play; the SRE page at Google features various videos, books and tutorials on how to build and operate reliable large-scale services.
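To make the advice on wrapping proprietary APIs (see the Polyglot item above) more concrete, here is a minimal Python sketch. All names in it (`BlobStore`, `InMemoryBlobStore`, `archive_report`) are hypothetical and chosen for illustration only; a real adapter would delegate to the SDK of the chosen provider.

```python
from typing import Protocol


class BlobStore(Protocol):
    """Provider-neutral port (hypothetical); application code depends only on this."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class InMemoryBlobStore:
    """Stand-in adapter; a provider-specific adapter would wrap the vendor SDK here."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[key]


def archive_report(store: BlobStore, report_id: str, content: str) -> str:
    """Application logic sees the port only, never a proprietary API."""
    key = f"reports/{report_id}.txt"
    store.put(key, content.encode("utf-8"))
    return key


store = InMemoryBlobStore()
key = archive_report(store, "2021-08", "all sensors nominal")
print(store.get(key).decode("utf-8"))
```

With such a port in place, switching providers should mostly mean writing one more adapter class, assuming the providers offer comparable semantics.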

Coming up with SUPER was an ad-hoc maneuver in search of a backronym.³ Next up is a more systematic summary of our empirical analysis of definitions and community advice.

Iteration 2: From Ten Properties to Seven Traits

We decomposed and then clustered the elements of the twelve definitions from the previous post (see this online whiteboard). This exercise yielded seven defining traits:

The "CNA Seven" as Application Building Blocks

These seven characteristics cover a lot of ground; they take different viewpoints, from external to internal ones and from runtime qualities to build time (process).

1. Fit for Purpose

This is an outside view on scoping and sizing; emphasis is on purpose:

Fit is the opposite of fat. Any software component, configuration element, etc. should be there for a reason found in the application requirements or context. It is tempting to get carried away by the fascinating possibilities of cloud computing and other technologies; working story- or use case-driven, and speaking the language of the users and other external stakeholders while doing so, is one of the practices that distinguish engineering from tinkering.
Analysis and design practices such as Domain-Driven Design, both strategic and tactical, help to gain and keep focus. See the (half-serious) post “Driven By Acronyms” for other “driven” software engineering techniques.
As much state as needed and as little as possible should be kept. This state should be pushed to permanent backing services (database, queue, other) quickly and often so that it can be picked up when resuming work after planned or unplanned suspensions.
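The last point, pushing state to a backing service quickly and often, can be sketched as follows. This is a simplified illustration under stated assumptions: `BackingStore` is a hypothetical in-memory stand-in for a durable database or queue, and the `crash_after` parameter exists only to simulate an unplanned suspension.

```python
class BackingStore:
    """Hypothetical stand-in for a durable backing service (database, queue, ...)."""

    def __init__(self):
        self._data = {}

    def load(self, key, default):
        return self._data.get(key, default)

    def save(self, key, value):
        self._data[key] = value


def process_batch(store, job_id, items, crash_after=None):
    """Sum the items, checkpointing after each one so that work can be resumed."""
    state = store.load(job_id, {"next_index": 0, "total": 0})
    for i in range(state["next_index"], len(items)):
        if crash_after is not None and i == crash_after:
            raise RuntimeError("simulated suspension")
        # Keep only the state that is needed and push it out immediately.
        state = {"next_index": i + 1, "total": state["total"] + items[i]}
        store.save(job_id, state)
    return state["total"]


store = BackingStore()
items = [3, 4, 5]
try:
    process_batch(store, "job-1", items, crash_after=2)  # stops before the third item
except RuntimeError:
    pass
total = process_batch(store, "job-1", items)  # a fresh instance resumes at index 2
print(total)
```

The processing function itself holds no state between calls, so any instance can pick up the job from the last checkpoint.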

The I and the L in IDEAL cover parts of this trait, which also reminds us of sound engineering practices that are just as valid outside the cloud as inside. For instance, “purposeful” also is one of five POINT principles for API design in general.

2. Rightsized and Modular

This trait takes an internal view on application structuring. CNAs should be modularized locally and/or be composed of remotely accessible services to balance deployability, changeability and scalability requirements in their business context and domain.

The following patterns and tactics support such rightsizing and modularization:

Blog posts and presentations featuring modular monoliths or “moduliths” promote information hiding and separation of concerns without physical distribution. Note that the call for modularity has been around since the early days of software engineering.⁴
Component- and service-orientation come with numerous related practices; our Design Practice Repository and Reference (GitHub, Leanpub) collects some that we apply often. System decomposition and service cutting remain challenging though (research opportunities abound).
Message-based remote APIs with clearly defined endpoint roles and well-structured operation responsibilities make applications easier to compose and change. Our Microservice API Patterns and an evolving Interface Refactoring Catalog collect proven API design elements.

Loose coupling between these components/services (as well as interaction of external clients with such components/services) is the L in IDEAL, whose D also covers parts of this trait. One of the autonomy dimensions of loose coupling is platform autonomy; hence, the P in SUPER is related too (see above).

3. Sovereign and Tolerant

With “sovereign” we do not mean fully self-sufficient, but in control of dependencies and local deployment (and other architectural) decisions. To be sovereign, a CNA should also be tolerant and respectful to its neighbors. These neighbors are its inbound and outbound dependencies and communication partners, as well as other tenants of the same cloud.

The following patterns and tactics address this trait:

Virtualization in a broad sense, so including the use of containers and their orchestration, is one way of becoming sovereign and tolerant as far as platform dependencies are concerned.
The well-known CAP theorem captures some of the conflicts and tradeoffs between (strict vs. eventual) consistency, service availability and tolerance of network partitions. Prioritization decisions should be made sensibly and technology chosen accordingly.
The newer Backup, Availability and Consistency (BAC) theorem identifies additional conflicts and tradeoffs. Backing up locally or globally differs in terms of impact on availability and data consistency (for instance, after recovery has taken place).
Idempotence of service APIs is often worth striving for; that is, the receiving parts of an application must be able to deal with duplicate messages, for instance when reading from queues with at-least-once delivery guarantees. Postel’s Law advises us to “be conservative in what you do, be liberal in what you accept from others”.
The provider terms of use must be known and respected; Service Level Agreements (SLAs) have to be accepted.
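To illustrate the idempotence tactic from the list above, the following Python sketch shows one common way for a receiver to cope with at-least-once delivery: remembering the identifiers of messages it has already processed. The message layout (`id`, `amount`) and the class name are assumptions made for this example; in a real CNA the set of seen identifiers would have to live in a backing service rather than in memory.

```python
class IdempotentConsumer:
    """Applies each message at most once, even if the queue delivers it repeatedly."""

    def __init__(self):
        self._seen_ids = set()  # assumption: in production, a durable store
        self.balance = 0

    def handle(self, message: dict) -> bool:
        """Process the message; return False if it was a duplicate."""
        message_id = message["id"]
        if message_id in self._seen_ids:
            return False  # duplicate delivery: acknowledge, but do not apply again
        self.balance += message["amount"]
        self._seen_ids.add(message_id)
        return True


consumer = IdempotentConsumer()
deliveries = [
    {"id": "m1", "amount": 10},
    {"id": "m2", "amount": 5},
    {"id": "m1", "amount": 10},  # redelivered by the queue
]
results = [consumer.handle(message) for message in deliveries]
print(results, consumer.balance)
```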

This trait touches upon several letters in SUPER-IDEAL. For instance, the E in SUPER is a facet of it; pointers to related “how to achieve” resources appear further up.

4. Resilient and Protected

A CNA should be resilient and robust as it processes requests from the outside. The fallacies of distributed computing cannot be argued away.⁵ For instance, CNAs must survive belated data arrival and validate all external input before processing it (or passing it on). They must be further secured to be protected against other threats that have been identified.

The following resources help to make a CNA resilient and protected:

Loose coupling of application parts in the time dimension, achieved through asynchronous communication (for instance, queue-based messaging), makes it possible to survive temporary outages. Temporarily storing external data locally and re-initializing application state from event snapshots are other options to become more robust and less brittle (but come at a price: design and test effort increase, and extra computing resources are required).
Deployment tactics and patterns aiming for high availability are compiled in the Availability and Resilience Perspective by Nick Rozanski and Eoin Woods. Circuit Breakers and Bulkheads may be used to avoid ripple effects in critical situations.
Hyper-scale cloud providers offer region concepts and geographically distributed data storage (and processing). They may support node-based availability and/or environment-based availability, two cloud computing patterns. Deploying to Kubernetes (“K8s”) aims at improving survivability.
A CNA should be secure and compliant by design (“zero trust”). At its interface level, it is critical to mitigate and manage the OWASP “Top 10 Web Application Security Risks” already mentioned above. Industries and organizations typically have their own policies and rules that extend or complement those from OWASP.
Suitable compliance controls should be put in place and then audited; one example is Completeness, Accuracy, Validity and Restricted Access (CAVR) reviews.
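The Circuit Breaker pattern mentioned in the list above can be sketched in a few lines of Python. This is a deliberately simplified, single-threaded illustration and not the implementation from any particular library; the class and parameter names (`CircuitBreaker`, `max_failures`, `reset_after`) are our own assumptions, and the injectable `clock` merely makes the example deterministic.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to wait before a trial call
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("failing fast; dependency presumed down")
            # Otherwise: half-open, let a single trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.max_failures:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        self._failures = 0
        self._opened_at = None  # success closes the circuit again
        return result


now = [0.0]  # fake clock so that the example does not have to sleep
breaker = CircuitBreaker(max_failures=2, reset_after=10.0, clock=lambda: now[0])


def flaky():
    raise ConnectionError("backend unavailable")


outcomes = []
for _ in range(3):
    try:
        breaker.call(flaky)
    except CircuitOpenError:
        outcomes.append("rejected")
    except ConnectionError:
        outcomes.append("failed")

now[0] = 11.0  # the reset interval has elapsed
outcomes.append(breaker.call(lambda: "ok"))
print(outcomes)
```

After two failures the third call is rejected immediately, which protects both the caller and the struggling dependency; once the reset interval has passed, a successful trial call closes the circuit again.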

Chapters 4 and 7 of “Continuous Architecture in Practice” by Murat Erder, Pierre Pureur and Eoin Woods cover security and resilience as dedicated architectural quality concerns.

The explanations of S and R in SUPER provide more pointers (see above).

5. Controllable and Adaptable

This trait has a systems and service management theme. CNAs should be controllable from the outside and adaptable, requiring/including observability and dynamic scalability.

Service and application management can benefit from:

Monitoring and repair patterns such as Watchdog and Resiliency Management Process combine health checks with semi-automatic reactions to critical situations.
Enterprise Integration Patterns in the System Management category may support adaptivity and help to cope with uncertainty. Three of them are Control Bus, Test Message and Channel Purger.
Concepts from control theory such as the Monitor-Analyze-Plan-Execute (MAPE) loop can be applied. For example, the auto scaling capabilities of cloud offerings can be seen as a form of self-adaptation. The articles “Architectural Principles for Cloud Software” and “Controlling the Controllers: What Software People Can Learn From Control Theory” investigate this topic.
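As an illustration of the MAPE loop applied to auto scaling, here is a small Python sketch. It is a toy model under assumptions of our own: CPU utilization in percent is the only metric, the plan step uses a simple proportional rule, and the `execute` callback stands in for whatever scaling API a provider offers. None of the names or default values come from a specific cloud product.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    target_cpu_percent: float = 60.0
    tolerance_percent: float = 10.0
    min_instances: int = 1
    max_instances: int = 10


def monitor(cpu_samples):
    """Monitor: condense raw CPU samples (in percent) into one metric."""
    return sum(cpu_samples) / len(cpu_samples)


def analyze(avg_cpu, policy):
    """Analyze: does the metric deviate enough from the target to act?"""
    return abs(avg_cpu - policy.target_cpu_percent) > policy.tolerance_percent


def plan(avg_cpu, current_instances, policy):
    """Plan: proportional rule, clamped to the allowed instance range."""
    desired = math.ceil(current_instances * avg_cpu / policy.target_cpu_percent)
    return max(policy.min_instances, min(policy.max_instances, desired))


def mape_step(cpu_samples, current_instances, policy, execute):
    """Run one pass through the loop; `execute` applies the planned change."""
    avg_cpu = monitor(cpu_samples)
    if not analyze(avg_cpu, policy):
        return current_instances
    desired = plan(avg_cpu, current_instances, policy)
    if desired != current_instances:
        execute(desired)
    return desired


policy = ScalingPolicy()
applied = []  # stands in for calls to a provider's scaling API
instances = 4
for samples in ([90, 85, 95], [62, 58], [10, 20]):
    instances = mape_step(samples, instances, policy, applied.append)
print(applied, instances)
```

The tolerance band in the analyze step keeps the loop from reacting to small fluctuations, which is one simple way to avoid oscillation.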

This trait corresponds to the E and the A in IDEAL; the U in SUPER gives more pointers.

6. Workload-Aware and Resource-Efficient

CNAs should be frugal (a term from an earlier definition). This includes being aware of cloud rental fees (e.g., CPU, local I/O, network usage) and then making sound decisions that balance cost and benefit w.r.t. functional and non-functional requirements.

The following activities help not to waste resources (both human and system/technical):

Specifying the workload patterns of each feature, component and/or service: Static Workload, Periodic Workload, Continuously Changing Workload, Unpredictable Workload or Once-in-a-Lifetime Workload. Each cloud application part should be designed with this scoping and profiling information in mind.
Measuring resource utilization, using provider dashboards and command line interfaces and/or implementing a “shadow billing” at runtime. Additionally or alternatively, operational expenditures can be predicted by running a suitable proof-of-concept or creating/configuring a cost calculator before starting development. Some cloud providers offer tools supporting these activities.
Applying tradeoff analysis methods such as (light or full) ATAM or TARA to compare design options that recur in CNAs (for instance, those referenced in this post): store vs. get (again), compute vs. store, main memory vs. cloud storage, queue use vs. HTTP and so on. It is possible to integrate the cost impact and consequences of a chosen design (cloud rent but also human resources required, for instance DevOps expert time) into such methods as an additional quality concern.
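A cost calculator of the kind mentioned above does not have to be elaborate to be useful. The following Python sketch estimates monthly expenditure from a workload profile; the price list, the cost drivers and all numbers are hypothetical and do not reflect any provider's actual pricing model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriceList:
    """Unit prices in an arbitrary currency; the numbers used below are made up."""

    per_million_requests: float
    per_gb_second: float  # compute, billed by memory size times duration
    per_gb_month_stored: float


def estimate_monthly_cost(prices, requests, avg_duration_s, memory_gb, stored_gb):
    """Break a workload estimate down into cost drivers and a total."""
    cost = {
        "requests": requests / 1_000_000 * prices.per_million_requests,
        "compute": requests * avg_duration_s * memory_gb * prices.per_gb_second,
        "storage": stored_gb * prices.per_gb_month_stored,
    }
    cost["total"] = sum(cost.values())
    return cost


# Hypothetical prices and workload, for illustration only.
prices = PriceList(per_million_requests=0.25, per_gb_second=0.00002,
                   per_gb_month_stored=0.02)
estimate = estimate_monthly_cost(prices, requests=2_000_000, avg_duration_s=0.5,
                                 memory_gb=0.5, stored_gb=100)
budget = 15.0
print({name: round(value, 2) for name, value in estimate.items()})
print("within budget:", estimate["total"] <= budget)
```

Keeping the cost drivers separate makes it easier to feed them into a tradeoff analysis, for instance when comparing "compute vs. store" options.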

“Fit for purpose” and “rightsized and modular” are strongly related to this trait, which is not addressed by SUPER-IDEAL directly.

7. Agile and Tool-Supported

Unlike the previous six traits, this one deals with software engineering in general, and with the processes and practices for crafting CNAs in particular.

Building CNAs is typically supported by (and benefits from):

DevOps incl. Continuous Integration and Deployment (CI/CD).
Many of the factors in the Twelve-Factor App are related. Examples include: I. Codebase, III. Config and X. Dev/prod parity.
Comprehensive unit, integration and system testing — and automation of such tests.
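Factor III of the Twelve-Factor App asks for configuration to be stored in the environment. A minimal Python sketch of this idea might look as follows; the variable names (`APP_DATABASE_URL` and so on), the defaults and the `Settings` class are hypothetical examples, not a prescribed convention.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    database_url: str
    queue_name: str
    request_timeout_s: float


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read all deployment-specific values from the environment, not from code."""
    return Settings(
        database_url=env["APP_DATABASE_URL"],  # required: KeyError if missing
        queue_name=env.get("APP_QUEUE_NAME", "events"),
        request_timeout_s=float(env.get("APP_REQUEST_TIMEOUT_S", "2.5")),
    )


# A plain dictionary is passed here so that the example runs anywhere;
# in a deployment, load_settings() would read os.environ.
settings = load_settings({
    "APP_DATABASE_URL": "postgres://db.example.org/iot",
    "APP_REQUEST_TIMEOUT_S": "1.0",
})
print(settings)
```

Because the same build artifact reads different values in each environment, this also supports Dev/prod parity and automated testing.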

Note that this trait also is not addressed by SUPER-IDEAL directly.

Big Picture and Application Examples

Have we captured the essentials of cloud-native? Did we miss a property or ask for too much? Let’s look at the big picture and then apply the traits in an example to find out.

Traits Summary: The “CNA Seven”

In summary, the nature of a cloud-native application manifests itself in seven traits:

1. Fit for purpose
2. Rightsized and modular
3. Sovereign and tolerant
4. Resilient and protected
5. Controllable and adaptable
6. Workload-aware and resource-efficient
7. Agile and tool-supported

An Example of Traits in Action: IoT Hub

We might find the following serverless IoT architecture deployed to a public cloud provider such as Amazon Web Services (AWS):

Sample CNA

In this example, SUPER-IDEAL and the seven traits could be achieved as follows:

E-Mail Alerts should only go out because this is a functional requirement (or contributes to satisfying one). Stateless and stateful components should be distinguished from each other; in this example, all Function components are stateless and backed by NoSQL Storage, a Relational Database and a Queue.
The serverless Functions in the Ingestion Layer must find a balance between being self-contained and focussed on a single responsibility (in terms of processing they perform or cause elsewhere and data they keep themselves or retrieve). It is possible to configure multiple instances of them, possibly in different data centers and/or cloud regions.
The API Gateway should be able to tolerate temporary outages of (or delayed responses from) the API Implementation that it depends on. The desired data consistency in the two data stores must not be compromised if invalid requests are received or long-running external activities fail to complete in time.
The IoT Hub should not crash if it receives excessive amounts of data from external devices. The API Gateway and the API Implementation must validate traffic coming from the Customer Portal and grant access to the Ingestion Layer and the Relational Database only to properly authenticated and authorized requests (exact segregation of duties to be decided).
The application should respond to management commands sent from the command line or via a user interface (this is not shown in the figure). Each component should log its key activities, for instance according to the recommendations of the Twelve-Factor App. The IoT services of AWS scale automatically when the number of devices (and/or events emitted by them) crosses a certain threshold and are integrated with AWS CloudWatch Logs. CloudWatch is a monitoring and analytics service that collects metrics and logs from all application components.
The operational expenditure caused by all architecture components (written by the cloud tenant or rented from the cloud provider) must stay within the allocated budget, which is derived from the business case for the system/software/cloud solution. Examples are cloud storage cost, compute cost as well as I/O and networking cost.
Updates to functions and data model (here: both SQL schema and NoSQL Storage structures) should be rapidly deployable so that business events can be responded to. To achieve this, a continuous integration and delivery pipeline should be in place to automate the steps required to build, test and run a new version (including thorough unit, integration, regression and security testing). In our IoT example, all components comprising the stack may be provisioned with a CloudFormation template (the infrastructure-as-code solution from AWS).
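For the logging recommendation in the list above, the Twelve-Factor App suggests treating logs as event streams written to standard output, leaving collection and routing to the platform. A minimal sketch with Python's standard `logging` module could look like this; the JSON field names and the component name `ingestion-function` are assumptions made for the example.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "component": record.name,
            "message": record.getMessage(),
        })


def make_logger(component, stream=sys.stdout):
    """Create a logger that writes structured events to the given stream."""
    logger = logging.getLogger(component)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]  # replace any handlers from earlier calls
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger


log = make_logger("ingestion-function")
log.info("stored %d events", 3)
```

One JSON object per line is easy for a log collection service to parse; the component itself neither opens nor rotates log files.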

Example 2: Network Service Programming App

A second example of a CNA is the “Segment Routing Service Programming” (SerPro) application, which allows customers to program service steering policies via a dedicated GUI; these services are network services such as firewalls or intrusion detection/prevention systems, which were typically consumed statically before.

When the INS Cloud Networking Team started building such applications, they worked fine in the team's small lab network but failed in large networks. The team quickly recognized that only cloud-native applications would meet their requirements, including massive scalability (scaling out and scaling in), auto-scaling, loose coupling (via message queues) and survivability (for example with a Kubernetes deployment). Timely availability of data is another key property of SerPro, addressed by caching in front of NoSQL databases such as ArangoDB; intelligent caching makes the SerPro application more efficient by not repeatedly querying data that has not been modified.
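The caching idea can be illustrated with a small read-through cache that is invalidated on writes. This Python sketch is not taken from SerPro; the class name, the keys and the plain dictionary standing in for a NoSQL database are assumptions made for illustration.

```python
class CachedRepository:
    """Read-through cache in front of a slower store, invalidated on writes."""

    def __init__(self, backend):
        self._backend = backend  # assumption: a dict stands in for the database
        self._cache = {}
        self.backend_reads = 0  # counted to make the effect of caching visible

    def get(self, key):
        if key not in self._cache:
            self.backend_reads += 1
            self._cache[key] = self._backend[key]
        return self._cache[key]

    def put(self, key, value):
        self._backend[key] = value
        self._cache.pop(key, None)  # modified data must not be served from cache


database = {"policy-1": "steer via firewall"}
repo = CachedRepository(database)
repo.get("policy-1")
repo.get("policy-1")  # served from the cache, no second backend query
repo.put("policy-1", "steer via IDS")
value = repo.get("policy-1")  # re-read from the backend after the update
print(value, repo.backend_reads)
```

In a distributed deployment, invalidation would also have to reach the caches of other instances, for instance via expiry times or change notifications.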

We can easily trace these architectural characteristics back to the traits from above.

Retrospective: Summary and Conclusions

Applications that live in the cloud should be engineered well so that they become friendly citizens of their hosting clouds; the software engineering and software architecture communities have been teaching us related principles and patterns for a long time. More specifically, the inherent characteristics of cloud platforms and the value proposition of cloud computing suggest that cloud-native applications should put particular emphasis on quality attributes such as reliability, efficiency and flexibility.

A number of properties help to make sure that an application is able to run in the cloud — and able to benefit from cloud capabilities such as elasticity and pay-per-use: Secure, Updatable, Polyglot, Explicit contract and dependencies, Resilient. These five “SUPER powers” summarize the architecture and design advice not already covered by the IDEAL properties established previously: Isolated State, Distribution, Elasticity, Automated Management, Loose Coupling.

Ten properties might be too difficult to remember (and SUPER-IDEAL might come across as a bit cheesy). So we went down to seven essential characteristics (or traits) of CNAs: Fit for Purpose (and Context), Rightsized and Modular, Sovereign and Tolerant, Resilient and Protected, Controllable and Adaptable, Workload-aware and Resource-efficient, Agile and Tool-supported.

If your on-premises application violates one or more of the “CNA Seven” (or is not SUPER-IDEAL yet), you may want to try one or more of our quality- and smell-driven API Refactorings or apply some Microservice API Patterns 😉.

Do you agree with us? Or would you pick different traits? Let us know.

– Olaf (Zimmermann) and Mirko (Stocker)

The Medium version of this post is “What is a Cloud-Native Application Anyway? 10 SUPER-IDEAL Application Properties and 7 Cloud-Native Traits”.

Acknowledgment

We would like to thank our colleague Laurent Metzger and his team who contributed the SerPro CNA example.

Notes

¹ for instance, “controlled redundancy” might make sense, but must be managed
² the L in IDEAL also promotes U, P as well as R, and the A requires U 😌
³ this post explains the term (and provides more examples)
⁴ check out classical works such as “On the Criteria To Be Used in Decomposing Systems into Modules” or look for the roots of the term “separation of concerns”
⁵ both the fallacies and Postel’s Law are frequently cited and taught; they are much older than the cloud computing concepts, which does not mean that they are no longer relevant
