Observability of software design - What it is and why it matters
As developers, we have lots of tools for finding out how code works in detail, and rooting out problems. Debuggers, profilers, Wireshark, and their ilk are great for finding and fixing bugs, and also for building knowledge about how software works.
But how do we build understanding and troubleshoot the behavior of software from a higher level? Software design is just as important as implementation detail, if not more so. Want proof? Check out the Top 25 Most Dangerous Software Weaknesses, and note this key observation:
The biggest movement up the list involves four weaknesses that are related to Authentication and Authorization
- CWE-522 (Insufficiently Protected Credentials): from #27 to #18
- CWE-306 (Missing Authentication for Critical Function): from #36 to #24
- CWE-862 (Missing Authorization): from #34 to #25
- CWE-863 (Incorrect Authorization): from #33 to #29
All four of these weaknesses represent some of the most difficult areas to analyze a system on.
(My emphasis added)
One reason why these design-related weaknesses are becoming more and more important is that the tools for detecting low-level weaknesses are getting better. In our ongoing survey of State of Software Architecture Quality, 65% of respondents indicate that they are using static code analyzers to find and fix low-level flaws, and another 22% say that they want to start using these tools. As a result, the impact of simple coding errors like CWE-835 (Loop with Unreachable Exit Condition) and CWE-704 (Incorrect Type Conversion or Cast) is falling, while design flaws are rising in importance.
Do these flaws have consequences in the real world? They absolutely do. Take a look at these three examples:
- NHS improperly distributes information on 150,000 patients
- Facebook gave up credentials to 30 million users
- Citibank just accidentally distributed $500 million after an app confused its users with bad usability
Each of these mistakes inflicted $10s to $100s of millions of dollars in damage. It might seem like they are so big that they are beyond the scope of a single developer to find and fix. But there’s a funny thing about software, and about data loss and breaches in particular. All the systems listed above had safeguard mechanisms in place. It took multiple failures for these mishaps to occur; and if only one of these problems had been identified and fixed by an observant developer, the problem may not have occurred.
In all likelihood, these doomed apps were built by competent teams of developers, working using modern methods and tools. It shouldn’t take a team of geniuses to move health data, or money, or cat photos around the Internet. These are tasks that should be achievable by, dare we say it, “average” development teams. So I personally refuse to blame the personnel involved. In that case, where does the root cause lie? If you have an important flaw in a system, and competent people are looking at that system, yet they can’t see the flaw, then the flaw must be insufficiently visible.
In the business world, they call this an “observability problem”. Observability tools are commonly deployed to monitor and manage network traffic volume, response time, latency, error rate, and many other key metrics. Teams of site reliability engineers (SREs) operate observability systems on behalf of the business.
The job of observability tools is to make uncommon events obvious. They do this by observing the external outputs of a system, then suppressing unimportant details and emphasizing important features. One request out of a few thousand may fail, latency may be inching above a threshold on one network. These signals, which may seem small, are often the harbinger of bigger problems to come. Observability tools watch for these early signs, and allow people to take action early and head off bigger problems.
Catching and stopping problems early is good! So, as software developers, what tools do we have today for observability of software design and architecture? How can we get early signals that our system may be at risk due to bad design decisions or degredation? How can we get that information while the code is in development rather than finding out the hard way, in production?
If you look at a typical dependency map or trace diagram of a software system, it seems to be composed of functional parts, just like a machine. There are subsystems for ingesting data, enforcing authentication and authorization, sanitizing input, transforming and storing data, sending messages, scheduling tasks for future execution, etc.
To treat a software system like a machine, an observability system for software design needs to:
a) Recognize the components that comprise a software system. b) Understand how components are designed to be used. c) Recognize and flag anomalous use or configuration of these components. d) Filter out the “noise” of low-level code changes, and provide a high-level “signal” of design changes which you can receive and review.
In order to help make this a reality, we have created the open source AppLand Framework. The AppLand Framework includes components to:
- Record end-to-end code and data flows.
- Save these recordings as JSON files called AppMaps.
- Automatically visualize AppMaps as dependency maps, execution traces, and more.
- Bring all of this capability into your code editor, using the AppMap extension for VS Code.
The next frontier for the AppLand Framework is observability for software design. AppMaps provide a unique type of data about software. Problems which can’t be solved using static code analysis can be solved using simple algorithms on AppMap data. Those difficult-to-analyze software weaknesses can be found. By carrying this idea forward, we aim to make software design safer, more secure, and higher quality than ever before. And we want to empower all developers to see, understand, and critique the design of the systems that they work with. To join us, please check out the links above, and say “hello” on the AppLand Discord server. Thanks for reading, and we hope to meet you soon!
Originally posted on Dev.to