Understanding Observability and OpenTelemetry in One Article

  • An introduction to observability
  • The core concepts of OpenTelemetry

Rethinking Observability

Management guru Peter Drucker once said: "If you can't measure it, you can't manage it." In an enterprise, whether you are managing people, things, or systems, measurement comes first. Measurement is, in essence, a process of collecting information; only with sufficient information can we make correct judgments, and only with correct judgments can we form effective management and action plans.

Below I use a simple model to illustrate my understanding of observability:

Caption: See the appearance through observation, locate the problem through judgment, and solve the problem through optimization.

Observability describes the continuity and efficiency of the closed loop of "observe, judge, optimize, observe again." If there is only observation but no judgment based on it, that is not observability. If there is only judgment from experience, without data to support it, that is not observability either; it makes the organization highly dependent on individual ability and introduces management risk. If optimization feeds nothing back to observation, or introduces new technologies that cannot themselves be observed, then the observability is not sustainable. And if closing the observe-judge-optimize loop carries high cost and high risk, then the value of that observability is negative.

Therefore, when we talk about observability, what we are really considering is the experience of observers and managers. That is, when we encounter a problem, can we find the answer on the observation platform easily, without friction or confusion? That is observability. As an enterprise develops, its organizational structure (the roles, the observers) and its management objects (the systems, the observed) develop and change with it. When a pile of traditional observation tools still cannot meet the new needs of observers and managers, we cannot help but ask: "Where did observability go?"

“Observable” is not the same as “observability”

Next, let’s look at the way we usually observe.

Caption: Traditional observation tools are vertical, and observers need to use multiple tools to judge problems.

Usually we build observation tools around the data we want. When we want to understand the health of the infrastructure, we naturally think of building a dashboard that monitors various metrics in real time. When we want to understand where the business went wrong, we naturally think of building a log platform so we can filter and inspect business logs at any time. When we want to understand why a transaction has high latency, we naturally think of building a distributed tracing platform to query topological dependencies and the response time of each node. This model works well and has solved many problems for us, so we never doubt our observability and remain full of confidence. Occasionally, when we hit a big problem, we open the dashboard, the log platform, and the tracing platform. All the data is right there, and we firmly believe we can find the root cause. Even if it takes a long time, we just tell ourselves to learn more and understand the system we are responsible for better, so that next time we will surely find the root cause faster. After all, when the data we want is right in front of us, what reason do we have to blame the observation tools?

Caption: The human brain is like a ruler, comparing multiple indicators based on experience to discover their correlations.

Caption: When you spot a glitch in a metric, you often have to construct complex log query conditions in your head, which is not only time-consuming but also error-prone.

We work tirelessly to find possible correlations in all kinds of metric data. After getting a key clue, we construct a pile of complex log query conditions in our heads to verify the guess. Comparing, guessing, and verifying while switching between tools is, admittedly, quite fulfilling.

Illustration: When the system becomes large in scale, it is no longer possible for humans to locate the problem.

Traditional systems were relatively simple, and the above approach worked. The keywords of modern IT systems are distribution, pooling, big data, zero trust, elasticity, fault tolerance, cloud native, and so on. Systems are becoming larger, more sophisticated, more dynamic, and more complex. Relying on people to find the correlations between pieces of information and then make judgments and optimizations based on experience is clearly no longer feasible: it is time-consuming and labor-intensive, and the root cause often cannot be found.

Traditional tools are vertical: whenever a new component is introduced, a corresponding observation tool is introduced with it. This ensures the comprehensiveness of the data but loses the relevance of the data and the consistency of analysis and troubleshooting (in other words, we monitor everything, yet when problems occur we still cannot locate them well). At this point we naturally think of building a unified data platform, imagining that putting all the data in one place will solve the correlation problem. In practice, we often just pile the data up in one place and, when we use it, still look at each kind separately in the traditional way. We have merely merged countless pillars (tools) into three: a unified platform for metrics, logs, and traces. The data is unified, but correlation still depends on human knowledge and experience.

The key is to solve the problem of data correlation and let programs handle what previously required manual comparison and filtering. Programs are best at such work and are also the most reliable at it, freeing people to spend more time on judgment and decision-making. In complex systems the time saved is magnified many times over, and this "small thing" is the visible future of observability.

Illustration: Future observation tools will need to correlate data across time and context

So how do we correlate data? Simply put, we correlate it in time and in space. On a unified data platform the data comes from various observation tools; even though we have unified the data formats into metrics, logs, and traces, the metadata of different tools' metrics, logs, and traces is completely different. Sorting out and mapping all this metadata on the platform itself would be complicated, hard to maintain, and unsustainable. So what do we do? The answer is standardization. Only by feeding standardized, structured data to the observation platform can the platform discover real value in it. Unifying the data format alone is not enough: to correlate traces, metrics, and logs, the context must be standardized as well. Context is the spatial information of the data, and combining it with time information unlocks the real value of observation.
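To make this concrete, here is a sketch of what a standardized log record looks like once trace context is attached. The field names loosely follow the OTel log data model; the values are purely illustrative:

```python
# A log record in the spirit of the OTel log data model (values illustrative).
# Because the record carries trace_id/span_id and resource attributes, a
# platform can join it with traces and metrics automatically instead of
# relying on a human to construct query conditions.
log_record = {
    "timestamp": "2024-01-01T12:00:00Z",
    "severity_text": "ERROR",
    "body": "payment failed: card declined",
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",  # same id appears in the trace
    "span_id": "00f067aa0ba902b7",
    "resource": {"service.name": "checkout-service"},  # same resource as its metrics
}
```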

What does OpenTelemetry do?

OpenTelemetry (hereinafter: OTel) is a project that solves the problem of data standardization. OTel consists of the following parts:

  • Cross-language standard specification: defines the specifications for data, context, APIs, concepts, and terminology. This is the core of OTel: it unifies all observation data organically so that the observation platform can compare and filter automatically, while also providing high-quality data for AI.
  • Collector: a tool for receiving, processing, and exporting observation data. It receives OTel telemetry, processes it through configurable pipelines, and exports it to a specified backend (a minimal pipeline configuration is sketched after this list).
  • SDKs in various languages: implementations of the OTel standard API in each language, supporting custom development of observation data collection.
  • Instrumentation: out-of-the-box instrumentation that emits observation data.
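As an illustration, a minimal Collector pipeline might look like the following sketch. This is not a definitive setup: component names vary across Collector versions (for instance, the debug exporter was called "logging" in older releases):

```yaml
receivers:
  otlp:                 # receive telemetry over OTLP
    protocols:
      grpc:
      http:

processors:
  batch:                # batch telemetry before export

exporters:
  debug:                # print telemetry to the Collector's own log

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
```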

OTel is an open source project, and everything in it can be found on GitHub. Here are some of the key concepts.

Attributes

From the data perspective, attributes are key-value pairs. In essence, attributes describe spatial information, which makes it possible to correlate data in space. OTel defines many common attributes; if their definitions are unclear or the data is inconsistent, automatic correlation and analysis become impossible. For example, OTel's semantic conventions define a set of attributes for Kubernetes Pods, a few of which are shown below.
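For illustration, here is a small subset of the Kubernetes Pod attributes from OTel's semantic conventions, expressed as plain key-value pairs (the keys are standardized; the values here are made up):

```python
# A subset of the K8s Pod attributes defined in the OTel semantic conventions.
# Keys are standardized; the values below are placeholders.
k8s_pod_attributes = {
    "k8s.namespace.name": "production",
    "k8s.pod.name": "checkout-7d9fd6c6b-x2hkq",
    "k8s.pod.uid": "3f8c9a1e-0000-0000-0000-000000000000",
    "k8s.node.name": "node-1",
}
```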

Resource

From the data perspective, a resource is a set of key-value pairs. In essence, a resource describes the observed object. Metrics, logs, and traces from the same observed object carry the same resource data (that is, the same context), so their correlations can be discovered automatically.
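A minimal sketch with the OTel Python SDK: declare the resource once, and every span emitted by this process will carry the same resource attributes (the service and namespace names here are illustrative):

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

# Describe the observed object once; all signals from this process share it.
resource = Resource.create({
    "service.name": "checkout-service",  # illustrative names
    "service.namespace": "shop",
})

# Attach the resource to the tracer provider so every span carries it.
trace.set_tracer_provider(TracerProvider(resource=resource))
```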

Event

From the data perspective, an event is a timestamp plus a set of attributes; it describes what happened at a certain moment. In essence, an event is a combination of time and space.
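In the OTel API this appears, for example, as span events: a timestamped record plus attributes, attached to the current operation. A minimal sketch (the operation and attribute names are illustrative):

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("load-user") as span:
    # An event = a timestamp (implicitly "now") + a set of attributes.
    span.add_event("cache_miss", attributes={"cache.key": "user:42"})
```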

Metric

From the data perspective, metrics are aggregations of events. In a live system the same events occur continuously, and metrics provide an overview of them across time and space. Immersing yourself in the details does not necessarily yield insight; stepping back and taking a bird's-eye view from a higher dimension may bring inspiration.
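A sketch with the OTel Python metrics API: each call to add() records one occurrence of an event, and the counter aggregates those occurrences over time (the instrument and attribute names are illustrative):

```python
from opentelemetry import metrics

meter = metrics.get_meter(__name__)

# Aggregate "a request happened" events into a single metric over time.
request_counter = meter.create_counter(
    "http.server.requests", unit="1", description="Count of HTTP requests"
)

# Each call records one event; attributes supply the spatial dimensions.
request_counter.add(1, {"http.route": "/checkout", "http.status_code": 200})
```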

Span

From the data perspective, a span consists of an operation name, a start time, a duration, and a set of attributes. A span describes a process: where an event establishes the correlation of time and space at a single point in time, a span establishes it over an interval of time.
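A minimal sketch of creating a span with the OTel Python API; the start time and duration are captured by the context manager, and attributes supply the spatial information (the operation and attribute names are illustrative):

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# A span = operation name + start time + duration + attributes.
with tracer.start_as_current_span("checkout") as span:
    span.set_attribute("order.id", "12345")  # illustrative attribute
    # ... do the actual work here; the duration is measured automatically
```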

Signal

Signals are an abstraction over standard telemetry data; data that shares the same data model is grouped as one signal. For example, metrics are a signal, and all metrics follow one standard data model; traces are a signal, and all traces follow one standard data model. An important property of signals is that they are vendor-independent: any observability vendor that supports OTel must collect, report, and process data according to OTel's signal models. This is the key to efficient data correlation.

Context

All signals are grounded in the same context. For example, the metrics, logs, and traces collected within one service share the same context (such as service.name and service.instance.id). This is exactly the spatial correlation of data described earlier.
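A sketch of reading the current trace context, which is what allows a log line emitted inside a span to be stamped with the same trace_id as the span itself (illustrative; in real deployments a logging integration does this automatically):

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("checkout"):
    # Any signal emitted here can share the span's context.
    ctx = trace.get_current_span().get_span_context()
    print(f"payment failed trace_id={ctx.trace_id:032x} span_id={ctx.span_id:016x}")
```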

Respect the Engineering

OTel provides standard specifications and many ready-to-use tools at the data level, which greatly eases the construction of observability platforms. Actually building an observability platform that fits your needs and is fully scalable, stable, reliable, low-cost, and efficient, however, is a major engineering effort, not something you get simply by adopting OTel. It involves systems problems such as big-data engines, high-cardinality analysis engines, relationship engines, and AI engines. Beyond that, designing a platform that is simple, efficient, accurate, collaborative, and professional cannot be done overnight; it requires a deep understanding of data, technology, and design.

I divide observability platforms into the following levels:

  1. Data display + manual correlation and comparison + manual judgment: most traditional observation platforms are at this level.
  2. Correlated display + manual judgment: some observation platforms can display certain correlations through curation and mapping, reducing the time cost of manual discovery.
  3. Informed judgment + manual judgment: a very small number of observation platforms have highly standardized data and can offer insights and suggestions based on correlations.
  4. Informed judgment + automated action: no observation tool has truly reached this level yet.

Borei Data has accumulated more than a decade of technology at the data collection layer: its probes are stable and reliable, and deployment is simple. Its data processing has withstood the test of customers with very large business volumes, and continuous technical innovation has produced a highly advantageous architecture. It has also formed its own system for data standardization and structured design. It is fair to say we have just crossed from the second level to the third. We will enrich standardized data in both the breadth and the depth of observation, continue to deepen data correlation on that basis, and, empowered by our self-developed SwiftAI platform, provide ever more accurate judgments in the future, helping customers quickly realize an efficient and sustainable observe-judge-optimize closed loop.
