Gateway programming: How to reduce R&D costs through user gateways and caches?

If user traffic is a surging tide, then the gateway is the dam that absorbs its impact. In large-scale Internet projects the gateway is indispensable and is currently our best line of defense. Through the gateway we can divert large volumes of traffic to the various services behind it, and by using the capabilities of a Lua script engine we can also greatly reduce system coupling and performance loss, saving costs.

Generally speaking, gateways are divided into external network gateways and internal network gateways. The external network gateway is mainly responsible for rate limiting, intrusion prevention, and request forwarding; a common approach is to build it with Nginx + Lua. In recent years the internal network gateway has seen many customized variants, such as Service Mesh and sidecar designs, as well as products like Kong and Nginx Unit. Although their uses differ, their main functions remain load balancing, traffic management and scheduling, and intrusion prevention.

External network gateway function

Let's start with the external network gateway. I will share two practical external-gateway designs with you; they can help us prevent intrusion and cut business dependencies.

Crawler sniffing and identification

When operating a high-traffic website, common security issues include hotlinking and bot crawling. To prevent these problems, we can adopt some effective strategies at the gateway, such as rate limiting and intrusion detection.

Hotlink prevention: Hotlinking (other sites embedding or referencing our resources) leads to abuse of our site's bandwidth and content. A common defense is to check the Referer header of each request: if the Referer is not our own domain, the request is rejected. This effectively reduces the risk of unauthorized access to resources.
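As a rough illustration, such a Referer check takes only a few lines of Nginx + Lua. This is a minimal sketch: the location path and allowed domain are placeholders, and note that Nginx's built-in valid_referers directive can achieve the same thing without Lua.

```nginx
# Minimal hotlink-prevention sketch (Nginx + Lua / OpenResty).
# The location path and "example.com" domain are placeholders.
location /static/ {
    access_by_lua_block {
        local referer = ngx.var.http_referer
        -- Reject requests whose Referer is missing or not from our own site.
        if not referer or not referer:find("example.com", 1, true) then
            return ngx.exit(ngx.HTTP_FORBIDDEN)
        end
    }
    root /data/www;
}
```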

Bot crawling prevention: Bot crawling is another common problem. To identify and stop bots, we can take the following approaches:

  1. Analyze anonymous user requests: For anonymous users, we can count requests per IP address and time window to spot IPs with abnormal request frequency. High-frequency IPs should be marked for close attention (see the rate-counting sketch after this list).
  2. Analyze the behavior of logged-in users: For logged-in users, we can count requests per time block. If a user's request frequency exceeds the normal range, we can reject their requests and put them on a "suspicious list" for follow-up investigation.
  3. Dynamically inject JS sniffing code: To identify bots more accurately, the gateway layer can dynamically inject JS sniffing code for suspicious users or IPs. The code writes a specific ciphertext into the user's Cookie or LocalStorage, and the front-end JS is required to detect it. If the ciphertext exists and matches the expected pattern, the front end enters an "anti-bot mode" that checks for mouse movement or click behavior, which tells us whether a human is operating the client. If no such action appears for a long time, the system lists the client as a candidate for banning and blocks its requests.
  4. Whitelist strategy: Since this anti-bot design may interfere with search-engine crawling and is therefore unfriendly to SEO (search engine optimization), we can adopt a whitelist that allows common search-engine bots to access the site. We can admit mainstream crawlers based on their User-Agent and regularly audit their IP addresses.
  5. Interface signature strategy: For some core interfaces, we can introduce the rule "every request must carry a timestamped signature", and only requests with a correct signature are served. This effectively stops simple crawlers from scraping data by calling the interface directly (see the signature sketch at the end of this subsection).
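For item 1, a minimal OpenResty sketch of per-IP counting might look like the following. The shared-dict name, window, and threshold are assumptions, and incr's init_ttl argument needs a reasonably recent OpenResty.

```nginx
# Per-IP request-counting sketch (item 1). Assumes a shared dict
# declared in the http block:  lua_shared_dict ip_counter 10m;
access_by_lua_block {
    local dict = ngx.shared.ip_counter
    local key  = "ip:" .. ngx.var.remote_addr
    -- incr() with an init value creates the key atomically; the last
    -- argument sets a 60-second window (init_ttl, OpenResty >= 1.13).
    local count, err = dict:incr(key, 1, 0, 60)
    if count and count > 600 then
        -- More than ~10 req/s sustained from one IP: flag and reject.
        ngx.log(ngx.WARN, "suspicious IP: ", ngx.var.remote_addr)
        return ngx.exit(429)
    end
}
```

A real deployment would feed the flagged IPs into the "key focus" list described above rather than rejecting outright.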

Through these measures we can effectively deal with hotlinking and bot crawling, protect the site's resources and data, and at the same time stay friendly to search engines so that SEO keeps working normally.
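And for item 5, here is a hedged sketch of timestamp-signature verification. The parameter names (ts, sign), the shared secret, the HMAC-SHA1 signing scheme, and the five-minute window are all illustrative choices, not a standard.

```nginx
# Signature-check sketch for a core interface (item 5).
access_by_lua_block {
    local secret = "my-shared-secret"          -- placeholder secret
    local args   = ngx.req.get_uri_args()
    local ts     = tonumber(args.ts)
    local sign   = args.sign
    -- Reject unsigned or stale requests (older than 5 minutes).
    if not ts or not sign or math.abs(ngx.time() - ts) > 300 then
        return ngx.exit(ngx.HTTP_FORBIDDEN)
    end
    -- Expected signature: HMAC-SHA1 over URI + timestamp, base64 encoded.
    local expected = ngx.encode_base64(
        ngx.hmac_sha1(secret, ngx.var.uri .. tostring(ts)))
    if sign ~= expected then
        return ngx.exit(ngx.HTTP_FORBIDDEN)
    end
}
```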

Gateway authentication and user center decoupling

Earlier we discussed using the gateway to block abusive traffic. Beyond defending against attacks and preventing resources from being consumed maliciously, the gateway can also help us remove some business dependencies. For example, in the design of user login, each business does not need to depend directly on the user center to verify a user's legitimacy.

User authentication is usually implemented by integrating the user center's SDK into every sub-business, so that all of them share the same verification logic. This is convenient, but it creates new problems: SDK version synchronization and upgrade maintenance. Basic public components generally ship an SDK to make business development easier; if a service is exposed only through an API, some special operations have to be re-implemented by every caller. However, once an SDK is released, we must be prepared to maintain multiple SDK versions online at the same time to guarantee compatibility and functional stability.

The following figure compares verifying tokens through the integrated SDK with authenticating directly against the user center's interface:

[Figure: token verification via SDK vs. direct authentication through the user center]

With the SDK integrated, each business can verify a user's identity by itself without repeatedly calling the user center. However, this solution brings its own challenges. Because the SDK is embedded in every project, projects rarely upgrade it, preferring stability. This puts up resistance to subsequent upgrades of the user center, since every dependent business must be taken into account. Each large-scale upgrade of a basic service then requires a lot of manpower just to synchronize SDK versions, which raises maintenance complexity.

To remove this coupling, we can consider another design: move user login authentication into the gateway layer. Business systems then no longer depend directly on the user center's SDK; identity authentication and permission verification are completed at the gateway. The gateway authenticates the user as soon as it receives the request, and only verified requests are forwarded to the business services, decoupling the user center from the individual business systems.

The following figure shows the request flow under this design; I will analyze its working mechanism and advantages below.

[Figure: request flow with authentication performed at the gateway]

Combined with the figure above, let's walk through the implementation. When a user requests a business interface, the gateway first authenticates the requesting user. If authentication succeeds, the user's information is put into the request headers and passed on to the backend service. The business API never needs to care about the user center's implementation details; it simply reads the user information from the headers and continues its work.

If a business requires login, it can add one check in its logic: does the request header contain a uid? If the uid is missing, a unified error code is returned to the front end, prompting the user to log in first. This authentication design effectively decouples the business modules from the user center; even when the user center's logic changes, the business modules do not have to be upgraded in step.
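A minimal sketch of the gateway side follows. The X-User-Id header name, the cookie name, and the verify_token helper are all assumptions; in practice the lookup would hit the user center's session store or validate a JWT.

```nginx
# Gateway-side authentication sketch.
location /api/ {
    access_by_lua_block {
        -- Hypothetical token check; replace with a real Redis/JWT/session
        -- lookup against the user center's store.
        local function verify_token(token)
            if token == nil then return nil end
            return nil  -- stub: return the uid on success
        end

        local uid = verify_token(ngx.var.cookie_token)
        -- Strip any client-supplied uid header, then inject the real one.
        ngx.req.clear_header("X-User-Id")
        if uid then
            ngx.req.set_header("X-User-Id", uid)
        end
    }
    proxy_pass http://backend_api;  # upstream name is a placeholder
}
```

Note that clearing the client-supplied header before injecting the verified one matters: otherwise a caller could forge X-User-Id directly.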

Beyond basic login authentication, this design also enables more flexible permission management at the gateway layer. For example, role-based access control (RBAC) or attribute-based access control (ABAC) can be switched on for particular domain names, tailoring permission policies to different business scenarios. Through the gateway we can also grant different users different permissions and support advanced features such as gray-release (canary) testing, improving the system's flexibility and security. A toy per-domain RBAC check might look like the sketch below.
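In this sketch the host-to-role table and the X-User-Role header are illustrative assumptions; the role is presumed to have been injected by the authentication step above.

```nginx
# Toy per-host RBAC sketch: each domain requires a role.
access_by_lua_block {
    local required = {
        ["admin.example.com"] = "admin",     -- placeholder domains/roles
        ["ops.example.com"]   = "operator",
    }
    local need = required[ngx.var.host]
    local role = ngx.req.get_headers()["X-User-Role"]  -- set by auth step
    if need and role ~= need then
        return ngx.exit(ngx.HTTP_FORBIDDEN)
    end
}
```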

Intranet Gateway Service

Now that we have seen two practical uses of the external network gateway, let's look at the internal network gateway. It can provide failure retries and a smooth-restart mechanism; let's look at them separately.

Retry on failure

During a release or upgrade, or when a failed service restarts, the system may be temporarily unavailable. If a user request arrives during this window, a 504 error may be returned because the backend is not responding, hurting the user experience. To improve this, we can use the intranet gateway's automatic retry function.

When a request reaches the backend but the service returns an error such as 500, 403, or 504, the gateway does not have to return the error immediately. It can hold the request briefly and retry it, or directly return previously cached content. This lets the business achieve smooth hot updates: the service appears more stable, and users barely perceive the fluctuations during an online upgrade.
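A minimal Nginx sketch of this behavior follows; the upstream addresses and timing values are illustrative, and the stale-cache fallback assumes a proxy_cache zone is already configured.

```nginx
# Failure-retry sketch for an intranet gateway.
upstream backend {
    server 10.0.0.1:8080;   # placeholder instances
    server 10.0.0.2:8080;
}
server {
    location / {
        proxy_pass http://backend;
        # Retry the next upstream on connection errors, timeouts, or 5xx.
        proxy_next_upstream error timeout http_500 http_502 http_504;
        proxy_next_upstream_tries 3;
        proxy_next_upstream_timeout 10s;
        # Fall back to a previously cached response when all retries fail
        # (requires proxy_cache to be configured).
        proxy_cache_use_stale error timeout updating http_500 http_502 http_504;
    }
}
```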

Smooth restart

During a service upgrade, the smooth-restart mechanism prevents the service process from exiting the moment it receives a kill signal. Concretely, the service first stops accepting new requests and waits for in-flight requests to complete; if processing exceeds a timeout (for example, 10 seconds), the service is forced to exit. This helps ensure ongoing requests are handled properly and reduces the impact of service interruptions on users.
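For an Nginx-based gateway itself, this ten-second cutoff maps directly onto a built-in directive (available since Nginx 1.11.11); the value shown is illustrative.

```nginx
# Graceful-shutdown sketch. On SIGQUIT (e.g. `nginx -s quit`) workers
# stop accepting new requests and finish in-flight ones; this directive
# force-exits workers still busy after 10 seconds.
worker_shutdown_timeout 10s;
```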

With this mechanism, in-flight request processing is not interrupted, so the business transaction being handled is guaranteed to complete; otherwise a transaction could easily be left inconsistent or half-finished. Together, retries and smooth restarts let us deploy code and release new features online at any time. Note, however, that once this is enabled some online failures may be masked, so we should pair it with monitoring of the gateway service to keep an eye on the system's real status.

Comprehensive application of internal and external gateways

First, let's look at the gateway's interface-cache function, which uses the gateway to cache the responses of some interfaces. It is suitable for service-degradation scenarios, temporarily absorbing user traffic or reducing intranet traffic. The implementation is shown in the figure below:

[Figure: gateway interface caching with temporary cache and TTL]

From the figure we can see that the gateway's cache is usually implemented with a temporary cache plus a TTL (time to live). When a user requests an interface whose response is already cached and not yet expired, the gateway returns the cached data directly to the client. This significantly reduces the load on backend data services.

However, this approach is a trade-off: it sacrifices strong data consistency in exchange for performance. The cache itself must also be fast enough to handle the high QPS (queries per second) of external traffic. To limit cache-penetration traffic, the cached data can be refreshed periodically by scripts: when the gateway finds a valid cache entry it returns it directly; on a miss it requests the backend service and caches the result.
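A sketch of such a TTL cache using Nginx's proxy cache follows; the paths, zone name, and times are assumptions. proxy_cache_lock collapses concurrent misses so that only one request penetrates to the backend.

```nginx
# Interface-cache sketch with TTL (values are illustrative).
proxy_cache_path /data/nginx/cache keys_zone=api_cache:100m max_size=1g;

server {
    location /api/hot/ {
        proxy_pass http://backend;
        proxy_cache api_cache;
        proxy_cache_key $scheme$host$request_uri;
        proxy_cache_valid 200 30s;   # TTL: cache successful answers for 30s
        # Collapse concurrent misses into one upstream request to
        # limit cache-penetration traffic.
        proxy_cache_lock on;
        add_header X-Cache-Status $upstream_cache_status;
    }
}
```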

This implementation is more flexible than relying solely on cache TTLs and improves data consistency, but it also increases development and maintenance complexity, requiring extra code and operations work to keep the system stable and the data consistent.

[Figure: cache refresh via scheduled scripts]

Of course, it is recommended that each cached payload stay under 5 KB (100,000 QPS × 5 KB ≈ 488 MB/s), because overly long values will slow down the cache service's response.

Service Monitoring

Finally, let's discuss using the gateway for service monitoring. Without distributed tracing, most system monitoring relies on the gateway's logs: by analyzing the HTTP status codes in the gateway's access log we can determine whether services are running normally, and combined with request response times we can implement basic system monitoring.

Specifically, the gateway's access log records the HTTP status code (such as 200, 500, 404, etc.) and response time of each request. This information can help us monitor the health of the service, such as determining whether there are abnormal error codes (such as 500 errors) or request timeouts, and thus discover potential problems in a timely manner.
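For instance, a log_format along these lines (the format name and field selection are illustrative) exposes exactly the status and timing data described above:

```nginx
# Access-log sketch exposing status codes and timings for monitoring.
log_format monitor '$remote_addr [$time_local] "$request" '
                   '$status $body_bytes_sent '
                   '$request_time $upstream_response_time';
access_log /var/log/nginx/access.log monitor;
```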

The following diagram shows how to monitor the running status of services through the gateway; I will analyze the details of the process below.

[Figure: monitoring service status through the gateway's access logs]

To judge the status of online services more easily, we first aggregate the information: periodically scan the access logs, group the errors, and total the error counts per interface. After aggregation we get data like "20 occurrences of 500 errors within 30 seconds, 15 occurrences of 504, and 40 requests to a domain's interface took longer than 1 second." Such statistics let us assess a service's health quickly. A toy aggregation script is sketched below.
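As a toy illustration, a small Lua script can tally status codes from a log slice fed via stdin; the field position assumes the 'monitor' log format sketched earlier.

```lua
-- Toy aggregation sketch: count HTTP status codes in a captured log
-- slice, e.g.:  tail -n 20000 access.log | lua agg.lua
local counts = {}
for line in io.lines() do
    -- Grab the status code: the 3-digit field right after the quoted request.
    local status = line:match('" (%d%d%d) ')
    if status then
        counts[status] = (counts[status] or 0) + 1
    end
end
for status, n in pairs(counts) do
    print(status, n)
end
```

A production version would bucket by time window and interface as described above, and push the totals to an alerting system.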

Unlike other monitoring methods, gateway monitoring covers every business. Its granularity is coarse, but it is still an effective solution. Combined with tracing, we can record trace IDs in the access log and use them to drill into the specific cause of a problem. This approach has been used at companies such as Good Future and Geek Time, where it makes troubleshooting noticeably more convenient.
