1. Can you have both high performance and fast startup speed?

As an object-oriented programming language, Java stands out for its performance. The report "Energy Efficiency across Programming Languages: How Do Energy, Time, and Memory Relate?" investigates the execution efficiency of major programming languages. Although the range of scenarios it covers is limited, it still gives us a glimpse of the big picture. From the table in the report, we can see that Java has very high execution efficiency, about half that of C, the fastest language. Among mainstream programming languages, this puts it behind only C, Rust, and C++.

Java's excellent performance comes from the excellent JIT compilers in HotSpot. Java's Server Compiler (C2) is the work of Dr. Cliff Click and uses the Sea-of-Nodes model. Time has proven that this design represents the most advanced level in the industry: the famous TurboFan compiler in V8 (the JavaScript engine) uses the same design, but implements it in a more modern way.

2. The root cause of slow Java startup

1. Complex frameworks

Jakarta EE is the new name adopted after Oracle donated Java EE to the Eclipse Foundation. The J2EE specification was released in 1999. EJB (Enterprise JavaBeans) defines the security, IoC, AOP, transaction, concurrency, and other capabilities required for enterprise development. Its design is extremely complex, and even the most basic applications require a large number of configuration files, which makes it very inconvenient to use. With the rise of the Internet, EJB was gradually replaced by the lighter and freer Spring framework, and Spring became the de facto standard for Java enterprise development.

Although Spring is positioned as more lightweight, it is still heavily influenced by Jakarta EE: early versions relied on large amounts of XML configuration, and it makes extensive use of Jakarta EE-related annotations (such as JSR 330 dependency injection) and specifications (such as the JSR 340 Servlet API). Spring remains an enterprise-grade framework. Consider one of its stated design philosophies: by providing options at every level, Spring lets you defer choices as long as possible.

We run a spring-boot-web hello world and inspect the loaded classes with -verbose:class:
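A minimal sketch of the kind of hello world used for this measurement is shown below (the artifact and class names are assumptions, not the original code):

```java
// Hypothetical hello-world endpoint; start it with:
//   java -verbose:class -jar demo.jar
// and count the "[class,load]" lines in the output (JDK 9+ unified logging).
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@SpringBootApplication
@RestController
public class DemoApplication {

    @GetMapping("/")
    String hello() {
        return "hello";
    }

    public static void main(String[] args) {
        SpringApplication.run(DemoApplication.class, args);
    }
}
```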
The number of loaded classes reaches an astonishing 7,404. For comparison, in the JavaScript ecosystem we can write a basic application using the popular express framework:
We then analyze the loaded modules using Node's debug environment variable:
Only 55 JS files are loaded here. The comparison between spring-boot and express is admittedly not entirely fair: in the Java world, applications can also be built on lighter frameworks such as Vert.x or Netty, but in practice almost everyone chooses spring-boot without hesitation in order to enjoy the convenience of the Java open source ecosystem.

2. Compile once, run everywhere

Is Java's slow startup caused by complex frameworks? The answer is that framework complexity is only one of the reasons. By combining GraalVM's Native Image capability with the spring-native feature, the startup time of a spring-boot application can be cut roughly tenfold. Java's slogan is "Write once, run anywhere" (WORA), and Java has indeed achieved this through bytecode and virtual machine technology. WORA lets developers deploy an application developed and debugged on macOS straight to a Linux server. The cross-platform nature also makes the Maven central repository easier to maintain, contributing to the prosperity of the Java open source ecosystem. Let's look at the impact WORA has on Java.

Class Loading
Each JAR package is a relatively independent functional module, and developers can depend on JARs with the functionality they need. These JARs are made known to the JVM through the class path and loaded from there. According to the JVM specification, class loading is triggered when bytecodes such as new or invokestatic are executed. The JVM then hands control to the class loader, and the most common implementation, URLClassLoader, traverses the JAR packages to find the corresponding class file:
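The sketch below is not the real URLClassLoader source, just a hedged illustration of what the search amounts to:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

// Illustrative only: probe every jar on the class path until one contains the entry.
public class ClassPathScan {
    static byte[] findClassBytes(List<JarFile> classPath, String className) throws IOException {
        String entryName = className.replace('.', '/') + ".class";
        for (JarFile jar : classPath) {                  // cost grows with the number of jars
            JarEntry entry = jar.getJarEntry(entryName); // zip central-directory lookup
            if (entry != null) {
                try (InputStream in = jar.getInputStream(entry)) {
                    return in.readAllBytes();
                }
            }
        }
        throw new IOException("class not found: " + className);
    }
}
```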
Therefore, the cost of searching for a class is usually proportional to the number of JAR packages. In large applications that number can reach the thousands, resulting in a long overall search time. After finding the class file, the JVM needs to verify that the class file is legal and parse it into an internally usable data structure, which the JVM calls InstanceKlass. We can use javap to peek at the information contained in a class file:
This structure contains interfaces, the base class, static data, the object layout, method bytecodes, constant pools, and so on. These data structures are required both for the interpreter to execute bytecodes and for JIT compilation.

Class initialization

Once a class is loaded, it must be initialized before an object can actually be created or a static method called. Class initialization can be understood simply as running a static block:
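As a hedged reconstruction of the kind of example referred to here (the class name and printed message are assumptions), a field initializer and an explicit static block are both folded by javac into a single <clinit> method:

```java
// The JAVA_VERSION_STRING initializer below is not a compile-time constant,
// so javac emits it as part of the class initializer (<clinit>).
public class VersionHolder {
    static final String JAVA_VERSION_STRING = System.getProperty("java.version");

    static {
        System.out.println("initialized, java.version = " + JAVA_VERSION_STRING);
    }
}
```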
The initialization of the first static variable, JAVA_VERSION_STRING, above also becomes part of the static block once it is compiled to bytecode. A key characteristic of class initialization is that it is executed only once.

Just-in-time compilation
We use JMH to run a Hessian serialization microbenchmark:
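A minimal sketch of such a benchmark is shown below (the payload class and field values are assumptions). It is run twice: once normally and once with -Xint added to the JVM options, to compare JIT-compiled and purely interpreted execution:

```java
import com.caucho.hessian.io.Hessian2Output;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.Serializable;

@State(Scope.Benchmark)
public class HessianSerializeBenchmark {

    public static class Payload implements Serializable {
        long id = 42L;
        String name = "startup";
    }

    private final Payload payload = new Payload();

    // Serializes one object per invocation; JMH reports the throughput.
    @Benchmark
    public byte[] serialize() throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        Hessian2Output out = new Hessian2Output(bos);
        out.writeObject(payload);
        out.flush();
        return bos.toByteArray();
    }
}
```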
The -Xint parameter in the second run forces the JVM to use only the interpreter. The difference here is 26x, which comes from the gap between executing machine code directly and interpreting bytecode. The gap is highly scenario-dependent; in our experience it is usually around 50x. Let's take a closer look at the JIT's behavior:
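As a hedged sketch (the flag names are the standard HotSpot ones; defaults can vary across JDK builds), the two tiered-compilation thresholds discussed next can be read from a running VM:

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Prints the two tiered-compilation thresholds discussed below.
public class TieredThresholds {
    public static void main(String[] args) {
        HotSpotDiagnosticMXBean hotspot =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        System.out.println(hotspot.getVMOption("Tier3CompileThreshold")); // typically 2000
        System.out.println(hotspot.getVMOption("Tier4CompileThreshold")); // typically 15000
    }
}
```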
These two JDK-internal JIT parameters govern tiered compilation. We will not go into the principles of tiered compilation here; see Stack Overflow for details. Tier 3 can be roughly understood as the client compiler (C1), and Tier 4 as C2. When a method has been interpreted about 2,000 times, it is compiled by C1; when the C1-compiled code has been executed about 15,000 times, it is compiled by C2, which finally reaches the "half the speed of C" level mentioned at the beginning of this article. When an application has just started, its methods have not yet been fully JIT-compiled, so most of the time is spent in interpreted execution, which hurts application startup speed.

3. How to optimize the startup speed of Java applications

We have spent a lot of time analyzing the main reasons why Java applications start slowly. In summary: influenced by Jakarta EE, common frameworks are designed with reuse and flexibility in mind and are therefore complex. Python and JavaScript also parse and load modules dynamically, and CPython does not even have a JIT; in theory their startup should not be much faster than Java's, but they do not use such complex application frameworks, so startup performance does not become an overall problem. Although we cannot easily change users' framework habits, we can enhance the runtime to bring startup performance as close to a native image as possible. The OpenJDK community has also been working hard on the startup problem. So, as ordinary Java developers, can we use the latest OpenJDK features to improve startup performance? For class loading, JarIndex addresses the JAR traversal problem, but the technology is old and hard to apply in modern projects that involve Tomcat and fat JARs; AppCDS addresses the cost of class file parsing.

1. AppCDS

CDS (Class Data Sharing) first appeared in Oracle JDK 1.5. AppCDS, introduced in Oracle JDK 8u40, extends it to classes outside the JDK, but was offered as a commercial feature. Oracle later contributed AppCDS to the community, and starting with JDK 10 CDS was gradually improved and gained support for user-defined class loaders (also known as AppCDS v2).

Object-oriented languages bind objects (data) and methods (operations on the data) together to provide stronger encapsulation and polymorphism. These features rely on the type information in the object header; this is true for both Java and Python. The layout of a Java object in memory is as follows:
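As a hedged aside, this layout can also be inspected at runtime with the OpenJDK JOL (Java Object Layout) tool, which prints the header (mark word plus klass pointer) and the fields of any object:

```java
import org.openjdk.jol.info.ClassLayout;

// Requires the org.openjdk.jol:jol-core dependency.
public class ObjectLayoutDemo {
    public static void main(String[] args) {
        // Prints offsets and sizes of the mark word, class pointer, and fields.
        System.out.println(ClassLayout.parseInstance("hello").toPrintable());
    }
}
```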
The mark word records the state of the object, such as its lock status and GC age. The Klass* pointer points to InstanceKlass, the data structure that describes the object's type:
With this structure, an expression such as o instanceof String has enough information to be evaluated. Note that InstanceKlass is a fairly complex structure: it contains all the methods and fields of the class, and the methods in turn contain bytecodes and other information. This data structure is obtained by parsing the class file at runtime. To guarantee safety, the bytecode must also be verified while the class is parsed (method bytecodes not generated by javac can easily crash the JVM). CDS can store (dump) the data structures produced by this parsing and verification into a file and reuse them on the next run. The dump artifact is called a Shared Archive and uses the suffix jsa (Java shared archive). To reduce the cost of reading the jsa dump and avoid deserializing the data back into InstanceKlass, the storage layout inside the jsa file is exactly the same as the in-memory InstanceKlass. When the jsa data is used, the file only needs to be mapped into memory and the type pointer in the object header pointed at that address, which is very efficient.
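As a hedged sketch of the standard AppCDS workflow (the file and class names are assumptions), the archive is produced and consumed with the following JVM flags:

```java
package com.example;

// Assumed application entry point; the AppCDS flow around it is:
//   1) Record the classes the application actually loads:
//        java -XX:DumpLoadedClassList=app.lst -cp app.jar com.example.Main
//   2) Dump them into a shared archive:
//        java -Xshare:dump -XX:SharedClassListFile=app.lst \
//             -XX:SharedArchiveFile=app.jsa -cp app.jar
//   3) Start the application with the archive mapped in:
//        java -Xshare:on -XX:SharedArchiveFile=app.jsa -cp app.jar com.example.Main
public class Main {
    public static void main(String[] args) {
        System.out.println("hello from an AppCDS-enabled run");
    }
}
```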
AppCDS cannot handle custom class loaders

The InstanceKlass stored in the jsa file is the product of class file parsing. For the boot class loader (which loads classes under jre/lib/rt.jar) and the system (app) class loader (which loads classes under -classpath), CDS has an internal mechanism that skips reading class files and matches the corresponding data structure in the jsa file by class name alone. Java also lets users define custom class loaders: by overriding ClassLoader.loadClass(), users can fully customize how classes are obtained, for example fetching them over the network or generating them on the fly in code. To keep AppCDS safe and avoid picking up unexpected classes when loading class definitions from CDS, the custom-class-loader path in AppCDS first calls the user-defined ClassLoader.loadClass() to obtain the class byte stream.
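As a hedged illustration of such a user-defined loader (the byte source below, the application's own resources, is an assumption; findClass is used here as the usual extension point, and the same idea applies when loadClass itself is overridden):

```java
import java.io.IOException;
import java.io.InputStream;

// Minimal custom class loader: obtain the raw class bytes, then let
// defineClass parse and verify them into an InstanceKlass.
public class BytesClassLoader extends ClassLoader {
    @Override
    protected Class<?> findClass(String name) throws ClassNotFoundException {
        String path = "/" + name.replace('.', '/') + ".class";
        try (InputStream in = getClass().getResourceAsStream(path)) {
            if (in == null) {
                throw new ClassNotFoundException(name);
            }
            byte[] bytes = in.readAllBytes();
            return defineClass(name, bytes, 0, bytes.length);
        } catch (IOException e) {
            throw new ClassNotFoundException(name, e);
        }
    }
}
```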
Suppose the class path contains three jars: foo.jar, bar.jar, and baz.jar. When loading the class com.foo.Foo, most class loader implementations (including URLClassLoader, Tomcat, and spring-boot) choose the simplest strategy (premature optimization is the root of all evil): try to extract the file com/foo/Foo.class from each jar, one by one, in the order the jars appear on disk. JAR packages are stored in the zip format, so every class load has to walk the JARs on the class path and attempt to extract a single file from each zip until the class is found. With N JAR packages, one class load touches N/2 zip files on average. In one of our real-world scenarios N reaches 2,000; at that point the JAR search cost is very large, far greater than the cost of resolving InstanceKlass, and AppCDS alone cannot cope with such scenarios.

JAR Index

According to the JAR file specification, a JAR file is a zip archive that stores its metadata as text under the META-INF directory. The format was designed with the search scenario above in mind; the mechanism is called JAR Index. Suppose we want to look up a class across bar.jar, baz.jar, and foo.jar. If the type name com.foo.Foo immediately tells us which jar it lives in, we can avoid the scanning overhead described above. Running jar -i produces a META-INF/INDEX.LIST roughly like this:

JarIndex-Version: 1.0

foo.jar
com/foo

bar.jar
com/bar

baz.jar
com/baz

From this index, a package prefix maps directly to a jar:

com/bar --> bar.jar
com/baz --> baz.jar
com/foo --> foo.jar

The JAR Index technique seems to solve our problem, but it is very old and hard to use in modern applications: jar -i generates the index from the Class-Path attribute in META-INF/MANIFEST.MF, which modern projects rarely maintain, and only URLClassLoader supports JAR Index.

2. Early class initialization

Executing the code in a class's static block is called class initialization. After a class is loaded, this initialization code must run before the class can be used (to create instances or call static methods). For many classes, initialization essentially just constructs a few static fields:
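As a hedged sketch in the spirit of the boxed-type cache discussed next (this is not the actual JDK source), the class initializer below does nothing but build a static table, which is exactly the kind of work a pre-initialized class could skip at startup:

```java
// Illustrative small-integer cache: all of the initialization work is a static table.
public class SmallIntCache {
    static final int LOW = -128;
    static final int HIGH = 127;
    static final Integer[] CACHE = new Integer[HIGH - LOW + 1];

    static {
        for (int i = 0; i < CACHE.length; i++) {
            CACHE[i] = LOW + i;   // autoboxed; the table is filled once at class init
        }
    }

    static Integer of(int value) {
        if (value >= LOW && value <= HIGH) {
            return CACHE[value - LOW];
        }
        return value;
    }
}
```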
We know that the JDK caches a commonly used range of values for the boxed types to avoid creating the same objects over and over, so this data has to be constructed up front. Since class initializers run only once, they are executed purely in the interpreter. If we could persist these static fields and skip calling the class initializer, we would get a pre-initialized class and reduce startup time. The most efficient way to load persisted data into memory is memory mapping:
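As a hedged sketch of what memory mapping looks like from Java (the archive file name is an assumption; FileChannel.map is backed by mmap on Linux):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Maps a file directly into the process address space instead of copying it.
public class MapArchive {
    public static void main(String[] args) throws IOException {
        try (FileChannel channel =
                     FileChannel.open(Path.of("app.jsa"), StandardOpenOption.READ)) {
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            System.out.println("mapped " + buffer.capacity() + " bytes");
        }
    }
}
```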
C operates on memory almost directly, while a high-level language such as Java abstracts memory into objects carrying meta-information such as the mark word and the Klass* pointer, and these change from run to run. More sophisticated strategies are therefore needed to persist objects efficiently.

Introduction to Heap Archive

OpenJDK 9 introduced the Heap Archive capability, and it was put to official use in OpenJDK 12. As the name suggests, Heap Archive persists objects on the heap: the object graph is built in advance and placed into the archive. We call that stage dump, and using the data in the archive is called runtime. Dump and runtime are usually different processes, though in some scenarios they can be the same one.

Recall the memory layout when AppCDS is in use: an object's Klass* pointer points into the Shared Archive. AppCDS persists the InstanceKlass metadata, and if a persisted object is to be reused, the type pointer in its header must likewise point to persisted metadata. Heap Archive therefore depends on AppCDS. To fit different scenarios, OpenJDK's Heap Archive provides two levels, Open and Closed, with the following allowed reference relationships: the Closed Archive must not reference objects in the Open Archive or in the ordinary heap; objects in the Closed Archive may be referenced from elsewhere; and objects in the Closed Archive are read-only, never written. Why read-only? Imagine an object A in the Closed Archive referencing an object B in the heap: when B moves, the GC would have to fix the field in A that points to B, which would add GC overhead.

Use Heap Archive to initialize classes in advance

With this structure in place, once a class is loaded its static variables can simply be pointed at the archived objects to complete class initialization.
3. AOT compilation

Besides class loading, the first executions of a method are not JIT-compiled; the bytecode runs in interpreted mode. As analyzed in the first half of this article, interpreted execution runs at only a small fraction of JIT-compiled speed, so slow interpreted execution is another major culprit behind slow startup. Traditional languages such as C/C++ compile directly to native machine code for the target platform. As awareness of the startup and warm-up problems of interpreter/JIT languages such as Java and JS has grown, compiling bytecode directly to native code ahead of time (AOT) has come into the public eye. wasm, GraalVM, and OpenJDK all support AOT compilation to varying degrees. Here we focus on optimizing startup speed with the jaotc tool introduced by JEP 295. Note the terminology used here.

AOT feature first experience

With JEP 295 we can quickly try out AOT. The jaotc command invokes the Graal compiler to compile the bytecode and produce a libHelloWorld.so file. The .so file can easily mislead people into thinking the compiled library is invoked directly, as with JNI. In fact the ld loading mechanism is not fully used to run this code; the .so file is more like a container for native code. After loading the AOT .so, the HotSpot runtime still has to perform further dynamic linking. Once a class is loaded, HotSpot automatically wires up the AOT code entry and uses the AOT version for subsequent calls to the method. AOT-generated code also interacts actively with the HotSpot runtime, jumping between AOT, interpreted, and JIT-compiled code.

1) The twists and turns of AOT

It would seem that JEP 295 delivered a complete AOT system, so why has the technology never been used at scale? Among OpenJDK's new features, AOT has had a troubled fate.

2) Multiple class loader issues

JDK-8206963: bug with multiple class loaders. The design did not account for Java's multi-class-loader scenario: when classes with the same name loaded by different class loaders use AOT, their static fields are shared, whereas by the design of the Java language this data should be separate. Since there was no quick fix, OpenJDK simply added a restriction:
AOT is not allowed for user-defined class loaders. From this we can see that the feature gradually lost maintenance at the community level. Classes on the class path can still use AOT, but common frameworks such as spring-boot and Tomcat load application code through custom class loaders, so this change cuts off a large share of AOT's use cases.

3) Lack of tuning and maintenance, relegated to an experimental feature

JDK-8227439: Turn off AOT by default. "JEP 295 AOT is still experimental, and while it can be useful for startup/warmup when used with custom generated archives tailored for the application, experimental data suggests that generating shared libraries at a module level has overall negative impact to startup, dubious efficacy for warmup and severe static footprint implications." From then on, enabling AOT requires the experimental flag: java -XX:+UnlockExperimentalVMOptions -XX:AOTLibrary=... The Java language itself is simply too complex, and runtime mechanisms such as dynamic class loading keep AOT code from running as fast as one might expect.

4) Deleted in JDK 16

JDK-8255616: Disable AOT and Graal in Oracle OpenJDK. On the eve of the OpenJDK 16 release, Oracle officially decided to stop maintaining this technology: "We haven't seen much use of these features, and the effort required to support and enhance them is significant." The fundamental reason is that the technology lacked the necessary optimization and maintenance. As for future AOT plans, we can only speculate from scattered remarks that there are two technical directions: doing AOT based on OpenJDK's C2, or supporting the full Java language on GraalVM's native-image, with users who need AOT gradually migrating from OpenJDK to native-image.

5) Fast startup on Dragonwell

Dragonwell's fast startup feature addresses the weaknesses of AppCDS and AOT compilation and adds an early class initialization feature built on the Heap Archive mechanism. Together these features eliminate almost all of the JVM-visible portion of application startup time. In addition, because all of these technologies follow a trace-dump-replay usage model, Dragonwell unifies their workflows and integrates them into the SAE product.

4. SAE x Dragonwell: Best Practices for Serverless with Java Startup Acceleration

Good ingredients still need the right seasoning and a skilled chef. Dragonwell's startup acceleration technology and Serverless technology, known for its elasticity, complement each other, and together they are applied across the full life cycle management of microservice applications to shorten end-to-end startup time. That is why Dragonwell chose SAE as the place to land its startup acceleration technology. SAE (Serverless Application Engine) is the first application-oriented Serverless PaaS platform.
It can, for example, deploy plain Java software packages, letting applications enjoy microservice capabilities with zero code changes and lower R&D costs.

1. Difficulty analysis

Through analysis, we found that microservice users face several difficulties at application startup: large software packages, reaching hundreds of MB or even GB, and many dependencies, with hundreds of dependent packages and thousands of classes. SAE's Java environment + JAR/WAR package deployment integrates Dragonwell 11 to provide an accelerated startup environment.

2. Acceleration effect

We selected some typical microservice demos and internal applications with complex dependencies to test the startup effect, and found that startup time generally drops by 5% to 45%. The acceleration is significant when the application's startup has the following characteristics: many classes are loaded (spring-petclinic loads about 12,000+ classes at startup) and little reliance on external data.

3. Customer cases

Alibaba search and recommendation Serverless platform. Alibaba's internal search and recommendation Serverless platform uses a class loading isolation mechanism to deploy multiple businesses inside the same Java virtual machine. The scheduling system deploys business code to idle containers on demand, letting multiple businesses share the same resource pool and greatly improving deployment density and overall CPU utilization. To support a large number of different business workloads, the platform itself has to provide rich functionality, such as caching and RPC calls, so each JVM of the platform must bring up a middleware isolation container similar to Pandora Boot, which loads a large number of classes and slows down the platform's own startup. When a burst of demand arrives, the scheduling system needs to bring up more containers for business code, and the container's own startup time becomes critical. Based on Dragonwell's fast startup technology, the search and recommendation platform performs optimizations such as AppCDS and JarIndex in the pre-release environment and embeds the generated archive files into the container image, so every container is accelerated when it starts, cutting startup time by about 30%.

Fashion brand flash sales on SAE's extreme elasticity. An external customer, using SAE's JAR package deployment and Dragonwell 11, quickly iterated and launched a trendy shopping mall app. For big promotions and flash sales, SAE Serverless's extreme elasticity and scaling on application metrics such as QPS and RT easily cover more than tenfold rapid scale-out; at the same time, Dragonwell's enhanced AppCDS startup acceleration can be switched on with one click, reducing Java application startup time by more than 20%, further speeding up startup and keeping the business running smoothly.

5. Conclusion

Dragonwell's fast startup technology is built entirely on the work of the OpenJDK community. It adds careful optimizations and bug fixes to the individual features and lowers the barrier to using them, which both preserves compatibility with the standard, avoiding private customization, and gives back to the open source community. As basic software, Dragonwell by itself can only generate and consume archive files on local disk.
Combined with SAE's seamless integration of Dragonwell, JVM configuration and archive file distribution are automated, so customers can easily enjoy the technical benefits of application startup acceleration.