Slow-running tasks fall into the following categories: queuing due to insufficient resources (usually in annual or monthly subscription projects), data skew, and logic problems.

1. Insufficient resources

Generally, SQL tasks occupy CPU and memory resources. For more information, see How to view logview.

1.1 View the duration and execution phase of a job

1.2 Waiting for task submission

If "Job Queueing..." is displayed after you submit a task, other people's tasks may have occupied the resources of the resource group, so your task is queued. In SubStatusHistory, "Waiting for scheduling" shows the waiting time.

1.3 Insufficient resources after task submission

In another situation, the task is submitted successfully, but it requires so many resources that the current resource group cannot start all instances at the same time. The task makes progress, but slowly. You can observe this with the latency chart in logview, which opens when you click the corresponding task in the detail view.

For a task with sufficient resources, the lower ends of the blue bars in the chart are flat, indicating that all instances start at almost the same time. If the lower ends instead rise like a staircase, the instances are being scheduled a few at a time, and there are not enough resources to run the task. If the task is important, consider adding resources or raising its priority.

1.4 Reasons for insufficient resources

1. Check whether the CUs are full in CU Manager: click the corresponding task point, find the corresponding time, and view the jobs submitted then, sorted by CPU usage.
(1) If a single task occupies a large share of the CUs, find that task and check its logview to learn why (too many small files, or a data volume that really requires that many resources).
(2) If CU usage is spread evenly, multiple large tasks were submitted at the same time and together fully occupy the CU resources.
2. Too many small files fill up the CUs. The parallelism of the map stage depends on the split size of the input files, which indirectly controls the number of workers in each map stage. The default is 256 MB. A file smaller than that is read as a block of its own. In one such case, the I/O bytes of each instance of map task M1 are only 1 MB or tens of KB, so more than 2,500 parallel instances fill up the resources at once. This indicates that the table contains too many files and that the small files need to be merged.
Merge small files: https://help.aliyun.com/knowledge_detail/150531.html?spm=a2c4g.11186623.6.1198.60ea4560Hr5H8d#section-5nj-hoa-d7f
3. A large amount of data fills up the resources. You can purchase more resources. For a temporary job, you can add the parameter set odps.task.quota.preference.tag=payasyougo; so that the specified job temporarily runs in the large pay-as-you-go resource pool.

1.5 How to adjust task parallelism

MaxCompute infers parallelism automatically from the input data and task complexity, and it generally does not need to be adjusted. Ideally, the greater the parallelism, the faster the processing. However, an annual or monthly subscription resource group may be full, in which case tasks wait for resources and slow down.

Map phase parallelism
odps.stage.mapper.split.size: modifies the input data volume of each Map worker, that is, the split size of the input file, thereby indirectly controlling the number of workers in each Map stage. Unit: MB. Default: 256.

Reduce phase parallelism
odps.stage.reducer.num: changes the number of workers in each Reduce stage.
odps.stage.num: modifies the concurrency of all workers under the specified MaxCompute task.
Its priority is lower than that of the odps.stage.mapper.split.size, odps.stage.reducer.mem, and odps.stage.joiner.num properties.
odps.stage.joiner.num: changes the number of workers in each Join stage.

2. Data skew

[Feature] Most instances in a task have ended, but a few have not (a long tail). For example, most instances (358) have ended, but 18 are still in the Running state. These instances run slowly, either because they process a lot of data or because specific data is slow to process.

Solution: https://help.aliyun.com/document_detail/102614.html?spm=a2c4g.11186623.6.1160.28c978569uyE9f

3. Logic problems

Here the user's SQL or UDF logic is inefficient, or the parameter settings are not optimal. The symptom is that the task runs for a very long time while the running time of each instance is relatively uniform. The cases vary: some logic is genuinely complex, and some leaves a lot of room for optimization.

Data inflation

[Feature] The task's output data is much larger than its input data. For example, 1 GB of data becomes 1 TB after processing. If a single instance has to process 1 TB of data, efficiency drops sharply. The input and output data volumes appear in the task's I/O Record and I/O Bytes items.

Solution: Confirm that the business logic really requires this, and increase the parallelism of the corresponding stage.

UDF execution efficiency is low

[Feature] The task runs inefficiently and contains user-defined extensions. It may even report a UDF execution timeout error: "Fuxi job failed - WorkerRestart errCode:252,errMsg:kInstanceMonitorTimeout, usually caused by bad udf performance".

First, locate the UDF. Click the slow Fuxi task and check whether its operator graph contains the UDF.
For example, the operator graph may show a Java UDF. You can check the operator's running speed in the stdout of the Fuxi instance in logview. Normally, the speed (records/s) is in the hundreds of thousands or millions.

Solution: Check the UDF logic and use built-in functions where possible.

Original link: http://click.aliyun.com/m/1000283552/
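
As a rough illustration of the small-files point in section 1.4 and the split size parameter in section 1.5, the sketch below estimates how many Map workers a set of input files would need. It assumes a deliberately simplified model taken from the description above: every file produces at least one split, and larger files are cut into pieces of the split size. The function name estimate_map_workers and the sample file sizes are hypothetical, and MaxCompute's real planner may behave differently.

```python
import math

# Default of odps.stage.mapper.split.size as stated in the article (MB).
DEFAULT_SPLIT_SIZE_MB = 256


def estimate_map_workers(file_sizes_mb, split_size_mb=DEFAULT_SPLIT_SIZE_MB):
    """Estimate Map workers under a simplified model (an assumption, not
    MaxCompute's actual planner): each file yields at least one split, and
    files larger than split_size_mb are cut into split_size_mb pieces."""
    if split_size_mb <= 0:
        raise ValueError("split_size_mb must be positive")
    return sum(max(1, math.ceil(size / split_size_mb)) for size in file_sizes_mb)


# Hypothetical table stored as 2,500 files of 1 MB each.
small_files = [1] * 2500
# The same 2,500 MB stored as one merged file.
merged_file = [2500]

print(estimate_map_workers(small_files))                    # 2500
print(estimate_map_workers(merged_file))                    # 10
print(estimate_map_workers(merged_file, split_size_mb=64))  # 40
```

Under this model, merging the small files cuts the estimated worker count from 2,500 to 10 at the default split size, while lowering the split size to 64 MB raises parallelism for the merged file to 40.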