Have you ever encountered these headache-inducing scenarios?

- I was woken up by a call from operations at 3am: "Online service response is extremely slow!"
- A big promotional event suddenly crashed: "CPU 100%, the server can't hold up!"
- Received a user complaint: "Why does the system become slower and slower as time goes by?"
As a server-side engineer, these are challenges we must face and solve. But don't worry! Through this practical guide, you will learn:

- Practical tips for quickly locating performance bottlenecks
- Practical experience in dealing with high-concurrency scenarios
- Essential tools for system tuning and troubleshooting
- Solutions to common problems such as memory leaks
Let's start practicing and turn each skill into your "killer skill"!

Log analysis tips: Safety first!

Hello everyone! Today I'm going to teach you a super important tip - "take your temperature" before touching your log!

Let's first take a look at how heavy the log file is:

```shell
$ ls -lh /var/log/nginx/access.log
# ls options:
#   -l : long format, show detailed info 📋
#   -h : human-readable file sizes 👀
-rw-r--r-- 1 nginx nginx 6.5M Mar 20 15:00 access.log  # Wow, this log is a bit of a heavyweight! 🏋️
```

Output explanation:

- -rw-r--r--: File permissions (read and write permissions)
- nginx nginx: File owner and group
- 6.5M: File size (displayed in human-readable format)
- Mar 20 15:00: Last modified time
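If you only care about the size, there are a couple of quicker checks than reading the full `ls` listing. This is a small sketch: the sample file below is created just so the commands have something to measure, and note that on macOS `stat` uses `-f` instead of the GNU `-c` shown here.

```shell
# Create a sample log so the commands below have something to measure
# (the path and contents are illustrative, not from a real server)
printf '192.168.1.100 GET /api/users 200\n' > /tmp/demo-access.log

# du -h: size on disk, human-readable
du -h /tmp/demo-access.log

# stat gives the exact byte count without the rest of the ls columns
stat -c '%s bytes' /tmp/demo-access.log
```

`du` reports allocated blocks while `stat` reports the exact byte length, so the two numbers usually differ for small files.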
Why do this? Because...

- Cat-ing a large file is like trying to eat an elephant in one bite
- The server will be exhausted and gasping for breath
- It may prevent other friends from accessing the website.
If you find that the log file is too large, we have a little trick:

```shell
# Move the elephant somewhere else and eat it slowly 🚚
$ scp /var/log/nginx/access.log test-server:/tmp/
# scp command: secure copy over SSH 🔒
#   /var/log/nginx/access.log : source file to copy 📄
#   test-server               : target server (hostname or IP) 🖥️
#   /tmp/                     : target path on that server 📁
```

Detailed explanation of scp command parameters:

- -r: copy an entire directory and its contents
- -P: Specify the SSH port number (uppercase P)
- -i: Use the specified private key file
- -v: Display detailed transfer process
- -p: Keep the modification time and permissions of the original file
Example of use:

```shell
# Copy a file using a specific port
$ scp -P 2222 access.log test-server:/tmp/            # use port 2222 🔌

# Use a private key file
$ scp -i ~/.ssh/id_rsa access.log test-server:/tmp/   # specify the private key 🔑

# Copy an entire directory
$ scp -r /var/log/nginx/ test-server:/backup/         # copy the whole directory 📂

# Preserve file attributes
$ scp -p access.log test-server:/tmp/                 # keep timestamps and permissions ⏰
```

Tips: Things to note when using scp

- Make sure the target server has enough disk space
- Check if the network connection is stable
- Pay attention to file permission settings
- When transferring large files, it is recommended to use the -C parameter to compress the transfer.
Want to sneak a peek at the last few lines of a log? Try this:

```shell
$ tail -n 5 access.log
# tail command: view the end of a file 📌
#   -n 5       : number of lines to show (5 here) 📏
#   access.log : the log file to inspect 📄
192.168.1.100 GET /api/users 200   # success! 🎉
192.168.1.101 POST /api/login 401  # oops, a failed login 😅
# ... more access records ...
```

Detailed explanation of tail command parameters:

- -n: Specify the number of lines to display
- -f: Real-time monitoring of file changes (follow mode)
- -F: Similar to -f, but will retry after the file is deleted
- -q: Do not display the file name header
- -v: Display detailed file name header
Example of use:

```shell
# Show the last 10 lines (the default)
$ tail access.log                  # the 10 newest records 📜

# Monitor log updates in real time
$ tail -f access.log               # watch it unfold like a movie 🎬

# Monitor several files at once
$ tail -f access.log error.log     # multi-file monitoring 👥

# Show the last 100 bytes of the file
$ tail -c 100 access.log           # view by bytes 📊
```

Tips: tail command usage tips

- When using -f monitoring, press Ctrl+C to exit
- Use grep to filter specific content
- You can use -n +1 to display the file from the beginning.
- It is recommended to use tail instead of cat for large files
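The tips above combine naturally: use tail to limit how much of the file you read, then pipe into grep to keep only the lines you care about. A small sketch, using a sample log created on the spot (entries are illustrative, not real traffic):

```shell
# Sample log so the pipeline has something to chew on
cat > /tmp/demo-access.log <<'EOF'
192.168.1.100 GET /api/users 200
192.168.1.101 POST /api/login 401
192.168.1.100 GET /api/orders 500
EOF

# Read only the tail of the file, then keep just the 500 errors
tail -n 100 /tmp/demo-access.log | grep ' 500$'
# → 192.168.1.100 GET /api/orders 500
```

The same pattern works live: `tail -f access.log | grep ' 500'` streams only the server errors as they happen.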
Remember: be gentle with your log, and your log will be gentle with you!

Let's see who the most active visitors are:

```shell
$ cat access.log | awk '{print $1}' | sort | uniq -c | sort -nr | head -3
    156 192.168.1.100   # now that's a loyal user! 🥇
     89 192.168.1.101   # silver medal isn't bad either! 🥈
     67 192.168.1.102   # bronze medalist, keep it up! 🥉
```

Now let's take a peek at the server's little secrets:

```shell
$ top
  PID USER   PR NI    VIRT    RES   SHR S %CPU %MEM  TIME+  COMMAND
 1234 root   20  0  985148  65404 31220 S 25.0  3.2  12:34  nginx 🏃
 5678 mysql  20  0 1258216 326892  8216 S 15.0  8.1   5:47  mysql 🎲
 9012 redis  20  0  162672  29684  4508 S  5.0  0.7   2:45  redis 🔄
# Look! Every process is busy working! 💪
```

Isn't the output more interesting this way? There is a little story behind each number waiting for you to discover!

Running top is like taking your server's temperature:

- PID: the process's ID number
- CPU%: the process's "body temperature"
- MEM%: the process's "appetite" for memory
- COMMAND: the process's "name"
Tip: Memorizing these commands is as fun as collecting Pokémon!

- Each command has its own special skills
- Combination is more powerful
- Practice makes perfect, practice more!
See! Isn't it easier to understand what each command does now? Keep it up - you're already a budding ops expert!

Response time analysis: from ordering to serving

Let's analyze response times with some professional tools:

```shell
# Test a site's response time
$ curl -w "\n⏱️ Total time: %{time_total}s\n" -s -o /dev/null https://example.com
⏱️ Total time: 0.235s

# Detailed load testing with ApacheBench
$ ab -n 100 -c 10 https://example.com/
# Results 📊:
#   Average response: 0.389s ⚡
#   Success rate: 98% ✅
#   Failed requests: 2 ⚠️
```

Error log analysis: Find and resolve problems

When something goes wrong in the system, error logs are our best helper. Let's learn some practical log analysis commands:

```shell
# Look for errors in the log
$ grep ERROR /var/log/app.log
# grep command: search for text in files 🔎
#   ERROR            : the keyword to search for 🔍
#   /var/log/app.log : the log file to search 📄
[ERROR] 2024-03-20 15:00:23 database connection timeout ⚠️
[ERROR] 2024-03-20 15:00:25 out of memory 💥
```

Detailed explanation of grep command parameters:

- -i: Ignore case
- -n: Display line numbers
- -r: recursively search directories
- -v: Display non-matching lines
- -c: only display the number of matching lines
Example of use:

```shell
# Show line numbers
$ grep -n ERROR /var/log/app.log   # know which line each error is on 📑

# Case-insensitive search
$ grep -i error /var/log/app.log   # matches ERROR, error, Error... 🔤

# Recursively search all log files
$ grep -r ERROR /var/log/          # search the whole log directory 📂

# Count the errors
$ grep -c ERROR /var/log/app.log   # show only the number of matches 🔢
```

Let's see how to count error types:

```shell
$ grep ERROR /var/log/app.log | awk '{print $4}' | sort | uniq -c | sort -nr
# Pipeline, step by step:
#   grep ERROR : filter out the error lines 🔍
#   awk        : extract column 4 (the error type) ✂️
#   sort       : sort so duplicates are adjacent 📋
#   uniq -c    : count occurrences 🎯
#   sort -nr   : most frequent first 📊
     15 database-timeout   # the most common error
      8 out-of-memory
      3 network-error
```

Tip: Best practices for analyzing error logs

- Check error logs regularly to detect problems early
- Use the -C parameter of grep to view the error context
- Analyze the error occurrence pattern based on timestamp
- Create error type statistics report to find common problems
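The last two tips can be sketched with plain grep and awk. The sample log below is made up for illustration; the hour-bucketing assumes timestamps in the `HH:MM:SS` form shown earlier:

```shell
# Sample app log (timestamps and messages are illustrative)
cat > /tmp/demo-app.log <<'EOF'
[INFO] 2024-03-20 14:59:58 request started
[ERROR] 2024-03-20 15:00:23 database connection timeout
[INFO] 2024-03-20 15:00:24 retrying
[ERROR] 2024-03-20 15:00:25 out of memory
EOF

# -C 1: show one line of context before and after each match
grep -C 1 ERROR /tmp/demo-app.log

# Count errors per hour by slicing the HH out of the HH:MM:SS timestamp
grep ERROR /tmp/demo-app.log | awk '{split($3, t, ":"); print t[1]}' | sort | uniq -c
```

The per-hour counts make spikes obvious: if nearly all errors cluster in one hour, look at what else happened then (a deploy, a cron job, a traffic surge).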
Error log analysis flow chart: get the logs 📄 --> filter the errors 🔍 --> analyze the cause 🤔 --> fix the problem ✅
Interview points: Log analysis skills

The questions interviewers ask most often:

- How do you quickly locate performance issues?
```shell
# Combine several tools
$ dstat -cdngy 1     # monitor system resources in real time 📊
# dstat options:
#   -c : CPU statistics 💻
#   -d : disk statistics 💾
#   -n : network statistics 🌐
#   -g : paging statistics 📑
#   -y : system statistics (interrupts, context switches) 📈
#   1  : refresh every second ⏱️

$ iotop              # monitor disk I/O 💾
  PID  USER   IO>  DISK READ  DISK WRITE  COMMAND
 1234  mysql  2.1   50.2 M/s    10.1 M/s  mysqld
 5678  nginx  0.8    2.1 M/s     1.2 M/s  nginx

$ netstat -antp      # inspect network connections 🌐
# netstat options:
#   -a : show all connections 🌍
#   -n : show numeric addresses instead of hostnames 🔢
#   -t : TCP connections only 🔌
#   -p : show the owning process 👥
```

- How do you handle very large log files?
```shell
# Efficient ways to analyze big (even compressed) logs
$ zcat large.log.gz | grep ERROR | tail -n 100
# Pipeline, step by step:
#   zcat        : read the gzip-compressed file without unpacking it to disk 📦
#   grep ERROR  : filter for the ERROR keyword 🔍
#   tail -n 100 : keep only the last 100 matches 📌

$ awk '/ERROR/ {print $4}' large.log | sort | uniq -c
# awk does the filtering and column extraction in one pass 🛠️:
#   /ERROR/        : the pattern to match 🔍
#   {print $4}     : the action - print column 4 🎬
#   sort | uniq -c : then sort and count the types 🔢
```

Performance analysis tool comparison:

- dstat ➡️ overall system picture 📊: CPU usage 💻, disk I/O 💾, network traffic 🌐, memory usage 🧠
- iotop ➡️ disk I/O details 💾: read speed 📥, write speed 📤, per-process breakdown 👥
- netstat ➡️ network connection status 🌐: TCP/UDP connections 🔌, port usage 🚪, owning processes 👥

Tip: Performance analysis best practices

- First use dstat to obtain the overall system status
- Use iotop to conduct in-depth analysis when I/O anomalies are found
- Use netstat to troubleshoot network problems
- Pay attention to collecting enough sample data
- Establish benchmark data for comparative analysis
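The last two tips deserve a habit: record samples on a fixed interval and keep them somewhere dated, so "is this normal?" has an answer. A minimal sketch of such a collector (tool availability and output formats vary by distro; `uptime` is used here only because it is nearly universal):

```shell
# Collect a few load samples into a dated baseline file
outfile=/tmp/baseline-$(date +%F).log
: > "$outfile"                        # start fresh for this run
for i in 1 2 3; do
    date '+%H:%M:%S' >> "$outfile"    # timestamp each sample
    uptime           >> "$outfile"    # 1/5/15-minute load averages
    sleep 1
done
wc -l "$outfile"                      # 3 samples x 2 lines each
```

On real systems you would run something like this from cron and capture richer data (`vmstat`, `dstat --output`), but the principle is the same: comparable samples, kept over time.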
Common performance issues and solutions:

(1) High CPU usage

- Use top to find high-load processes
- Analyze whether the process has an infinite loop
- Consider adding more CPU cores or optimizing the code
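For scripts or quick one-offs, a non-interactive snapshot is often handier than top. A sketch using the Linux/procps form of ps (BSD/macOS ps takes different options):

```shell
# List the five heaviest CPU consumers, highest first
# (%CPU here is lifetime average per process, not the instant reading top shows)
ps -eo pid,pcpu,comm --sort=-pcpu | head -n 5
```

If one PID dominates, follow up with `top -H -p <pid>` to see whether a single thread is spinning, which is the classic infinite-loop signature.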
(2) Disk I/O bottleneck

- Use iotop to monitor disk reads and writes
- Check whether there are a large number of small file operations
- Consider using SSD or optimizing storage strategy
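The "many small files" check from the second bullet can be done with find. A sketch - the demo directory below is created just for illustration; point the find at your real data directory:

```shell
# A pile of tiny files usually means metadata-heavy, seek-bound I/O
mkdir -p /tmp/demo-logs
printf 'a' > /tmp/demo-logs/one.log
printf 'b' > /tmp/demo-logs/two.log

# Count regular files smaller than 4 KB
find /tmp/demo-logs -type f -size -4k | wc -l
```

A count in the tens of thousands is a hint to batch writes, archive old files into tarballs, or switch to append-only logs rather than one-file-per-event.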
(3) High network latency

- Use netstat to check the connection status
- Analyze whether network packets are lost
- Consider optimizing network configuration or increasing bandwidth
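Two quick checks worth knowing here. `ss` ships with iproute2 and is the modern replacement for netstat on most Linux distributions; the ping target is just a loopback sanity baseline, not a real latency test:

```shell
# Tally TCP connections per state - a pile-up in SYN-SENT or CLOSE-WAIT
# usually points at network or application trouble
ss -ant | awk 'NR > 1 {print $1}' | sort | uniq -c | sort -nr

# Round-trip sanity check against loopback before blaming the wider network
ping -c 3 127.0.0.1 | grep 'packet loss'
```

For real latency analysis, run the same ping (or `mtr`) against the remote peer and compare the loss and round-trip figures against your baseline.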