Online troubleshooting guide: The ultimate way to bring your server back to life

Have you ever encountered these headache-inducing scenarios?

  • I was woken up by a call from operations at 3am: "Online service response is extremely slow!"
  • A big promotional event suddenly crashed: "CPU 100%, the server can't hold up!"
  • Received a user complaint: "Why does the system become slower and slower as time goes by?"

As a server-side engineer, these are challenges we must face and solve. But don't worry!

Through this practical guide, you will learn:

  • Practical tips for quickly locating performance bottlenecks
  • Practical experience in dealing with high-concurrency scenarios
  • Essential tools for system tuning and troubleshooting
  • Solutions to common problems such as memory leaks

Let's get hands-on and turn each of these skills into your own "secret weapon"!

Log analysis tips: Safety first!

Hello everyone! Today I'm going to share a super important habit: "take the temperature" of a log file before you touch it!

Let's first take a look at how heavy the log file is:

 $ ls -lh /var/log/nginx/access.log
 # -l: long format, show detailed info 📋
 # -h: human-readable sizes, easier on the eyes 👀
 -rw-r--r-- 1 nginx nginx 6.5M Mar 20 15:00 access.log  # Whoa! This log is a bit of a heavyweight! 🏋️

Output explanation:

  • -rw-r--r-- : File permissions (owner can read and write; group and others can only read)
  • nginx nginx: File owner and group
  • 6.5M: File size (displayed in human-readable format)
  • Mar 20 15:00: Last modified time

Why do this? Because...

  • Cat-ing large files is like eating an elephant in one go.
  • The server will be exhausted and gasping for breath
  • It may prevent other friends from accessing the website.
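Before reaching for cat, you can size up the file and nibble at it instead; a minimal sketch using a throwaway file (the /tmp path and its contents are invented for the demo):

```shell
# Create a small throwaway "log" so the commands below have something to chew on
printf 'line %s\n' 1 2 3 4 5 6 7 8 9 10 > /tmp/demo_access.log

# How big is it? Count the lines instead of printing them all
wc -l < /tmp/demo_access.log        # -> 10

# Peek at just the first 3 lines instead of cat-ing everything
head -n 3 /tmp/demo_access.log
```

The same commands work on a multi-gigabyte log without making the server gasp, because neither wc nor head loads the whole file into your terminal.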

If you find that the log file is too large, we have a little trick:

 # Move the elephant somewhere else and eat it slowly 🚚
 $ scp /var/log/nginx/access.log test-server:/tmp/
 # scp: secure copy, copies files over SSH 🔒
 # /var/log/nginx/access.log: source file, the log to copy 📄
 # test-server: target server, a hostname or IP address 🖥️
 # /tmp/: target path, where the file will land 📁

Detailed explanation of scp command parameters:

  • -r: copy the entire directory and its contents
  • -P: Specify the SSH port number (uppercase P)
  • -i: Use the specified private key file
  • -v: Display detailed transfer process
  • -p: Keep the modification time and permissions of the original file

Example of use:

 # Copy using a specific port
 $ scp -P 2222 access.log test-server:/tmp/            # use port 2222 🔌
 # Use a private key file
 $ scp -i ~/.ssh/id_rsa access.log test-server:/tmp/   # specify the key 🔑
 # Copy an entire directory
 $ scp -r /var/log/nginx/ test-server:/backup/         # whole directory 📂
 # Preserve file attributes
 $ scp -p access.log test-server:/tmp/                 # keep timestamps and permissions ⏰

Tips: Things to note when using scp

  • Make sure the target server has enough disk space
  • Check if the network connection is stable
  • Pay attention to file permission settings
  • When transferring large files, it is recommended to use the -C parameter to compress the transfer.
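Wondering how much -C actually buys you? scp -C applies gzip-style compression on the wire, so a quick local experiment with a made-up, highly repetitive log (paths and contents invented for the demo) gives a feel for the potential savings:

```shell
# Fabricate a repetitive "access log" - real access logs compress similarly well
for i in $(seq 1 2000); do
  echo "192.168.1.100 GET /api/users 200"
done > /tmp/demo_big.log

# Compress a copy, leaving the original in place
gzip -c /tmp/demo_big.log > /tmp/demo_big.log.gz

# Compare raw vs compressed sizes
ls -l /tmp/demo_big.log /tmp/demo_big.log.gz
```

On text as repetitive as a typical access log, the compressed copy is a small fraction of the original, which is exactly the bandwidth -C saves you during the transfer.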

Want to sneak a peek at the last few lines of a log? Try this:

 $ tail -n 5 access.log
 # tail: view the end of a file 📌
 # -n 5: number of lines to show (5 here) 📏
 192.168.1.100 GET /api/users 200   # success! 🎉
 192.168.1.101 POST /api/login 401  # oops, login failed 😅
 # ... more access records ...

Detailed explanation of tail command parameters:

  • -n: Specify the number of lines to display
  • -f: Real-time monitoring of file changes (follow mode)
  • -F: Similar to -f, but will retry after the file is deleted
  • -q: Do not display the file name header
  • -v: Display detailed file name header

Example of use:

 # Show the last 10 lines (the default)
 $ tail access.log                  # the 10 newest records 📜
 # Follow the log in real time
 $ tail -f access.log               # watch it like a movie 🎬
 # Monitor several files at once
 $ tail -f access.log error.log     # multi-file monitoring 👥
 # Show the last 100 bytes of the file
 $ tail -c 100 access.log           # view by bytes 📊

Tips: tail command usage tips

  • When using -f monitoring, press Ctrl+C to exit
  • Use grep to filter specific content
  • You can use -n +1 to display the file from the beginning.
  • It is recommended to use tail instead of cat for large files
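Putting the tail and grep tips together, here is a small sketch that pulls only the failures out of the newest entries (the sample file and its contents are invented for the demo):

```shell
# Build a tiny sample access log
cat > /tmp/demo_api.log <<'EOF'
192.168.1.100 GET /api/users 200
192.168.1.101 POST /api/login 401
192.168.1.102 GET /api/orders 200
192.168.1.103 POST /api/login 500
EOF

# Last 3 lines, keeping only the non-200 responses
# (shows just the 401 and 500 requests)
tail -n 3 /tmp/demo_api.log | grep -v ' 200$'
```

Swap tail -n 3 for tail -f and the same pipeline becomes a live feed of failing requests.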

Remember: be gentle with your logs, and your logs will be gentle with you!

Let's see who the most active visitors are:

 $ cat access.log | awk '{print $1}' | sort | uniq -c | sort -nr | head -3
 156 192.168.1.100   # now that is one loyal user! 🥇
  89 192.168.1.101   # silver medal is not bad either! 🥈
  67 192.168.1.102   # bronze medalist, keep it up! 🥉

Let's take a peek at the server's little secrets:

 $ top
 PID   USER   PR  NI  VIRT     RES     SHR    S  %CPU  %MEM  TIME+   COMMAND
 1234  root   20  0   985148   65404   31220  S  25.0  3.2   12:34   nginx 🏃
 5678  mysql  20  0   1258216  326892  8216   S  15.0  8.1   5:07    mysqld 🎲
 9012  redis  20  0   162672   29684   4508   S  5.0   0.7   2:45    redis 🔄
 # Look! Every process is busy working! 💪

Isn’t it more interesting to look at the output this way? There is a little story behind each number waiting for you to discover!

Check the server status:

 $ top

It's like taking your server's temperature:

  • PID: the process's ID number
  • %CPU: the process's "body temperature"
  • %MEM: how much memory the process is "eating"
  • COMMAND: the process's "name"
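If top's full-screen view is too much, ps can read the same vitals in a single shot; a minimal sketch, assuming GNU/Linux procps ps (the --sort option is Linux-specific):

```shell
# One-shot snapshot of the vitals top shows:
# PID, CPU "temperature", memory appetite, and the process name,
# sorted hottest-first
ps -eo pid,pcpu,pmem,comm --sort=-pcpu | head -n 5
```

Because ps exits immediately, this form is handy inside scripts and cron jobs where an interactive top would hang.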

Tip: Memorizing these commands is as fun as collecting Pokémon!

  • Each command has special skills
  • Combination is more powerful
  • Practice makes perfect, practice more!

See! Isn't it easier to understand what each command does now? Keep it up, you are already a little O&M expert!

Response time analysis: from ordering to serving

Let's analyze the response time with a professional tool:

 # Test website response time
 $ curl -w "\n⏱️ Total time: %{time_total}s\n" -s -o /dev/null https://example.com
 ⏱️ Total time: 0.235s

 # Detailed performance analysis
 $ ab -n 100 -c 10 https://example.com/
 Test results 📊:
 - Average response: 0.389s ⚡
 - Success rate: 98% ✅
 - Errors: 2 ⚠️

Error log analysis: Find and resolve problems

When a problem occurs in the system, error logs are our good helper. Let's learn some practical log analysis commands:

 # Check the error log
 $ grep ERROR /var/log/app.log
 # grep: search for text inside files 🔎
 # ERROR: the keyword to search for 🔍
 # /var/log/app.log: the log file to search 📄
 [ERROR] 2024-03-20 15:00:23 Database connection timeout ⚠️
 [ERROR] 2024-03-20 15:00:25 Out of memory 💥

Detailed explanation of grep command parameters:

  • -i: Ignore case
  • -n: Display line numbers
  • -r: recursively search directories
  • -v: Display non-matching lines
  • -c: only display the number of matching lines

Example of use:

 # Show line numbers
 $ grep -n ERROR /var/log/app.log   # know which line the error is on 📑
 # Case-insensitive search
 $ grep -i error /var/log/app.log   # matches ERROR, error, and so on 🔤
 # Recursively search all log files
 $ grep -r ERROR /var/log/          # search the whole log directory 📂
 # Count the errors
 $ grep -c ERROR /var/log/app.log   # show only the number of matches 🔢

Let's see how to count error types:

 $ grep ERROR /var/log/app.log | awk '{print $4}' | sort | uniq -c | sort -nr
 # grep: filter out the error lines 🔍
 # awk: extract column 4, the error type ✂️
 # sort: group identical types together 📋
 # uniq -c: deduplicate and count occurrences 🎯
 # sort -nr: most frequent first 📊
 15 Database timeout   # the most common error
  8 Out of memory
  3 Network error

Tip: Best practices for analyzing error logs

  • Check error logs regularly to detect problems early
  • Use the -C parameter of grep to view the error context
  • Analyze the error occurrence pattern based on timestamp
  • Create error type statistics report to find common problems
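To make the -C tip concrete, here is a sketch against a small invented log file, showing the lines surrounding each error:

```shell
# Build a tiny sample application log
cat > /tmp/demo_app.log <<'EOF'
[INFO]  2024-03-20 15:00:22 request received
[ERROR] 2024-03-20 15:00:23 database connection timeout
[INFO]  2024-03-20 15:00:24 retrying connection
EOF

# Show each ERROR with 1 line of context before and after,
# so you can see what the app was doing around the failure
grep -C 1 ERROR /tmp/demo_app.log
```

The context lines often reveal the trigger (the request that arrived just before) and the aftermath (a retry or a crash) without opening the whole file.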

Error log analysis flow chart:

Get the logs 📄 --> Filter the errors 🔍 --> Analyze the cause 🤔 --> Fix the problem ✅

Interview points: Log analysis skills

Most popular questions asked by interviewers:

  • How do you quickly locate performance issues?

 # Combine several tools
 $ dstat -cdngy 1      # monitor system resources in real time 📊
 # -c: CPU statistics 💻
 # -d: disk statistics 💾
 # -n: network statistics 🌐
 # -g: paging statistics 📑
 # -y: system statistics 📈
 # 1: refresh every second ⏱️

 $ iotop               # monitor disk I/O 💾
 PID   USER   IO>   DISK READ  DISK WRITE  COMMAND
 1234  mysql  2.1   50.2 M/s   10.1 M/s    mysqld
 5678  nginx  0.8   2.1 M/s    1.2 M/s     nginx

 $ netstat -antp       # inspect network connections 🌐
 # -a: show all connections 🌍
 # -n: numeric addresses instead of hostnames 🔢
 # -t: TCP connections only 🔌
 # -p: show the owning process 👥
  • How do you handle huge log files?

 # Use efficient log-analysis pipelines
 $ zcat large.log.gz | grep ERROR | tail -n 100
 # zcat: read the compressed file without unpacking it 📦
 # grep: filter for the ERROR keyword 🔍
 # tail -n 100: show only the last 100 matches 📏

 $ awk '/ERROR/ {print $4}' large.log | sort | uniq -c
 # awk '/ERROR/': match lines containing ERROR 🔍
 # {print $4}: print column 4 ✂️
 # sort | uniq -c: sort, deduplicate, and count 🔢

Performance analysis tool comparison chart:

 dstat   ➡️ overall system health 📊
 ├── CPU usage 💻
 ├── disk I/O 💾
 ├── network traffic 🌐
 └── memory usage 🧠

 iotop   ➡️ disk I/O details 💾
 ├── read speed 📥
 ├── write speed 📤
 └── process info 👥

 netstat ➡️ network connection state 🌐
 ├── TCP/UDP connections 🔌
 ├── port usage 🚪
 └── process info 👥

Tip: Performance analysis best practices

  • First use dstat to obtain the overall system status
  • Use iotop to conduct in-depth analysis when I/O anomalies are found
  • Use netstat to troubleshoot network problems
  • Pay attention to collecting enough sample data
  • Establish benchmark data for comparative analysis

Common performance issues and solutions:

(1) High CPU usage

  • Use top to find high load processes
  • Analyze whether the process has an infinite loop
  • Consider adding more CPU cores or optimizing the code
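The first step above is easy to script; a hedged sketch, again assuming GNU/Linux procps ps (the 50% threshold is an arbitrary example value, not a standard):

```shell
# Flag processes currently burning more than 50% of a CPU core
# (tune the threshold to your own alerting needs)
ps -eo pid,pcpu,comm --no-headers | awk '$2 > 50 {print $1, $2, $3}'
```

A process that stays on this list across several samples, with %CPU pinned near 100, is a prime suspect for an infinite loop.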

(2) Disk I/O bottleneck

  • Use iotop to monitor disk reads and writes
  • Check whether there are a large number of small file operations
  • Consider using SSD or optimizing storage strategy
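Checking for the "lots of tiny files" pattern is also easy to script; a sketch against a throwaway directory (the path and file count are invented for the demo):

```shell
# Fabricate a directory full of small files
rm -rf /tmp/demo_small && mkdir -p /tmp/demo_small
for i in $(seq 1 50); do echo x > "/tmp/demo_small/file_$i.txt"; done

# Count files under 4 KB - a high count hints at small-file I/O overhead
find /tmp/demo_small -type f -size -4k | wc -l    # -> 50
```

Point the same find at your real data directory; thousands of sub-4 KB files often mean the workload would benefit from batching writes or archiving them into larger files.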

(3) High network latency

  • Use netstat to check the connection status
  • Analyze whether network packets are lost
  • Consider optimizing network configuration or increasing bandwidth
