Six good habits and 23 lessons that Linux operation and maintenance must know to avoid pitfalls!

I have worked in operations and maintenance for three and a half years and have run into all kinds of problems: data loss, websites infected with trojans, accidentally deleted database files, hacker attacks, and more.

Today I'll organize these lessons briefly and share them with you.


1. Online operation specifications

1. Test use

When I was learning Linux, everything from the basics to services to clusters was done on virtual machines. The teacher told us there was no real difference from a physical machine, but my desire for a real environment kept growing, and the endless VM snapshots had bred bad habits, so once I got access to a real server I couldn't wait to try things. I remember my first day at work: the boss gave me the root password, and since only PuTTY was available but I wanted to use Xshell, I quietly logged in and tried to switch to Xshell with key-based login. I didn't test the change first and didn't keep an existing SSH session open, so after restarting sshd I locked myself out of the server. Fortunately, I had backed up sshd_config beforehand, and the data-center staff copied it back for me. Fortunately it was also a small company; otherwise I would have been fired on the spot. I was lucky that time.
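One way to avoid locking yourself out is to schedule an automatic rollback before applying a risky change, and cancel it only after the new configuration is confirmed working from a fresh login. A minimal sketch of the pattern, using a throwaway file in /tmp as a stand-in for the real /etc/ssh/sshd_config:

```shell
# Stand-in path for the demo; on a real box this would be /etc/ssh/sshd_config
cfg=/tmp/demo_sshd_config
echo "Port 22" > "$cfg"

cp "$cfg" "$cfg.bak"                 # 1. back up before touching anything

# 2. schedule an automatic rollback; cancel it only after the new
#    config is confirmed working from a SECOND, fresh login
( sleep 120 && cp "$cfg.bak" "$cfg" ) &
rollback_pid=$!

echo "Port 2222" > "$cfg"            # 3. apply the risky change
# ... on a real server: sshd -t to syntax-check, restart sshd,
#     then open a NEW connection while keeping this session alive ...

kill "$rollback_pid"                 # 4. verified -- cancel the rollback
```

If the new config had broken login, doing nothing for two minutes would have restored the backup automatically.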

The second example is about file synchronization. Everyone knows rsync is fast, but it can delete files even faster than rm -rf: when rsync runs with --delete, the destination is made to mirror the source, so if the source directory is empty (or the two directories are swapped by mistake), the data in the destination is wiped out. Because of a mis-typed directory and a lack of testing, that is exactly what I did. The key point is that there was no backup... the production data was gone.
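The danger is easy to rehearse safely with --dry-run (-n), which prints what rsync would do without touching anything. The paths below are throwaway demo directories:

```shell
# Throwaway demo directories: all the data lives in dst, src is EMPTY
mkdir -p /tmp/rsync_demo/src /tmp/rsync_demo/dst
touch /tmp/rsync_demo/dst/important.db

# --delete makes dst mirror src, so an empty src would wipe dst.
# Rehearse with --dry-run (-n): rsync prints "deleting important.db"
# but touches nothing.
rsync -av --delete --dry-run /tmp/rsync_demo/src/ /tmp/rsync_demo/dst/

test -f /tmp/rsync_demo/dst/important.db && echo "data intact after dry run"
```

Only after the dry-run output shows exactly the changes you expect should the command be run for real.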

Without a backup, you can imagine the consequences yourself. The importance of testing first is self-evident.

2. Double-check before pressing Enter

Mistakes like rm -rf / var (note the stray space before var) tend to happen when you are typing in a hurry or when a laggy connection garbles what you see.

The moment you realize what you have just executed, your heart sinks.

You may say, "I've typed it countless times without a mistake, don't scare me." All I can say is: once it happens, you will understand. Don't assume ops accidents only happen to other people; if you are careless, you may be next.

3. Avoid multiple people operating

Operations management at my previous company was quite chaotic. A typical example: several ops staff who had already left the company still had the servers' root password.

Usually when we received a ticket we would do a quick check and ask others for help if we couldn't solve it. But when a problem became urgent, the customer-service manager (who knew a little Linux), the network admin, and the boss would all debug the same server at once. While you're searching Baidu and comparing settings, you discover the configuration file is no longer the way you left it; you change it back, search Google, and excitedly find the problem solved, only to hear that someone else has also "solved" it by changing a different parameter. At that point nobody knows what the real root cause was. That case at least ends happily: the problem is gone and everyone is pleased. But have you ever modified a file, found the test still failing, gone back to edit it again, and discovered that someone else had changed it in the meantime? It's infuriating. Avoid having multiple people operate on the same machine.

4. Back up first and then operate

Develop the habit of backing up data before modifying it, for example .conf configuration files.

Also, when modifying a configuration file, comment out the original option, then copy it below and edit the copy, so the old value stays visible.

Furthermore, if there had been a database backup in the earlier example, the rsync accident would have been fixed quickly.

Losing a database outright never comes down to one bad day, and it is far less tragic if you have even one recent backup on hand.
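The back-up-then-comment-out habit can be sketched like this; the file path and option names are purely illustrative:

```shell
# Demo file standing in for a real config such as nginx.conf
mkdir -p /tmp/demo
cfg=/tmp/demo/nginx.conf
echo "worker_processes 2;" > "$cfg"

# 1. timestamped backup before any edit
cp -a "$cfg" "$cfg.$(date +%Y%m%d_%H%M%S).bak"

# 2. comment out the old value and add the new one beside it,
#    so the original setting stays visible in the file (GNU sed)
sed -i 's/^worker_processes 2;/#worker_processes 2;\nworker_processes 4;/' "$cfg"
cat "$cfg"
```

The timestamp in the backup name means repeated edits never overwrite an earlier backup.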

2. Data involved

1. Use rm -rf with caution

There are plenty of examples online: rm -rf / disasters, dropped master databases, and other ops accidents.

A small slip can mean a huge loss. If you really must delete something, check it two or three times first.
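Two habits that blunt rm -rf: make rm prompt by default, and prefer moving files into a trash directory instead of deleting outright. A sketch (the trash location here is arbitrary):

```shell
# In ~/.bashrc, make rm ask once before deleting >3 files or recursing:
#   alias rm='rm -I'

# Safer still: move instead of delete, and empty the trash later
trash() {
    mkdir -p /tmp/.trash
    mv -- "$@" /tmp/.trash/
}

touch /tmp/junk.txt
trash /tmp/junk.txt      # recoverable from /tmp/.trash until emptied
```

A "deleted" file that sat in a trash directory for a day would have saved more than one of the accidents described above.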

2. Backup is everything

Backups of various kinds have come up above, but I am filing the point under the data category to emphasize it again: backups matter enormously.

I remember my teacher once said that when it comes to data, you can never be too cautious.

The company I work for runs a third-party payment website and an online lending platform.

The payment platform takes a full backup every two hours; the lending platform backs up every 20 minutes.
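As a rough illustration only (the paths, user, and database names below are made up, not the company's actual setup), schedules like those could be driven from cron:

```
# crontab -e  (illustrative entries; % must be escaped as \% in crontab)
# Full dump of the payment DB every 2 hours
0 */2 * * *  mysqldump --single-transaction -u backup paydb | gzip > /backup/paydb_$(date +\%F_\%H\%M).sql.gz
# Dump of the lending DB every 20 minutes
*/20 * * * * mysqldump --single-transaction -u backup loandb | gzip > /backup/loandb_$(date +\%F_\%H\%M).sql.gz
```

Dumps should of course land on a different machine or volume than the database itself.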

I won't say more; draw your own conclusions.

3. Stability is everything

In fact, it is not just data, but the entire server environment, where stability is above all else. We do not seek the fastest, but the most stable and available.

Therefore, don't deploy new software on a server without testing it first. Take nginx + php-fpm: in production, php-fpm processes can hang. If that happens, just restart php-fpm, or switch to Apache instead.

4. Confidentiality is paramount

These days the news is full of leaked private photos and router backdoors, so when it comes to data, confidentiality is non-negotiable.

3. Safety

1. SSH

  • Change the default port (though if a professional really wants in, a port scan will find it anyway)
  • Disable root login
  • Log in as an ordinary user with key authentication plus sudo rules, restricted by IP address and user
  • Use brute-force protection along the lines of DenyHosts (after a few failed attempts, the source IP is blocked outright)
  • Audit which users in /etc/passwd can actually log in
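Put together, the list above might look like the following sshd_config excerpt. This is an illustrative sketch, not a drop-in config (the user name, port, and network are made up), and lesson one applies: test it with an existing session still open.

```
# /etc/ssh/sshd_config (excerpt)
Port 2222                     # non-default port: slows scanners, nothing more
PermitRootLogin no            # log in as a normal user, then sudo
PasswordAuthentication no     # key authentication only
PubkeyAuthentication yes
AllowUsers deploy@10.0.0.*    # restrict both the user and the source network
MaxAuthTries 3
```

Run `sshd -t` to syntax-check before restarting the daemon.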

2. Firewall

In production the firewall must be enabled and must follow the principle of least privilege: drop everything by default, then open only the ports each service requires.
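A default-drop policy of that shape could look like the ruleset below, in iptables-restore format. The ports are illustrative (SSH moved to 2222 per the earlier section, plus a web service):

```
# default drop, then open only what is needed
*filter
:INPUT DROP [0:0]
:FORWARD DROP [0:0]
:OUTPUT ACCEPT [0:0]
-A INPUT -i lo -j ACCEPT
-A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
-A INPUT -p tcp --dport 2222 -j ACCEPT
-A INPUT -p tcp --dport 80 -j ACCEPT
COMMIT
```

Apply with `iptables-restore` from a console session, not over SSH, in case a rule locks you out.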

3. Fine-grained permission control

Services that can run as an ordinary user should never run as root. Reduce every service's privileges to the minimum, and keep the granularity of control fine.

4. Intrusion detection and log monitoring

Use third-party tools to continuously watch for changes to key system files and service configuration files,

for example /etc/passwd, /etc/my.cnf, /etc/httpd/conf/httpd.conf, and so on.

Use a centralized log-monitoring system to watch /var/log/secure, /var/log/messages, FTP upload/download alerts, error logs, and the like.
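A homegrown version of such file-watching can be as simple as baseline checksums re-verified from cron. The sketch below uses throwaway copies in /tmp rather than the real system files:

```shell
# Throwaway stand-ins for the real watched files
mkdir -p /tmp/fim_demo
echo "root:x:0:0" > /tmp/fim_demo/passwd
echo "[mysqld]"   > /tmp/fim_demo/my.cnf

# 1. record a baseline of checksums
sha256sum /tmp/fim_demo/passwd /tmp/fim_demo/my.cnf > /tmp/fim_demo/baseline

echo "evil:x:0:0" >> /tmp/fim_demo/passwd      # simulated tampering

# 2. a cron job would re-verify and alert on any mismatch
if ! sha256sum -c --quiet /tmp/fim_demo/baseline 2>/dev/null; then
    echo "ALERT: watched file changed" | tee /tmp/fim_demo/alert
fi
```

Dedicated tools such as AIDE or Tripwire do the same thing with a tamper-resistant database, which matters once an attacker has root.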

In addition, for port scans you can use third-party software and add scanning hosts to /etc/hosts.deny as soon as they are detected. This information is also very helpful for post-incident forensics. Someone once said that a company's security spending is proportional to what it would lose to a security breach. Security is a big topic.

It is also very basic work. Get the basics right and system security improves greatly; the rest is for the security specialists.

4. Daily Monitoring

1. System operation monitoring

Many people start their ops careers doing monitoring, and large companies generally have dedicated 24-hour monitoring staff. System monitoring mostly covers hardware utilization:

memory, hard disk, CPU, and network card, plus OS-level items such as login monitoring and watching key system files.

Long-term monitoring lets you predict the likelihood of hardware failure and provides very practical input for tuning.
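Before reaching for a full monitoring stack, the same numbers can be read straight from the OS. A minimal snapshot sketch (output path is arbitrary):

```shell
# Quick snapshot of memory, root-filesystem usage, and load average
{
    awk '/^MemTotal|^MemAvailable/' /proc/meminfo
    df -P / | awk 'NR==2 {print "rootfs used: " $5}'
    echo "loadavg: $(cut -d' ' -f1-3 /proc/loadavg)"
} > /tmp/sys_snapshot.txt
cat /tmp/sys_snapshot.txt
```

Run from cron and appended to a file, even this crude record is enough to spot trends.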

2. Service operation monitoring

Service monitoring covers the various applications — web, db, lvs, and so on — and generally tracks a few key metrics, so that when a performance bottleneck appears it can be found and resolved quickly.

3. Log monitoring

Log monitoring here is similar to the security log monitoring above, but typically covers hardware, OS, and application errors and warnings.

Monitoring feels useless while the system runs smoothly, but the moment something goes wrong, you will be on the back foot without it.

5. Performance Tuning

1. Deepen your understanding of the operating mechanism

Honestly, with my limited ops experience, my take on tuning is still mostly theory; I'll summarize it briefly here and update as my understanding deepens. Before optimizing a piece of software, you need a deep understanding of how it works. Take nginx and apache: everyone says nginx is fast, but you should know why it is fast, what mechanism it relies on, and how it handles requests better than apache. You should be able to explain it in plain words and, when necessary, read the source. Otherwise, any document that treats parameters as the object of tuning is nonsense.

2. Tuning framework and sequence

Once you are familiar with the underlying mechanism, you need a framework and an order for tuning. For example, when a database bottleneck appears, many people go straight for the database configuration file. My suggestion: first analyze the bottleneck, check the logs, write down the tuning direction, and only then start. Database tuning should come last; begin with the hardware and the operating system. Today's database servers are released only after extensive testing on all mainstream operating systems, so the database itself should not be the first place you look.

3. Adjust only one parameter at a time

Adjust only one parameter at a time — everyone knows this in principle. Change too many at once and you will lose track of which one mattered.

4. Benchmarking

To determine whether tuning helped, and to verify the stability and performance of a new software version, benchmarking is essential. Testing involves many factors, and whether a test approximates the real business workload depends on the tester's experience. For background, the third edition of "High Performance MySQL" is quite good.

My teacher once said there are no universal parameters: any parameter change or tuning must fit the business scenario.

So don't just Google for "tuning tips" and apply them blindly — they will do nothing lasting for you or for your environment.

6. Operation and maintenance mentality

1. Control your mindset

Many an rm -rf /data happens in the last few minutes before quitting time, when irritability peaks. So shouldn't you learn to manage your state of mind?

Some people say you have to work even when you are upset, but at least try to avoid touching critical data environments while you are.

The more pressure you are under, the calmer you should be, otherwise you will lose more.

Most ops people have lived through an rm -rf /data/mysql moment; you can imagine how it feels afterwards. But without a backup, panic accomplishes nothing. In that situation, calm down and think through the worst case. For MySQL, even after the physical files are deleted, the running mysqld process still holds them open, so disconnect the application traffic but do not shut down the MySQL process — that is crucial for recovery. You can also image the disk with dd and recover from the copy.

Of course, much of the time you will end up calling a data-recovery company anyway.

Now imagine the opposite: the data is deleted, you panic and run every command you can think of, shut down the database, then try to repair it. Not only may the deleted blocks be overwritten, but the still-open files you could have recovered from are gone too.
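The reason to keep mysqld running is that Linux does not free a deleted file while some process still holds it open, so the data stays reachable through /proc/&lt;pid&gt;/fd. A toy demonstration with an ordinary file (the shell holding fd 3 plays the part of mysqld here):

```shell
# While a process still holds the deleted file open, the data is recoverable
echo "precious data" > /tmp/mysql_demo.ibd
exec 3< /tmp/mysql_demo.ibd        # a reader keeps an fd open (mysqld's role)
rm /tmp/mysql_demo.ibd             # "accidental" deletion

# The inode stays alive under /proc/<pid>/fd as long as fd 3 is open
cp "/proc/$$/fd/3" /tmp/recovered.ibd
exec 3<&-                          # only now is it safe to drop the fd
cat /tmp/recovered.ibd             # -> precious data
```

The moment the last open descriptor closes, the kernel frees the blocks and this route is gone — which is exactly why shutting down mysqld in a panic makes things worse.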

2. Responsibility for data

The production environment is not a joke, and neither is the database. You must be responsible for your data. The consequences of not backing up are very serious.

3. Get to the bottom of things

Many ops people are so busy that once a problem is "solved" they never look back. I remember a customer's website last year that kept going down. The PHP errors showed that the session and whos_online tables were corrupted. The previous operator had fixed it with a repair, so I repaired it the same way — and a few hours later it broke again.

After the third or fourth round, I finally went digging for why database tables get corrupted for no apparent reason: either a MyISAM bug, a MySQL bug, or mysqld being killed mid-write.

It turned out that insufficient memory was triggering the OOM killer, which killed the mysqld process. There was no swap partition, and the monitoring dashboard claimed memory was sufficient. The problem was finally solved by adding physical memory.
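The OOM killer leaves its fingerprints in the kernel log, so that is the first place to look when a daemon vanishes without trace. The grep below runs against a fabricated sample line so the example is self-contained:

```shell
# On a real box: dmesg -T | grep -iE 'out of memory|oom-kill'
# (or search /var/log/messages). Self-contained demo with a sample line:
cat > /tmp/messages.sample <<'EOF'
Jan 10 03:12:44 web1 kernel: Out of memory: Kill process 2147 (mysqld) score 812
EOF
grep -iE 'out of memory|oom-kill' /tmp/messages.sample
```

A match names the killed process and, on recent kernels, the memory state at the time — exactly the root cause that took several repair cycles to find above.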

4. Test and production environments

Before performing any important operation, double-check which machine you are logged into, and try not to keep many terminal windows open at once.
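A prompt that announces user and host in every window makes the wrong-machine mistake much harder; colouring production prompts is a common extra (the red background here is just one choice):

```shell
# In ~/.bashrc: show user@host and the current directory in the prompt
export PS1='[\u@\h \W]\$ '
# On production machines, make it impossible to miss:
#   export PS1='[\[\e[41m\]PROD \u@\h\[\e[0m\] \W]\$ '
```

One glance at the prompt then tells you whether this window is the test box or the box that pays the bills.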
