Programming Practice: How to parse domain names in the program


This article is reprinted from the WeChat public account "Xiaocai Learns Programming", written by fasionchan. Please contact Xiaocai Learns Programming public account for reprinting this article.

Since domain names are easier to remember than IP addresses, we usually use them to access network services.

If a web application client wants to communicate with the server, it must first query the DNS server for the IP address corresponding to the domain name. For example, when a reader visits my website fasionchan.com, the browser needs to first query the website's IP address based on the domain name, and then communicate with the website's web server.

So how can a program perform a domain name query? Anyone developing network applications has to deal with this question.

We know that DNS servers and clients communicate using the DNS protocol: the client first sends a request message to the server, and the server encapsulates the query result into a response message and replies to the client. DNS can use UDP or TCP as the transport layer protocol, and the communication port number is 53.

Assuming the client uses the UDP protocol, the steps for a domain name query are as follows:

  1. Create a UDP socket;
  2. Build the DNS request message, placing the domain name to be queried in the question section;
  3. Send the request message to the DNS server through the UDP socket (the server port is usually 53);
  4. Wait for the server to respond and read the reply message from the UDP socket;
  5. Parse the response message to obtain the query result;
  6. Close the UDP socket.
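
The steps above can be sketched roughly as follows. This is a minimal illustration, not how any particular resolver is implemented: build_dns_query and exchange_dns_udp are hypothetical helper names, the query asks only for an A record, the 2-second receive timeout is an arbitrary choice, and parsing the reply (step 5) is left out.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Step 2: encode a query for an A record into buf.
 * Returns the message length, or 0 on error (hypothetical helper). */
size_t build_dns_query(const char *name, uint16_t id,
                       unsigned char *buf, size_t buflen) {
    assert(name != NULL && buf != NULL);
    size_t namelen = strlen(name);
    /* header (12) + encoded name (at most namelen + 2) + QTYPE/QCLASS (4) */
    if (namelen == 0 || namelen > 253 || buflen < 12 + namelen + 2 + 4)
        return 0;

    memset(buf, 0, 12);
    buf[0] = (unsigned char)(id >> 8);   /* transaction ID, high byte */
    buf[1] = (unsigned char)(id & 0xff); /* transaction ID, low byte */
    buf[2] = 0x01;                       /* RD flag: recursion desired */
    buf[5] = 1;                          /* QDCOUNT = 1 */

    size_t pos = 12;
    const char *label = name;
    while (*label != '\0') {
        /* each label is written as a length byte followed by its bytes */
        const char *dot = strchr(label, '.');
        size_t len = dot ? (size_t)(dot - label) : strlen(label);
        if (len == 0 || len > 63)
            return 0;
        buf[pos++] = (unsigned char)len;
        memcpy(buf + pos, label, len);
        pos += len;
        label += len;
        if (*label == '.')
            label++;
    }
    buf[pos++] = 0;                 /* root label ends the name */
    buf[pos++] = 0; buf[pos++] = 1; /* QTYPE = A */
    buf[pos++] = 0; buf[pos++] = 1; /* QCLASS = IN */
    return pos;
}

/* Steps 1, 3, 4 and 6: send the query to server_ip:53 and read the raw
 * reply. Returns the reply length, or -1 on error (hypothetical helper). */
ssize_t exchange_dns_udp(const char *server_ip,
                         const unsigned char *query, size_t qlen,
                         unsigned char *reply, size_t replylen) {
    struct sockaddr_in server;
    memset(&server, 0, sizeof(server));
    server.sin_family = AF_INET;
    server.sin_port = htons(53);
    if (inet_pton(AF_INET, server_ip, &server.sin_addr) != 1)
        return -1;

    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    /* do not wait forever if the server never answers */
    struct timeval tv = { .tv_sec = 2, .tv_usec = 0 };
    setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));

    ssize_t n = -1;
    if (sendto(fd, query, qlen, 0, (struct sockaddr *)&server,
               sizeof(server)) == (ssize_t)qlen)
        n = recvfrom(fd, reply, replylen, 0, NULL, NULL);
    close(fd);
    return n;
}
```

A caller would first fill a buffer with build_dns_query, then pass it to exchange_dns_udp together with the address of a DNS server it is allowed to use, and finally decode the answer section of the reply.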

If every network application needs to encapsulate DNS messages to implement domain name query, it would be too troublesome! For this purpose, the C library provides a series of tool functions. The application only needs to call these tool functions to complete the domain name query without operating the socket or encapsulating the DNS message.

Sample Program

This program calls the C library function gethostbyname to query the domain name that the user passes as a command-line argument:

  1. #include <arpa/inet.h>
  2. #include <netdb.h>
  3. #include <stdio.h>
  4.
  5. int main(int argc, char *argv[]) {
  6.     if (argc != 2) {
  7.         fprintf(stderr, "bad arguments\n");
  8.         return -1;
  9.     }
  10.
  11.     char *name = argv[1];
  12.     printf("resolve domain name: %s\n", name);
  13.
  14.     struct hostent *result = gethostbyname(name);
  15.     if (result == NULL) {
  16.         if (h_errno == HOST_NOT_FOUND) {
  17.             fprintf(stderr, "Hostname not found!\n");
  18.         }
  19.
  20.         if (h_errno == NO_DATA) {
  21.             fprintf(stderr, "No such record\n");
  22.         }
  23.
  24.         if (h_errno == NO_RECOVERY) {
  25.             fprintf(stderr, "Unrecoverable name server error!\n");
  26.         }
  27.
  28.         if (h_errno == TRY_AGAIN) {
  29.             fprintf(stderr, "Temporary error occurred, please try again!\n");
  30.         }
  31.
  32.         return -1;
  33.     }
  34.
  35.     int i = 0;
  36.     while (result->h_addr_list[i] != NULL) {
  37.         printf("IP: %s\n", inet_ntoa(*(struct in_addr *)result->h_addr_list[i]));
  38.         i++;
  39.     }
  40.
  41.     return 0;
  42. }

As the name implies, gethostbyname looks up a host's address by its domain name. The result is usually an IPv4 address; IPv6 addresses are available through the variants that accept an address-family argument.

Please look at line 14 of the program, where the gethostbyname function is called with the domain name to be queried as a parameter; it returns a pointer to a hostent structure, which stores the domain name query result.

Lines 15-33 check the domain name resolution result. A NULL return value indicates an error, which is handled according to the value of h_errno (see below for details).

Lines 35-39 retrieve the query results from the hostent structure and print them to the screen.

So what does the gethostbyname library function do internally? The answer is not hard to guess: on our behalf, it creates a UDP socket, sends a DNS request message, and then receives and parses the reply message.

Domain name query library functions

In fact, the C library provides a series of utility functions for domain name query:

  • gethostbyname, which queries the specified domain name, saves the result in a hostent structure, and returns a pointer to it;
  • gethostbyname_r, the thread-safe version of gethostbyname, which can be used in a multi-threaded environment;
  • gethostbyname2, same as gethostbyname, but lets the caller specify the address type to query through the af parameter;
  • gethostbyname2_r, the thread-safe version of gethostbyname2, which can be used in a multi-threaded environment.
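
As a rough sketch of how the thread-safe variant differs in use, the helper below calls gethostbyname_r with caller-owned storage instead of relying on a shared static buffer. It assumes the glibc signature of gethostbyname_r; resolve_first_ipv4 is a hypothetical name, and the 4096-byte scratch buffer is an arbitrary size (a real program would grow the buffer and retry when the call reports ERANGE).

```c
#define _DEFAULT_SOURCE /* expose the glibc-specific gethostbyname_r */
#include <arpa/inet.h>
#include <netdb.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/* Resolve name and write its first IPv4 address to out as text.
 * Returns 0 on success, -1 on failure (hypothetical helper). */
int resolve_first_ipv4(const char *name, char *out, size_t outlen) {
    struct hostent he;              /* result structure owned by the caller */
    struct hostent *result = NULL;  /* set to &he on success */
    char scratch[4096];             /* storage for names and addresses */
    int err = 0;                    /* per-call replacement for h_errno */

    if (gethostbyname_r(name, &he, scratch, sizeof(scratch),
                        &result, &err) != 0
        || result == NULL
        || result->h_addrtype != AF_INET
        || result->h_addr_list[0] == NULL) {
        return -1;
    }

    /* copy the raw address out before converting it to text */
    struct in_addr addr;
    memcpy(&addr, result->h_addr_list[0], sizeof(addr));
    return inet_ntop(AF_INET, &addr, out, (socklen_t)outlen) != NULL ? 0 : -1;
}
```

Because every piece of state lives on the caller's stack, two threads can run this helper at the same time without overwriting each other's results.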

Take gethostbyname as an example. If the query is successful, it will return a hostent structure pointer, which stores the query result. If the query fails, it will return NULL and save the error in the h_errno global variable. Generally speaking, domain name query errors can be divided into the following situations:

  • HOST_NOT_FOUND, indicating that the specified host does not exist, that is, the domain name does not exist;
  • NO_DATA, indicating that there are other records for the domain name, but no address-related records (A or AAAA);
  • NO_RECOVERY, indicating that an unrecoverable error occurred in the domain name server;
  • TRY_AGAIN, indicating a temporary error that can be recovered by retrying.

When the domain name query fails, the caller must check the h_errno variable and handle it accordingly.
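
One way to keep that handling in a single place is a small lookup helper, sketched below. describe_resolver_error is a hypothetical name and the message strings are only illustrative; the four constants are the ones declared in netdb.h.

```c
#define _DEFAULT_SOURCE /* make sure netdb.h exposes the h_errno constants */
#include <netdb.h>

/* Map an h_errno value to a short human-readable description
 * (hypothetical helper; the wording of the messages is illustrative). */
const char *describe_resolver_error(int err) {
    switch (err) {
    case HOST_NOT_FOUND:
        return "host not found";
    case NO_DATA:
        return "name exists but has no address record";
    case NO_RECOVERY:
        return "unrecoverable name server error";
    case TRY_AGAIN:
        return "temporary failure, try again";
    default:
        return "unknown resolver error";
    }
}
```

After a failed gethostbyname call, a program could print describe_resolver_error(h_errno) and retry only when the value is TRY_AGAIN.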

Limitations

In application scenarios such as web crawlers and Socks5 proxies, domain name queries are very frequent. At this time, directly using the gethostbyname series of library functions is likely to encounter performance bottlenecks.

On the one hand, the gethostbyname library function creates a UDP socket to communicate with the DNS server every time it queries a domain name. Frequent domain name queries therefore create and destroy a large number of sockets, which adds considerable overhead.

On the other hand, the gethostbyname library function will block until the DNS server returns a result or the query times out, which will seriously restrict the concurrent processing capability of the system.

Therefore, in high-frequency query scenarios, calling library functions such as gethostbyname directly is not a good fit, and an optimized asynchronous domain name resolution library should be used instead.

Further reading

gethostbyname
