Learn more about BeautifulSoup Scraping

The previous article was the first part of a detailed explanation of the basics of the BeautifulSoup crawler. It covered the installation and introduction of BeautifulSoup, along with a quick look at how to locate tags and obtain their content. This article goes deeper into the detailed syntax of BeautifulSoup and its usage.

1. BeautifulSoup Object

BeautifulSoup converts a complex HTML document into a tree structure in which each node is a Python object. The official BeautifulSoup documentation groups all objects into four types: Tag, NavigableString, BeautifulSoup, and Comment.
The four objects are introduced in detail below.

Tag

The Tag object represents a tag in an XML or HTML document; put simply, it corresponds to a tag in the original document. Tag has many methods and properties. A tag is accessed as soup.tag_name, where tag_name is the name of an HTML tag such as a or title (for example, soup.a or soup.title). The result is the complete tag, including its attributes and content. If the document contains several tags with that name, only the first one is returned.

For example, in an HTML document with a title element and several p elements, title and p are tags, and the text between each start tag and end tag is that tag's content. soup.title obtains the title tag, and soup.p obtains the first p tag.
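As a minimal sketch of the tag access described above: the HTML below is a hypothetical sample (the article's original document is not reproduced here), and the built-in "html.parser" is assumed as the parser.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for illustration.
html = """
<html><head><title>BeautifulSoup Demo</title></head>
<body>
<p class="xiaodu" id="l1">First paragraph</p>
<p class="other">Second paragraph</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

title_tag = soup.title  # the <title> tag
first_p = soup.p        # only the first <p> tag in the document

print(title_tag)      # <title>BeautifulSoup Demo</title>
print(first_p)        # <p class="xiaodu" id="l1">First paragraph</p>
print(type(first_p))  # <class 'bs4.element.Tag'>
```

Attribute-style access such as soup.p is convenient for a quick look at a document, but it never reaches the second p tag.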
The most important attributes of a Tag are name and attrs.

The name attribute returns the tag name of a node in the document tree. To get the name of the title tag, use soup.title.name. For a tag inside the document, the value is the name of the tag itself.
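A short sketch of the name attribute, using a one-line hypothetical document and the built-in "html.parser":

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for illustration.
html = "<html><head><title>Demo</title></head><body><p>Hi</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

title_name = soup.title.name
p_name = soup.p.name

print(title_name)  # title
print(p_name)      # p
```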
A tag may also carry attributes. Suppose the first p tag has two: a class attribute with the value "xiaodu" and an id attribute with the value "l1". Tag attributes are accessed in the same way as a Python dictionary, and soup.p.attrs returns a dictionary of all the attribute names and values of the first paragraph p. Because class can hold several values in HTML, BeautifulSoup returns its value as a list.
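The dictionary-style access described above might look like the following sketch. The p tag is a hypothetical sample built from the attribute values mentioned in the text, and the changes made at the end are arbitrary illustrations.

```python
from bs4 import BeautifulSoup

# Hypothetical sample tag with the two attributes mentioned in the text.
html = '<p class="xiaodu" id="l1">First paragraph</p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

original_attrs = dict(p.attrs)  # copy, so later changes do not affect it
print(original_attrs)           # {'class': ['xiaodu'], 'id': 'l1'}
print(p["id"])                  # l1
print(p.get("href"))            # None, the attribute does not exist

# The same dictionary interface is used to change attributes.
p["id"] = "l2"       # modify an existing attribute
p["title"] = "demo"  # add a new attribute
del p["class"]       # delete an attribute

print(p)  # <p id="l2" title="demo">First paragraph</p>
```

Using p.get("href") instead of p["href"] avoids a KeyError when the attribute may be missing.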
Each tag in BeautifulSoup may have many attributes, which can be obtained through ".attrs". The attributes of a tag can be modified, deleted, or added.

NavigableString

NavigableString is also called a traversable string. Strings are often contained in tags, and BeautifulSoup uses the NavigableString class to wrap the string inside a tag. A NavigableString behaves like a Unicode string in Python, and it also supports some of the features described in traversing and searching the document tree. Calling type() on a tag's .string shows the class, bs4.element.NavigableString.
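A minimal sketch of the type check described above, assuming a hypothetical one-tag document:

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for illustration.
soup = BeautifulSoup("<title>BeautifulSoup Demo</title>", "html.parser")

text = soup.title.string

print(type(text))             # <class 'bs4.element.NavigableString'>
print(text)                   # BeautifulSoup Demo
print(isinstance(text, str))  # True, it is a subclass of str
print(text.parent.name)       # title, the string still knows its place in the tree
```

If the text needs to outlive the parsed document, converting it with str(text) gives a plain Python string with no link back to the tree.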
BeautifulSoup

The BeautifulSoup object represents the entire content of a document. It can usually be treated as a Tag object, and it supports most of the methods described in traversing the document tree and searching the document tree. Calling type(soup) reports the BeautifulSoup object type, bs4.BeautifulSoup.
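The following sketch checks the type of the soup object and shows that it can be used like a Tag. The one-line document is a hypothetical sample.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample document, used only for illustration.
soup = BeautifulSoup("<p>Hello</p>", "html.parser")

print(type(soup))             # <class 'bs4.BeautifulSoup'>
print(isinstance(soup, Tag))  # True, BeautifulSoup is a subclass of Tag

# Tag-style searching works directly on the soup object.
first_text = soup.find("p").string
print(first_text)  # Hello
```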
Because the BeautifulSoup object is not a real HTML or XML tag, it has no name or attributes of its own. It is sometimes convenient to check its .name property, however, so the BeautifulSoup object has a special soup.name whose value is [document].

Comment

The Comment object is a special type of NavigableString that holds the comments in a document. Reading the .string of a tag whose only child is a comment returns a Comment object containing the comment text, without the <!-- and --> delimiters.
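The two points above, the special name of the soup object and the Comment type, can be checked with a sketch like this one. The comment text is a hypothetical sample.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical sample: a <p> tag whose only child is a comment.
html = "<p><!-- a hidden note --></p>"
soup = BeautifulSoup(html, "html.parser")

print(soup.name)  # [document]

comment = soup.p.string
print(type(comment))                 # <class 'bs4.element.Comment'>
print(isinstance(comment, Comment))  # True
print(comment.strip())               # a hidden note
```

Because .string returns comments and ordinary text alike, an isinstance(..., Comment) check is one way to avoid treating a comment as visible page text.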
2. Traverse the document tree

With the four objects covered, the rest of the article explains traversing the document tree, searching the document tree, and the commonly used functions of BeautifulSoup. In BeautifulSoup, a tag may contain multiple strings or other tags, which are called the child nodes of that tag. The explanation continues with the same kind of HTML document as before.
Beautiful Soup provides many properties for operating on and traversing child nodes. For example, the .contents property returns the child nodes of a tag as a list, and .children iterates over them one by one.
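A sketch of reading the child nodes of a tag, assuming a hypothetical list written without whitespace between tags so that every child is a tag:

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for illustration.
soup = BeautifulSoup("<ul><li>one</li><li>two</li></ul>", "html.parser")
ul = soup.ul

# .contents is a list of the direct children.
print(ul.contents)       # [<li>one</li>, <li>two</li>]
print(len(ul.contents))  # 2

# .children iterates over the same nodes without building a new list.
child_texts = [child.get_text() for child in ul.children]
print(child_texts)  # ['one', 'two']
```

In a document with line breaks between tags, the whitespace strings also appear among the children, so the list is usually longer than the number of tags.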
Note: the .contents and .children attributes are not supported on string nodes in Beautiful Soup, because strings do not have child nodes.

Node Content

If a tag has only one child node and you need the content of that child node, use the string attribute. If the tag has more than one child node, .string is None, because it is unclear which child is meant.
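The behavior of the string attribute can be sketched as follows. The two p tags are hypothetical samples, one with a single child and one with two; get_text() is shown only as a possible fallback for the second case.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for illustration.
html = "<div><p>only child</p><p>a <b>b</b></p></div>"
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("p")

print(first.string)       # only child
print(second.string)      # None, this <p> has two children
print(second.get_text())  # a b
```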
Call the parent attribute to locate the parent of a node. To get the tag name of the parent, use parent.name. For example, soup.title.parent.name returns head when the title tag sits inside the head tag.
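A sketch of walking upward with the parent attribute, using a hypothetical document with a title inside head:

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for illustration.
html = "<html><head><title>Demo</title></head></html>"
soup = BeautifulSoup(html, "html.parser")
title = soup.title

print(title.parent.name)         # head
print(title.string.parent.name)  # title, the parent of the string is its tag
print(soup.html.parent.name)     # [document], the top-level tag's parent is the soup
print(soup.parent)               # None, the soup object has no parent
```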
Sibling nodes are nodes at the same level as the current node. The next_sibling attribute gets the next sibling of a node, and the previous_sibling attribute gets the previous one. If no such node exists, None is returned.
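The sibling attributes can be sketched like this. The two a tags and their id values are hypothetical, and they are written without whitespace between them; in a typical document, the next sibling of a tag is often a whitespace string rather than the next tag.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for illustration.
html = "<p><a id='a1'>one</a><a id='a2'>two</a></p>"
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a", id="a1")
second = first.next_sibling

print(second)                         # <a id="a2">two</a>
print(second.previous_sibling["id"])  # a1
print(second.next_sibling)            # None, there is no later sibling
print(first.previous_sibling)         # None, there is no earlier sibling
```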
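The next_element attribute gets the node that was parsed immediately after the current one, and the previous_element attribute gets the node parsed immediately before it. This differs from the sibling attributes: the next element of a tag is often its own first child, such as the string it contains.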
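A sketch contrasting next_element with next_sibling, assuming a hypothetical p tag that holds two small tags:

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for illustration.
soup = BeautifulSoup("<p><a>one</a><b>two</b></p>", "html.parser")
a = soup.a

print(a.next_element)           # one, the string inside <a> was parsed next
print(a.next_sibling)           # <b>two</b>, the next node at the same level
print(soup.b.previous_element)  # one, the string parsed just before <b>
```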
3. Search the document tree

BeautifulSoup defines many search methods, such as find() and find_all(). find_all() is the most commonly used: it returns a list of every node that matches, while find() returns only the first match. The other search methods mirror the traversal directions covered above, searching among parent nodes, child nodes, sibling nodes, and so on.
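A sketch of find_all() and find(), using a hypothetical fragment with three links; the href values and the class name "link" are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for illustration.
html = """
<p class="intro">Intro</p>
<a href="/a" class="link">A</a>
<a href="/b" class="link">B</a>
<a href="/c">C</a>
"""
soup = BeautifulSoup(html, "html.parser")

# All <a> tags, returned as a list.
links = soup.find_all("a")
hrefs = [a["href"] for a in links]
print(hrefs)  # ['/a', '/b', '/c']

# Filter by CSS class; class_ is used because class is a Python keyword.
styled = [a.get_text() for a in soup.find_all("a", class_="link")]
print(styled)  # ['A', 'B']

# find() returns only the first match, or None when nothing matches.
print(soup.find("a")["href"])  # /a
print(soup.find("table"))      # None
```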
Use find_all() to find the document content you want.

Summary

At this point, the basic knowledge and usage of BeautifulSoup, within the scope of Ah-chan's understanding, have been outlined. Please forgive any mistakes, and let's keep moving forward together.

References: the BeautifulSoup official website; https://blog.csdn.net/Eastmount