Big and small! The little sister tells you everything about BeautifulSoup


The previous article was the first part of this series on the basics of the BeautifulSoup crawler. It mainly covered installing BeautifulSoup and getting started with it, and quickly went through how to use BeautifulSoup to locate tags and extract their content. Today's article takes a deeper look at BeautifulSoup's syntax and related usage.

1. BeautifulSoup Object

BeautifulSoup converts complex HTML documents into a tree structure, where each node is a Python object. The official BeautifulSoup documentation summarizes all objects into the following four types:

  • Tags
  • NavigableString
  • BeautifulSoup
  • Comment

Next, we will introduce the four objects of BeautifulSoup in detail:

Tags

The Tag object represents a tag in an XML or HTML document; in plain terms, it is an HTML tag. It corresponds exactly to the tag in the original HTML or XML document and has many methods and attributes. In BeautifulSoup it is accessed as soup.Tag, where Tag is the name of an HTML tag such as a or title, and the result is the complete tag, including its attributes and content. For example, the following are Tags:

  <title>BeautifulSoup Technical Details</title>
  <p class="title">Hello</p>
  <p class="con">Python technology</p>

In the HTML code above, title and p are tag names; the start tag, the end tag, and the content between them together make up a Tag. The code for obtaining a tag is as follows:

  # Create a soup object from a local file
  from bs4 import BeautifulSoup
  soup = BeautifulSoup(open('test.html', 'rb'), "html.parser")

  # Get the first a tag
  a = soup.a  # Tag
  print('The content of tag a is:', a)

In addition, the most important attributes of Tag are name and attrs.

  • name

The name attribute gets the name of a tag in the document tree. To get the name of the title tag, just use soup.title.name; the value returned is the name of the tag itself.
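
For instance, assuming the test.html document used throughout this article has already been parsed into soup, a minimal sketch looks like this:

  print(soup.title.name)   # title
  print(soup.p.name)       # p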

  • attrs

attrs is short for attributes. Attributes are an important part of web page tags, and one tag may carry several of them, for example:

  <a href="https://www.baidu.com" class="xiaodu" id="l1"> ddd</a>

The tag in the example above has three attributes: href, whose value is "https://www.baidu.com"; class, whose value is "xiaodu"; and id, whose value is "l1". Tag attributes are handled exactly like a Python dictionary. The following code gets all the attributes of the first p tag as a dictionary of attribute names and values, and then reads the class attribute of the a tag:

  # Get all attributes of the first p tag
  print(soup.p.attrs)

  # Get an attribute value
  print(soup.a['class'])
  # ['xiaodu']
  print(soup.a.get('class'))
  # ['xiaodu']

Each tag in BeautifulSoup may have many attributes, which can be obtained through ".attrs". The attributes of the tag can be modified, deleted or added.
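
As a minimal sketch of such changes, reusing the soup object from above (the attribute name data-x is purely illustrative):

  tag = soup.a
  tag['id'] = 'link1'       # modify an existing attribute
  tag['data-x'] = 'demo'    # add a new attribute
  del tag['class']          # delete an attribute
  print(tag.attrs)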

NavigableString

NavigableString is also called a traversable string. The strings inside tags are wrapped by BeautifulSoup in the NavigableString class. A NavigableString behaves like a Python Unicode string and also supports some of the features described in the sections on traversing and searching the document tree. The following code checks the type of a NavigableString:

  # coding=utf-8
  from bs4 import BeautifulSoup

  soup = BeautifulSoup(open('test.html', 'rb'), "html.parser")
  tag = soup.title
  print(type(tag.string))

The output is as follows:

  <class 'bs4.element.NavigableString'>
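
If the text needs to be used outside of BeautifulSoup, it can be converted into an ordinary Python string; a minimal sketch continuing the code above:

  text = str(tag.string)   # plain Python string
  print(text)              # BeautifulSoup Technical Details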

BeautifulSoup

The BeautifulSoup object represents the entire content of a document. It can usually be treated as a Tag object and supports most of the methods described for traversing and searching the document tree. The following code prints the type of the soup object, which is the BeautifulSoup type:

  # coding=utf-8
  from bs4 import BeautifulSoup

  soup = BeautifulSoup(open('test.html', 'rb'), "html.parser")
  print(type(soup))

The output is as follows:

  <class 'bs4.BeautifulSoup'>

Because the BeautifulSoup object is not a real HTML or XML tag, it has no name or attrs attributes. However, it is sometimes convenient to look at its .name attribute, so the BeautifulSoup object is given the special .name value "[document]".
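
A minimal sketch that prints this special name, reusing the soup object created above:

  print(soup.name)
  # [document]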

Comment

The Comment object is a special type of NavigableString that is used to handle comments in a document. The following sample code reads the content of a comment:

  from bs4 import BeautifulSoup

  markup = "<b><!-- hello comment code --></b>"
  soup = BeautifulSoup(markup, "html.parser")
  comment = soup.b.string
  print(type(comment))
  print(comment)

The output is as follows:

  <class 'bs4.element.Comment'>
  hello comment code

2. Traverse the document tree

Having explained the four objects above, we now look at traversing the document tree, searching the document tree, and BeautifulSoup's commonly used functions. In BeautifulSoup, a tag may contain multiple strings or other tags, which are called the child tags of that tag.

Let's continue to use the following HTML document as an example:

  <!DOCTYPE html>
  <html lang="en">
  <head>
  <title>BeautifulSoup Technical Details</title>
  </head>
  <body>
  <p class="title">Hello</p>
  <p class="con">Python technology</p>

  <a href="https://www.baidu.com" class="xiaodu" id="l1"> ddd</a>

  </body>
  </html>

  • Child Nodes

A Tag may contain multiple strings or other tags, which are the child nodes of this Tag. Beautiful Soup provides many operations and traversal properties of child nodes.

For example, get the child nodes of the head tag through the contents attribute:

  # coding=utf-8
  from bs4 import BeautifulSoup

  soup = BeautifulSoup(open('test.html', 'rb'), "html.parser")
  print(soup.head.contents)

The output is as follows:

  ['\n', <title>BeautifulSoup Technical Details</title>, '\n']

Note: These attributes are not supported on string nodes in Beautiful Soup, because strings do not have child nodes.
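
Besides .contents, which returns a list, the .children attribute yields the same direct children one at a time as a generator; a minimal sketch:

  for child in soup.head.children:
      print(repr(child))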

  • Node Content

If a tag has only one child node and you need to get the content of that child node, you need to use the string attribute to output the content of the node:

  # coding=utf-8
  from bs4 import BeautifulSoup

  soup = BeautifulSoup(open('test.html', 'rb'), "html.parser")
  print(soup.head.string)
  print(soup.title.string)

The output is as follows:

  None
  BeautifulSoup Technical Details

soup.head.string is None because head has more than one child node (the surrounding whitespace plus the title tag), while soup.title.string returns the text of title's single child.

  • Parent Node

Call the parent attribute to locate the parent node. If you need to get the node's label name, use parent.name. The example is as follows:

  # coding=utf-8
  from bs4 import BeautifulSoup

  soup = BeautifulSoup(open('test.html', 'rb'), "html.parser")

  p = soup.p
  print(p.parent)
  print(p.parent.name)

  content = soup.head.title.string
  print(content.parent)
  print(content.parent.name)

The output is as follows:

  <body>
  <p class="title">Hello</p>
  <p class="con">Python technology</p>
  <a class="xiaodu" href="https://www.baidu.com" id="l1"> ddd</a>
  </body>
  body
  <title>BeautifulSoup Technical Details</title>
  title

  • Sibling Nodes

Sibling nodes refer to nodes at the same level as the current node. The next_sibling attribute gets the next sibling node of the node, while the previous_sibling attribute gets the previous sibling node of the node. If the node does not exist, None is returned.

  print(soup.p.next_sibling)
  print(soup.p.previous_sibling)
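
Note that with html.parser the whitespace between tags is itself a NavigableString node, so the next sibling of the first p tag is typically a newline string rather than the second p; a minimal sketch:

  sibling = soup.p.next_sibling
  print(repr(sibling))                      # usually '\n'
  print(soup.p.next_sibling.next_sibling)   # the second <p> tag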

  • Preceding and Following Nodes

Call the next_element attribute to get the next parsed object and previous_element to get the previous one. Unlike the sibling attributes, these follow the parse order of the document, so a tag's next element may be the string inside it. The code example is as follows:

  print(soup.p.next_element)
  print(soup.p.previous_element)

3. Search the document tree

BeautifulSoup defines many search methods, such as find() and find_all(), with find_all() being the most commonly used. The other search methods mirror the traversal of the document tree and cover parent nodes, child nodes, sibling nodes and so on. Code using the find_all() method is as follows:

  # coding=utf-8
  from bs4 import BeautifulSoup

  soup = BeautifulSoup(open('test.html', 'rb'), "html.parser")
  urls = soup.find_all('p')
  for u in urls:
      print(u)

The output is as follows:

  <p class="title">Hello</p>
  <p class="con">Python technology</p>

Use find_all() to find the document content you want.
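
find_all() also accepts attribute filters, and find() returns only the first match; a minimal sketch against the same test.html:

  # Filter by CSS class; class_ is used because class is a Python keyword
  print(soup.find_all('a', class_='xiaodu'))

  # Filter by id, or get just the first match with find()
  print(soup.find_all(id='l1'))
  print(soup.find('p'))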

Summary

At this point, the basic knowledge and usage of BeautifulSoup, as far as Ah-chan understands them, have been covered. Please forgive any mistakes, and let's keep improving together.

References

BeautifulSoup official documentation
Eastmount's blog: https://blog.csdn.net/Eastmount
