How to replace the Query field in the URL?


When writing a crawler, we often need to generate a new URL from the current one. For example, consider the following pseudocode:

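    import re

    current_url = 'https://www.kingname.info/archives/page/2/'
    # Capture the page number that follows "/page/".
    current_page = re.search(r'/page/(\d+)', current_url).group(1)
    next_page = int(current_page) + 1
    # Replace only the page-number segment, not every run of digits in the URL.
    next_url = re.sub(r'/page/\d+', f'/page/{next_page}', current_url)
    make_request(next_url)  # pseudocode: stands in for whatever sends the request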

The running effect is shown in the figure below:

But sometimes the paging parameter is not a number. For example, some websites expose a URL like this: https://xxx.com/articlelist?category=technology&after=asdrtJKSAZFD

When you access this URL, it returns a JSON string, and this JSON contains the following fields:

    ...
    "paging": {
        "cursors": {
            "before": "MTA3NDU0NDExNDEzNTgz",
            "after": "MTE4OTc5MjU0NDQ4NTkwMgZDZD"
        },
    }
    ...

This pattern is common on feed-style websites, where you can only scroll down to load the next page and cannot jump directly to a page number. Every response returns the after value for the next page. To request the next page, replace the value of after= in the current URL with that value.
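As a minimal sketch of that step, the snippet below pulls the next cursor out of a response body. The sample body is an assumption modeled on the JSON fragment above; a real response will carry other fields as well.

```python
import json

# Assumed sample body, modeled on the JSON fragment shown above.
response_text = '''
{
    "paging": {
        "cursors": {
            "before": "MTA3NDU0NDExNDEzNTgz",
            "after": "MTE4OTc5MjU0NDQ4NTkwMgZDZD"
        }
    }
}
'''

data = json.loads(response_text)
# Chain .get() calls so a missing key yields None instead of raising KeyError.
next_cursor = data.get('paging', {}).get('cursors', {}).get('after')
print(next_cursor)  # MTE4OTc5MjU0NDQ4NTkwMgZDZD
```

If next_cursor comes back as None, the response had no further cursor, which a crawler would probably treat as the last page.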

Replacing a parameter in a URL this way is not as simple as it looks, because the URL can take four forms:

First page, no after parameter: https://xxx.com/articlelist?category=technology
First page, after parameter present but with no value: https://xxx.com/articlelist?category=technology&after=
Subsequent page, nothing follows the after value: https://xxx.com/articlelist?category=technology&after=asdrtJKSAZFD
Subsequent page, other parameters follow the after value: https://xxx.com/articlelist?category=technology&after=asdrtJKSAZFD&other=abc

You can try to cover these 4 situations and generate the URL of the next page using regular expressions.
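For comparison, here is one possible regex-based attempt. The function name replace_after_with_regex is hypothetical, and the sketch deliberately ignores URL fragments and percent-encoding, both of which a robust version would also have to handle.

```python
import re


def replace_after_with_regex(url, value):
    """Set the `after` query parameter using regular expressions only."""
    if re.search(r'[?&]after=', url):
        # The parameter exists (possibly empty): swap everything up to the next "&".
        # A function is used as the replacement so backslashes in `value` stay literal.
        return re.sub(r'([?&]after=)[^&]*', lambda m: m.group(1) + value, url)
    # The parameter is missing: append it with the right separator.
    separator = '&' if '?' in url else '?'
    return f'{url}{separator}after={value}'


urls = [
    'https://xxx.com/articlelist?category=technology',
    'https://xxx.com/articlelist?category=technology&after=',
    'https://xxx.com/articlelist?category=technology&after=asdrtJKSAZFD',
    'https://xxx.com/articlelist?category=technology&after=asdrtJKSAZFD&other=abc',
]
results = [replace_after_with_regex(url, '0000000') for url in urls]
for result in results:
    print(result)
```

Even this short version needs two patterns and a branch for the separator, and it still leaves several edge cases uncovered.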

In fact, we don't need regular expressions at all. The urllib module in Python's standard library already solves this problem. Let's look at a piece of code:

    from urllib.parse import urlparse, urlunparse, parse_qs, urlencode


    def replace_field(url, name, value):
        parse = urlparse(url)
        query = parse.query
        # parse_qs returns a dict whose values are lists, e.g. {'category': ['technology']}
        query_pair = parse_qs(query)
        query_pair[name] = value
        # doseq=True expands each list into separate pairs instead of encoding the list itself
        new_query = urlencode(query_pair, doseq=True)
        new_parse = parse._replace(query=new_query)
        next_page = urlunparse(new_parse)
        return next_page


    url_list = [
        'https://xxx.com/articlelist?category=technology',
        'https://xxx.com/articlelist?category=technology&after=',
        'https://xxx.com/articlelist?category=technology&after=asdrtJKSAZFD',
        'https://xxx.com/articlelist?category=technology&after=asdrtJKSAZFD&other=abc',
    ]

    for url in url_list:
        next_page = replace_field(url, 'after', '0000000')
        print(next_page)

The running effect is shown in the figure below:

As the figure shows, in all four cases we successfully set the parameter after=0000000 for the next page, without having to work out a regular expression that handles every case.

urlparse and urlunparse are inverse functions. The former converts a URL into a ParseResult object, and the latter converts a ParseResult object back into a URL string.
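A short round trip illustrates the pair. The URL is one of the sample URLs used earlier in this article.

```python
from urllib.parse import urlparse, urlunparse

url = 'https://xxx.com/articlelist?category=technology&after=asdrtJKSAZFD&other=abc'
parsed = urlparse(url)

# ParseResult exposes each component of the URL as a named field.
print(parsed.scheme)  # https
print(parsed.netloc)  # xxx.com
print(parsed.path)    # /articlelist
print(parsed.query)   # category=technology&after=asdrtJKSAZFD&other=abc

# urlunparse reassembles the components into the original string.
round_trip = urlunparse(parsed)
print(round_trip == url)  # True
```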

The .query attribute of the ParseResult object is a string holding the part of the URL after the question mark, for example: category=technology&after=asdrtJKSAZFD

parse_qs and urlencode are also a pair of inverse functions. The former converts the .query string into a dictionary, and the latter converts the dictionary back into a string in .query format. Each value in that dictionary is a list, which is why urlencode is called with doseq=True.
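The following sketch shows the conversion in both directions. It also shows that parse_qs drops parameters with empty values by default, which is how an empty after= parameter ends up being handled without special treatment.

```python
from urllib.parse import parse_qs, urlencode

query = 'category=technology&after=asdrtJKSAZFD'

# parse_qs maps each parameter name to a list of values.
pairs = parse_qs(query)
print(pairs)  # {'category': ['technology'], 'after': ['asdrtJKSAZFD']}

# With doseq=True, urlencode emits one name=value pair per list element.
rebuilt = urlencode(pairs, doseq=True)
print(rebuilt)  # category=technology&after=asdrtJKSAZFD

# A parameter with an empty value is dropped unless keep_blank_values=True.
without_blank = parse_qs('category=technology&after=')
with_blank = parse_qs('category=technology&after=', keep_blank_values=True)
print(without_blank)  # {'category': ['technology']}
print(with_blank)     # {'category': ['technology'], 'after': ['']}
```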

After using parse_qs to convert the query into a dictionary, you can modify the parameter value and then convert it back again.

Since ParseResult is a named tuple, its .query field is read-only and cannot be assigned directly. Instead, we call ._replace, which is a documented named tuple method despite the leading underscore, to produce a new ParseResult object with the new query. Finally, we convert it back to a URL.
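A minimal sketch of this behavior, using one of the sample URLs from this article: direct assignment fails, while ._replace returns a modified copy and leaves the original untouched.

```python
from urllib.parse import urlparse, urlunparse

parsed = urlparse('https://xxx.com/articlelist?category=technology&after=asdrtJKSAZFD')

# Direct assignment is rejected because the fields cannot be set.
try:
    parsed.query = 'category=technology&after=0000000'
    assignment_failed = False
except AttributeError:
    assignment_failed = True
print(assignment_failed)  # True

# _replace returns a new ParseResult; the original object is unchanged.
updated = parsed._replace(query='category=technology&after=0000000')
print(parsed.query)         # category=technology&after=asdrtJKSAZFD
new_url = urlunparse(updated)
print(new_url)              # https://xxx.com/articlelist?category=technology&after=0000000
```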

That is all for today: how to use the functions that come with urllib to replace a field in a URL.
