This article introduces Puppeteer, a tool published and maintained by Google. After reading it, you will understand its basic usage and common functions.

1. Introduction to Puppeteer

Puppeteer is a Node library that provides a high-level API to control Chromium or Chrome over the DevTools Protocol. With Puppeteer you can read page DOM nodes, inspect network requests and responses, drive page behavior programmatically, monitor and optimize page performance, and capture page screenshots and PDFs. In short, it lets you operate the Chrome browser from code in many ways.

2. Puppeteer core structure

Puppeteer's structure mirrors the structure of the browser. Its core objects are as follows:

- Browser: A browser instance, which can own one or more browser contexts. A Browser object can be created through puppeteer.launch or puppeteer.connect.
- BrowserContext: A browser context, which can hold multiple pages. When a browser instance is created, a default browser context is created with it (it cannot be closed). In addition, you can call browser.createIncognitoBrowserContext() to create an incognito browser context, which does not share cookies or cache with other browser contexts. Newer Puppeteer releases name this method browser.createBrowserContext().
- Page: Contains at least one main frame. In addition to the main frame, there may be other frames, such as those created by iframe elements.
- Frame: A frame in a page. At any point in time, the page exposes its current frame tree through the page.mainFrame() and frame.childFrames() methods. Each frame has at least one execution context.
- ExecutionContext: Represents a JavaScript execution context.
- Worker: Has a single execution context, making it easy to interact with WebWorkers.
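
The Page and Frame relationship described above can be made concrete with a small traversal. This is only a sketch: it relies on the frame.url() and frame.childFrames() methods, and the helper name collectFrameUrls is made up for illustration.

```javascript
// Hypothetical helper: walk a frame tree depth-first and collect each frame's URL.
// `frame` is expected to be a Puppeteer Frame, e.g. the result of page.mainFrame().
function collectFrameUrls(frame, depth = 0, out = []) {
  // Indent by depth so the nesting is visible when the list is printed.
  out.push(`${'  '.repeat(depth)}${frame.url()}`);
  for (const child of frame.childFrames()) {
    collectFrameUrls(child, depth + 1, out);
  }
  return out;
}
```

With a real page, this could be called as console.log(collectFrameUrls(page.mainFrame()).join('\n')) after navigation has finished.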

3. Basic Use and Common Functions

Puppeteer is relatively simple to use, so let's get started.

3.1 Start the Browser

The core step is to asynchronously call puppeteer.launch(), which creates a Browser instance according to the given configuration parameters.

    const path = require('path');
    const puppeteer = require('puppeteer');

    // Path to a local Chromium build; adjust it to your own installation
    const chromiumPath = path.join(__dirname, '../', 'chromium/chromium/chrome.exe');

    async function main() {
      // Start the Chrome browser
      const browser = await puppeteer.launch({
        // Specify the browser path
        executablePath: chromiumPath,
        // Whether to run in headless mode; headless is the default
        headless: false
      });
    }

    main();
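
The launch example leaves the browser running. One common pattern, sketched here under the assumption that the browser should be closed when the work is done, wraps launch and close in try/finally. The helper name withBrowser and its parameters are hypothetical, not part of Puppeteer.

```javascript
// Hypothetical helper: launch a browser, run a task, and always close the browser.
// `launcher` is anything with a launch(options) method, normally require('puppeteer').
async function withBrowser(launcher, options, task) {
  const browser = await launcher.launch(options);
  try {
    return await task(browser);
  } finally {
    // Runs even if task() throws, so no Chrome process is left behind.
    await browser.close();
  }
}
```

A call might look like withBrowser(require('puppeteer'), { headless: false }, async browser => { /* use browser */ }).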

3.2 Access Page

To access a page, create a new page in a browser context (either the default context or an incognito one) and then navigate it to the target URL.

    async function main() {
      // Start the Chrome browser
      // ...

      // Create a new page in the default browser context
      const page1 = await browser.newPage();

      // Navigate the blank page to the specified URL
      await page1.goto('https://51yangsheng.com');

      // Create an incognito browser context
      const browserContext = await browser.createIncognitoBrowserContext();
      // Create a new page in this context
      const page2 = await browserContext.newPage();
      // Await the navigation so that errors are not silently lost
      await page2.goto('https://www.baidu.com');
    }

    main();
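
Because the method for creating an incognito context has been renamed in newer Puppeteer releases (createBrowserContext instead of createIncognitoBrowserContext), a small compatibility wrapper can help. The following is a sketch under that assumption; the helper name createIsolatedContext is hypothetical.

```javascript
// Hypothetical helper: create an isolated (incognito) browser context on either
// older or newer Puppeteer versions.
async function createIsolatedContext(browser) {
  if (typeof browser.createBrowserContext === 'function') {
    // Method name used by newer Puppeteer releases.
    return browser.createBrowserContext();
  }
  // Method name used in this article and by older releases.
  return browser.createIncognitoBrowserContext();
}
```

The returned context is used exactly as in the example above: const page2 = await (await createIsolatedContext(browser)).newPage().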

3.3 Device Simulation

You often need to see how a page renders on different types of devices. Device simulation makes this possible. The following example simulates an iPhone X.

    async function main() {
      // Start the browser

      // Device simulation: simulate an iPhone X
      // User agent
      await page1.setUserAgent('Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1');
      // Viewport simulation
      await page1.setViewport({
        width: 375,
        height: 812
      });

      // Visit a page
    }

    main();
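
The user agent and viewport can also be applied in one step with page.emulate(), which accepts an object holding both. The sketch below reuses the values from the example above; the deviceScaleFactor, isMobile and hasTouch values are assumptions added for illustration, and the names iPhoneX and emulateDevice are hypothetical.

```javascript
// Illustrative device descriptor. The user agent, width and height come from the
// example above; the remaining viewport fields are assumed values.
const iPhoneX = {
  userAgent: 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1',
  viewport: {
    width: 375,
    height: 812,
    deviceScaleFactor: 3,
    isMobile: true,
    hasTouch: true
  }
};

// Apply the user agent and viewport together; call this before page.goto().
async function emulateDevice(page, device) {
  await page.emulate(device);
}
```

Emulation is best set up before navigating, since many sites choose their layout based on the first request's user agent.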

3.4 Get DOM Node

There are two ways to get DOM nodes. One is to call the page's built-in methods such as page.$eval directly; the other is to run your own JavaScript in the page through page.evaluate.

    async function main() {
      // Start the Chrome browser
      const browser = await puppeteer.launch({
        // Specify the browser path
        executablePath: chromiumPath,
        // Whether to run in headless mode; headless is the default
        headless: false
      });

      // Create a new page in the default browser context
      const page1 = await browser.newPage();

      // Navigate the blank page to the specified URL
      await page1.goto('https://www.baidu.com');

      // Wait for the title node to appear
      await page1.waitForSelector('title');

      // Get the node with the page's own method
      const titleDomText1 = await page1.$eval('title', el => el.innerText);
      console.log(titleDomText1); // prints the page title

      // Get the node with JavaScript executed in the page
      const titleDomText2 = await page1.evaluate(() => {
        const titleDom = document.querySelector('title');
        return titleDom.innerText;
      });
      console.log(titleDomText2);
    }

    main();
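
When more than one node matches, page.$$eval() passes all matching elements to the callback as an array. The following sketch collects the text of every match; the helper name collectTexts is hypothetical.

```javascript
// Hypothetical helper: return the trimmed, non-empty text of every element
// matching `selector`.
async function collectTexts(page, selector) {
  // The callback runs inside the page, so it can only use its own arguments.
  const texts = await page.$$eval(selector, els =>
    els.map(el => el.textContent.trim())
  );
  // Filtering happens back in Node, on plain strings.
  return texts.filter(text => text.length > 0);
}
```

For example, await collectTexts(page1, 'a') would return the visible text of each link on the page.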

3.5 Listening for Requests and Responses

Next, let's monitor the request and response for one JavaScript file loaded by Baidu. The request event fires when the page issues a request, and the response event fires when a response is received.

    async function main() {
      // Start the Chrome browser
      const browser = await puppeteer.launch({
        // Specify the browser path
        executablePath: chromiumPath,
        // Whether to run in headless mode; headless is the default
        headless: false
      });

      // Create a new page in the default browser context
      const page1 = await browser.newPage();

      // Register the listeners before navigating so no events are missed
      page1.on('request', request => {
        if (request.url() === 'https://s.bdstatic.com/common/openjs/amd/eslx.js') {
          console.log(request.resourceType());
          console.log(request.method());
          console.log(request.headers());
        }
      });

      page1.on('response', response => {
        if (response.url() === 'https://s.bdstatic.com/common/openjs/amd/eslx.js') {
          console.log(response.status());
          console.log(response.headers());
        }
      });

      // Navigate the blank page to the specified URL
      await page1.goto('https://www.baidu.com');
    }

    main();
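
The same request event can be used to summarize traffic instead of watching a single URL. As a sketch, the hypothetical helper below counts requests per resource type using request.resourceType().

```javascript
// Hypothetical helper: count requests per resource type for a page.
// Attach it before page.goto() so that early requests are not missed.
function countRequestsByType(page) {
  const counts = {};
  page.on('request', request => {
    const type = request.resourceType();
    counts[type] = (counts[type] || 0) + 1;
  });
  // The same object keeps updating as further requests arrive.
  return counts;
}
```

Typical usage would be const counts = countRequestsByType(page1); followed by await page1.goto(...) and then console.log(counts).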

3.6 Intercepting a Request

By default, requests observed through the request event are read-only and cannot be intercepted. To intercept requests, enable the request interceptor with page.setRequestInterception(value), and then use request.abort, request.continue or request.respond to decide what happens to each request. Once interception is enabled, every request must be handled by one of these methods, otherwise it will stall.

    async function main() {
      // Start the Chrome browser
      const browser = await puppeteer.launch({
        // Specify the browser path
        executablePath: chromiumPath,
        // Whether to run in headless mode; headless is the default
        headless: false
      });

      // Create a new page in the default browser context
      const page1 = await browser.newPage();

      // Enable request interception
      await page1.setRequestInterception(true); // true to enable, false to disable
      page1.on('request', request => {
        if (request.url() === 'https://s.bdstatic.com/common/openjs/amd/eslx.js') {
          // Terminate the request
          request.abort();
          console.log('The request was terminated!!!');
        } else {
          // Continue the request
          request.continue();
        }
      });

      // Navigate the blank page to the specified URL
      await page1.goto('https://www.baidu.com');
    }

    main();
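
Interception is often used to block whole categories of resources and not just a single URL. The sketch below aborts images, fonts and media by resource type; the set of blocked types and the names BLOCKED_TYPES, shouldAbort and blockHeavyResources are choices made for illustration.

```javascript
// Resource types to block; an assumption for this sketch, adjust as needed.
const BLOCKED_TYPES = new Set(['image', 'font', 'media']);

// Decide whether a request should be aborted, based on its resource type.
function shouldAbort(request) {
  return BLOCKED_TYPES.has(request.resourceType());
}

// Hypothetical helper: enable interception and abort the blocked types.
async function blockHeavyResources(page) {
  await page.setRequestInterception(true);
  page.on('request', request => {
    if (shouldAbort(request)) {
      request.abort();
    } else {
      // Every intercepted request must be continued, aborted or responded to.
      request.continue();
    }
  });
}
```

Calling await blockHeavyResources(page1) before page1.goto() applies the filter to the whole navigation.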

3.7 Screenshots

Taking screenshots is a very useful feature. A screenshot preserves a snapshot of the page, which is convenient for later troubleshooting. (Note: take screenshots in headless mode, otherwise the screenshots may have problems.)

    async function main() {
      // Start the browser and access the page

      // Take a screenshot with the page.screenshot function.
      // By default it captures only the current viewport;
      // setting fullPage: true captures the full scrollable page.
      await page1.screenshot({
        path: '../imgs/fullScreen.png',
        fullPage: true
      });

      // Capture a rectangular region of the page
      await page1.screenshot({
        path: '../imgs/partScreen.jpg',
        type: 'jpeg',
        quality: 80,
        clip: {
          x: 0,
          y: 0,
          width: 375,
          height: 300
        }
      });

      await browser.close();
    }

    main();
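
Besides clipping by coordinates, a single element can be captured through its element handle, which also has a screenshot() method. This is a minimal sketch; the helper name screenshotElement and its return value are hypothetical.

```javascript
// Hypothetical helper: save a screenshot of the first element matching `selector`.
// Resolves to true if the element was found and captured, false otherwise.
async function screenshotElement(page, selector, path) {
  const element = await page.$(selector);
  if (!element) {
    // page.$ resolves to null when nothing matches the selector.
    return false;
  }
  await element.screenshot({ path });
  return true;
}
```

This avoids hard-coding x, y, width and height when the region of interest corresponds to one DOM node.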

3.8 Generate PDF

In addition to screenshots, you can also preserve a snapshot of the page as a PDF.

    async function main() {
      // Start the browser and access the page

      // Generate a PDF file from the page content with page.pdf
      // Note: this can only be called in headless mode
      await page1.pdf({
        path: '../pdf/baidu.pdf'
      });

      await browser.close();
    }

    main();
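
page.pdf() accepts further options besides path, such as format, printBackground and margin. The sketch below gathers a few of them in one place; the specific values and the names buildPdfOptions and savePdf are assumptions for illustration.

```javascript
// Hypothetical helper: build a page.pdf() options object with assumed defaults.
function buildPdfOptions(path) {
  return {
    path,
    format: 'A4',           // paper size
    printBackground: true,  // include CSS backgrounds in the output
    margin: { top: '1cm', right: '1cm', bottom: '1cm', left: '1cm' }
  };
}

// Hypothetical helper: write the current page to `path` using those options.
async function savePdf(page, path) {
  await page.pdf(buildPdfOptions(path));
}
```

For example, await savePdf(page1, '../pdf/baidu.pdf') would replace the page1.pdf call above.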

This article is reproduced from the WeChat public account "Front-end points, lines and surfaces".