Dynamic web scraping in Python usually relies on a few libraries, such as requests to handle plain HTTP requests, or selenium and pyppeteer to simulate browser behavior. The following article focuses on the use of selenium.
A brief introduction to selenium
selenium is a tool for testing web applications, but it is also widely used for web scraping, especially when you need to scrape content that is generated dynamically by JavaScript. selenium can simulate user behavior in the browser, such as clicking, entering text, and reading page elements.
Python dynamic web scraping example
First, make sure you have selenium installed. If not, you can install it via pip:
pip install selenium
You also need the WebDriver for the corresponding browser. Assuming you use Chrome, download the ChromeDriver that matches your Chrome version and make sure its path is added to the system environment variables, or specify its path directly in the code. (Recent Selenium releases, 4.6 and later, ship with Selenium Manager, which can download a matching driver automatically.)
Here is a simple example to grab the title of a web page:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
# Setting up webdriver
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
# Open the webpage
driver.get('https://www.example.com')
# Get the webpage title
title = driver.title
print(title)
# Close the browser
driver.quit()
This script will open example.com, get its title, and print it out.
Note that webdriver_manager is a third-party library that automatically downloads and manages WebDriver versions. If you don't want to use it, you can also download the WebDriver manually and specify its path.
Dynamic web pages often render content with JavaScript. selenium can wait for these elements to load before interacting with them, which makes it well suited to such pages.
Setting a proxy when scraping dynamic web pages in Python
When scraping dynamic web pages with Python, a proxy is often used: on the one hand it helps you avoid blocks and restrictions, and on the other it lets you distribute requests and work more efficiently.
Installation of selenium was covered above. You also need the WebDriver for your browser, either on the system PATH or with its path specified directly in the code. With that in place, we can configure a proxy and scrape a dynamic web page:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
# Set Chrome options
chrome_options = Options()
chrome_options.add_argument('--proxy-server=http://your_proxy_ip:port')
# Specify the WebDriver path explicitly (skip this if the driver is already on your PATH).
# Note: in Selenium 4 the old executable_path argument was removed; use Service instead:
# from selenium.webdriver.chrome.service import Service
# service = Service(executable_path='path/to/your/chromedriver')
# driver = webdriver.Chrome(service=service, options=chrome_options)
# If no path is specified, Selenium looks for the driver on the system PATH
driver = webdriver.Chrome(options=chrome_options)
# Open the webpage
driver.get('https://www.example.com')
# Get the webpage title
title = driver.title
print(title)
# Close the browser
driver.quit()
In this example, --proxy-server=http://your_proxy_ip:port is the argument that configures the proxy. You need to replace your_proxy_ip and port with the IP address and port number of the proxy server you actually use.
If your proxy server requires authentication, the usual URL format embeds the credentials:
chrome_options.add_argument('--proxy-server=http://username:password@your_proxy_ip:port')
where username and password are the credentials for your proxy server. Be aware, however, that Chrome ignores credentials embedded in --proxy-server and will show an authentication prompt instead; for authenticated proxies you typically need a workaround such as a small proxy-auth browser extension or a library like selenium-wire.
After running the above code, selenium will access the target web page through the configured proxy server and print out the title of the web page.
How to specify the path to ChromeDriver?
ChromeDriver is the driver binary that Selenium WebDriver uses for Chrome. It interacts with the Chrome browser through the WebDriver protocol to support functions such as automated testing and web scraping.
Specifying the path of ChromeDriver mainly involves the configuration of environment variables. Here are the specific steps:
1. Find the installation location of Chrome
You can find it by right-clicking the Google Chrome shortcut on the desktop and selecting "Open file location".
2. Add Chrome's installation path to the system Path environment variable
This lets the system find executables placed in that folder from any location.
3. Download and unzip ChromeDriver
Make sure to download the ChromeDriver that matches your Chrome browser version; unzipping the archive yields a chromedriver executable.
4. Copy the ChromeDriver executable into Chrome's installation folder
Since that folder is now on Path, the system can automatically find and run ChromeDriver whenever it is needed.
The above covers how selenium and WebDriver are used for dynamic web scraping in Python, and how to set a proxy while doing so. Of course, you can also use the examples above to practice the actual operations yourself.