Check out my books on Amazon at https://www.amazon.com/John-Au-Yeung/e/B08FT5NT62
Subscribe to my email list now at http://jauyeung.net/subscribe/
We can get data from web pages with Beautiful Soup.
It lets us parse the DOM and extract the data we want.
In this article, we’ll look at how to scrape HTML documents with Beautiful Soup.
Getting Started
We get started by running:
pip install beautifulsoup
Then we can write:
from bs4 import BeautifulSoup
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())
to add an HTML string and parse it with the BeautifulSoup
class.
Then we can print the parsed document in the last line.
Get Links and Text
We can get the links from the HTML string with the find_all
method:
from bs4 import BeautifulSoup
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
for link in soup.find_all('a'):
print(link.get('href'))
We just pass in the selector for the elements we wan to get.
Also, we can get all the text from the page with get_text()
:
from bs4 import BeautifulSoup
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.get_text())
Parse an External Document
We can parse an external document by opening it with open
:
from bs4 import BeautifulSoup
with open("index.html") as fp:
soup = BeautifulSoup(fp, 'html.parser')
print(soup.prettify())
Kinds of Objects
We can get a few kinds of objects with Beautiful Soup.
They include Tag
, NavigableString
, BeautifulSoup
, and Comment
.
Tag
A Tag
corresponds to an XML or HTML tag in the original docuemnt.
For example, we can write:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
print(type(tag))
to get the b
tag from the HTML string.
Then we get:
<class 'bs4.element.Tag'>
printed from the last line.
Name
We can get the name of the tag:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
print(tag.name)
Then we see b
printed.
Attributes
We can get attributes from the returned dictionary:
from bs4 import BeautifulSoup
tag = BeautifulSoup('<b id="boldest">bold</b>', 'html.parser').b
print(tag['id'])
We get the b
element.
Then we get the id
value from the returned dictionary.
Conclusion
We can get parse HTML and XML and get various elements, text, and attributes easily with Beautiful Soup.