Today I'll show you a way to scrape news headlines in Python in under 10 lines of code!
Let's get started...
First of all, import these libraries at the top of your Python script (if you don't have them yet, install them with pip install requests beautifulsoup4):
import requests
from bs4 import BeautifulSoup
For this tutorial, I'll be using BBC News as my news source. Use these 2 lines of code to set its URL and download the page:
url='https://www.bbc.com/news'
response = requests.get(url)
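If you want the download step to be a bit more forgiving, here's a minimal sketch of one way to wrap it. The helper name fetch_html and the 10-second timeout are assumptions for illustration only; the rest of this tutorial doesn't depend on them.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text (hypothetical helper)."""
    # Without a timeout, requests can wait indefinitely on a stalled server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx responses rather than
    # silently handing an error page to the parser.
    response.raise_for_status()
    return response.text
```

Calling fetch_html('https://www.bbc.com/news') would then give you the same text as response.text, just with a failed request showing up as an exception.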
Now we're ready to scrape using BeautifulSoup!
Head over to BBC News and inspect a news headline by right-clicking it and choosing Inspect.
As you'll see, each news headline is contained within an "h3" tag.
Now add these 4 lines of code to scrape and display all the h3 tags from BBC news:
soup = BeautifulSoup(response.text, 'html.parser')
headlines = soup.find('body').find_all('h3')
for x in headlines:
    print(x.text.strip())
- First, we define "soup" as the parsed HTML of the BBC News webpage.
- Next, we define "headlines" as a list of all h3 tags found within the page's body.
- Finally, we loop through the "headlines" list and display its contents one by one, using ".text" to pull out just the text inside each tag and ".strip()" to trim the surrounding whitespace.
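To see those three steps without touching the network, here's a small sketch that runs them on a hand-written page. SAMPLE_HTML is a made-up stand-in, not real BBC markup, and the titles list is just a convenient name for the cleaned-up results.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page; not real BBC markup.
SAMPLE_HTML = """
<html>
  <body>
    <h3>  First headline </h3>
    <p>Some teaser text.</p>
    <h3><a href="/news/1">Second headline</a></h3>
  </body>
</html>
"""

# Step 1: parse the HTML text into a searchable tree.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Step 2: collect every h3 tag inside the body, in document order.
headlines = soup.find("body").find_all("h3")

# Step 3: .text gathers the text inside each tag (nested tags included),
# and .strip() trims the whitespace around it.
titles = [h.text.strip() for h in headlines]
print(titles)  # ['First headline', 'Second headline']
```

If the list comes back empty on a real page, the site may be using a different tag than the one you expected, so it's worth inspecting a headline in your browser again.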
Full code
import requests
from bs4 import BeautifulSoup
url='https://www.bbc.com/news'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
headlines = soup.find('body').find_all('h3')
for x in headlines:
    print(x.text.strip())
Now if you run your script, your output should look something like this:
Byeeeee👋