A Smart, Automatic, Fast and Lightweight Web Scraper for Python

Stokry - Oct 11 '20 - - Dev Community

Hello everybody, I want to show you a great scraper that is called AutoScraper, cool name isn't it? 😃 😃 . A Smart, Automatic, Fast, and Lightweight Web Scraper for Python which makes web scraping easy. It gets a URL or the HTML content of a web page and a list of sample data that we want to scrape from that page. This data can be text, URL, or any HTML tag value of that page. It learns the scraping rules and returns similar elements. Then you can use this learned object with new URLs to get similar content or the same element of those new pages.

Install latest version from git repository using pip:

$ pip install git+https://github.com/alirezamika/autoscraper.git
Enter fullscreen mode Exit fullscreen mode

Install from PyPI:

pip install autoscraper
Enter fullscreen mode Exit fullscreen mode

Install from source:

python setup.py install
Enter fullscreen mode Exit fullscreen mode

How to use

Getting similar results

Say we want to fetch all related post titles in a stackoverflow page:

from autoscraper import AutoScraper

url = 'https://stackoverflow.com/questions/2081586/web-scraping-with-python'

wanted_list = ["How to call an external command?"]

scraper = AutoScraper()
result = scraper.build(url, wanted_list)
print(result)
Enter fullscreen mode Exit fullscreen mode

Here's the output:

[
    'How do I merge two dictionaries in a single expression in Python (taking union of dictionaries)?', 
    'How to call an external command?', 
    'What are metaclasses in Python?', 
    'Does Python have a ternary conditional operator?', 
    'How do you remove duplicates from a list whilst preserving order?', 
    'Convert bytes to a string', 
    'How to get line count of a large file cheaply in Python?', 
    "Does Python have a string 'contains' substring method?", 
    'Why is “1000000000000000 in range(1000000000000001)” so fast in Python 3?'
]
Enter fullscreen mode Exit fullscreen mode

Now you can use the scraper object to get related topics of any stackoverflow page:

scraper.get_result_similar('https://stackoverflow.com/questions/606191/convert-bytes-to-a-string')
Enter fullscreen mode Exit fullscreen mode

Getting exact result

Say we want to scrape live stock prices from yahoo finance:

from autoscraper import AutoScraper

url = 'https://finance.yahoo.com/quote/AAPL/'

wanted_list = ["124.81"]

scraper = AutoScraper()

the url (html=html_content)
result = scraper.build(url, wanted_list)
print(result)
Enter fullscreen mode Exit fullscreen mode

Note that you should update the wanted_list if you want to copy this code, as the content of the page dynamically changes.

Another example: Say we want to scrape the about text, number of stars and the link to issues of Github repo pages:

from autoscraper import AutoScraper

url = 'https://github.com/alirezamika/autoscraper'

wanted_list = ['A Smart, Automatic, Fast and Lightweight Web Scraper for Python', '2.2k', 'https://github.com/alirezamika/autoscraper/issues']

scraper = AutoScraper()
scraper.build(url, wanted_list)
Enter fullscreen mode Exit fullscreen mode

Pretty sweet ha?

Grouping results and removing unwanted ones

Here we want to scrape product name, price and rating from ebay product pages:

url = 'https://www.ebay.com/itm/Sony-PlayStation-4-PS4-Pro-1TB-4K-Console-Black/203084236670' 

wanted_list = ['Sony PlayStation 4 PS4 Pro 1TB 4K Console - Black', 'US $349.99', '4.8'] 

scraper.build(url, wanted_list)
Enter fullscreen mode Exit fullscreen mode

The items which we wanted have been on multiple sections of the page and the scraper tries to catch them all. So it may retrieve some extra information compared to what we have in mind. Let's run it on a different page:

scraper.get_result_exact('https://www.ebay.com/itm/Acer-Predator-Helios-300-15-6-144Hz-FHD-Laptop-i7-9750H-16GB-512GB-GTX-1660-Ti/114183725523')
Enter fullscreen mode Exit fullscreen mode

The result:

[
    "Acer Predator Helios 300 15.6'' 144Hz FHD Laptop i7-9750H 16GB 512GB GTX 1660 Ti",
    'ACER Predator Helios 300 i7-9750H 15.6" 144Hz FHD GTX 1660Ti 16GB 512GB SSD⚡RGB',
    'US $1,229.49',
    '5.0'
]
Enter fullscreen mode Exit fullscreen mode

As we can see we have one extra item here. We can run the get_result_exact or get_result_similar method with grouped=True parameter. It will group all results per its scraping rule:

scraper.get_result_exact('https://www.ebay.com/itm/Acer-Predator-Helios-300-15-6-144Hz-FHD-Laptop-i7-9750H-16GB-512GB-GTX-1660-Ti/114183725523', grouped=True)
Enter fullscreen mode Exit fullscreen mode

Output:

{
    'rule_sks3': ["Acer Predator Helios 300 15.6'' 144Hz FHD Laptop i7-9750H 16GB 512GB GTX 1660 Ti"],
    'rule_d4n5': ['ACER Predator Helios 300 i7-9750H 15.6" 144Hz FHD GTX 1660Ti 16GB 512GB SSD⚡RGB'],
    'rule_fmrm': ['ACER Predator Helios 300 i7-9750H 15.6" 144Hz FHD GTX 1660Ti 16GB 512GB SSD⚡RGB'],
    'rule_2ydq': ['US $1,229.49'],
    'rule_buhw': ['5.0'],
    'rule_vpfp': ['5.0']
}
Enter fullscreen mode Exit fullscreen mode

Now we can use keep_rules or remove_rules methods to prune unwanted rules:

scraper.keep_rules(['rule_sks3', 'rule_2ydq', 'rule_buhw'])

scraper.get_result_exact('https://www.ebay.com/itm/Acer-Predator-Helios-300-15-6-144Hz-FHD-Laptop-i7-9750H-16GB-512GB-GTX-1660-Ti/114183725523')
Enter fullscreen mode Exit fullscreen mode

And now the result only contains the ones which we want:

[
    "Acer Predator Helios 300 15.6'' 144Hz FHD Laptop i7-9750H 16GB 512GB GTX 1660 Ti",
    'US $1,229.49',
    '5.0'
]
Enter fullscreen mode Exit fullscreen mode

This is one of those modules you definitely need to try, I've been using it for a few days and I'm very satisfied.
This is the link so you can check it out.

Thank you all.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .