With the recent advances in artificial intelligence, it has become easy to access and use LLMs through available frameworks, such as LlamaIndex and LangChain. But what about extending these LLMs with web-scraped data?

In this article, we'll explain how to use LLMs and web scraping for RAG applications. We'll start by defining the related concepts and then go through a step-by-step tutorial on applying them with both LlamaIndex and LangChain in Python. Let's get started!

What Are Large Language Models (LLMs)?

Large Language Models (LLMs) are machine learning models specialized in human text. They can understand and generate text based on a given input. LLMs reply to a prompt by processing its text and evaluating it against the patterns learned from their training data.

In simple terms, LLMs are built using a specific type of machine learning model called neural networks. These networks are trained on a significant amount of plain text data. Once an LLM receives input, it processes it in two major steps:

Tokenization: The prompt text gets broken into smaller units called tokens. These tokens can be words, characters, or even whole phrases.
Generation: After the input is processed, the response is generated one token at a time, based on the prompt and the patterns learned during training.
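
To make the two steps above more tangible, here's a toy sketch. It's a deliberate simplification and not how production LLMs work: the tokenizer is a plain regular expression, and the "model" is a bigram frequency table rather than a neural network. The function names and the tiny corpus are made up for this illustration.

```python
import re
from collections import Counter, defaultdict


def tokenize(text: str) -> list[str]:
    """Split text into lowercase word and punctuation tokens (toy tokenizer)."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def train_bigrams(corpus: str) -> dict[str, Counter]:
    """Count which token follows which in the training text."""
    tokens = tokenize(corpus)
    follows: dict[str, Counter] = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        follows[current][nxt] += 1
    return follows


def generate(follows: dict[str, Counter], prompt: str, max_new_tokens: int = 5) -> list[str]:
    """Extend the prompt one token at a time, picking the most frequent follower."""
    tokens = tokenize(prompt)
    if not tokens:
        return []
    for _ in range(max_new_tokens):
        candidates = follows.get(tokens[-1])
        if not candidates:
            break  # the last token was never followed by anything in training
        tokens.append(candidates.most_common(1)[0][0])
    return tokens


# Tiny made-up training corpus
corpus = "web scraping feeds llms . web scraping feeds llms . llms generate text ."
model = train_bigrams(corpus)

print(tokenize("Web scraping feeds LLMs."))
# ['web', 'scraping', 'feeds', 'llms', '.']
print(generate(model, "web", max_new_tokens=3))
# ['web', 'scraping', 'feeds', 'llms']
```

Real LLMs replace the frequency table with a neural network that scores every possible next token, but the loop of appending one token at a time is the same idea.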

Using LLMs for web scraping enables various use cases due to their capabilities in text understanding, such as sentiment analysis, answering questions, summarizing text, or assisting in code generation, such as using ChatGPT for web scraping.

What Is Retrieval Augmented Generation (RAG)?

Retrieval Augmented Generation (RAG) is a technique used to optimize a large language model output. To understand why it is used, let's explore a commonly encountered annoyance.

An LLM can be trained with terabytes of data and billions of parameters. However, it may lack understanding of a specific, niche, or private business domain. At the same time, re-training an LLM model is a time-consuming task and requires lots of engineering resources.

The RAG technique allows for extending a pre-trained LLM model with additional datasets. This approach enables the model to be aware and up-to-date with a specific context, making it far more accurate at answering questions or providing assistance with submitted prompts.
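
As a rough illustration of the idea, the sketch below retrieves the most relevant passage for a question and places it into the prompt. It assumes a naive word-overlap score instead of real embeddings, and the passages, function names, and prompt wording are all hypothetical sample data for this example.

```python
import re


def words(text: str) -> set[str]:
    """Lowercase the text and return its distinct words."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(question: str, passages: list[str], top_k: int = 1) -> list[str]:
    """Return the top_k passages sharing the most words with the question."""
    ranked = sorted(passages, key=lambda p: len(words(question) & words(p)), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, context: list[str]) -> str:
    """Augment the question with the retrieved context before sending it to an LLM."""
    joined = "\n".join(f"- {passage}" for passage in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


# Hypothetical scraped passages
passages = [
    "The dark energy potion has a bold cherry cola flavor.",
    "The box of chocolate candy costs 24.99 dollars.",
    "Shipping takes three to five business days.",
]

question = "What is the flavor of the dark energy potion?"
prompt = build_prompt(question, retrieve(question, passages))
print(prompt)
```

The model itself stays untouched. Only the prompt changes, which is why RAG avoids the cost of re-training.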

How to Use Web Scraping For RAG?

In the following sections, we'll go through a step-by-step guide on applying web scraping with LLMs to create a context-augmented RAG model.

This workflow can be broken down into the following steps:

Scrape web page data.
Augment the LLM's context with the scraped data.

That being said, there are two challenges associated with this web scraping LLM workflow:

LLMs can't interpret or understand HTML data.
Native communication with LLMs can be complex.

To address the above challenges, we'll use Scrapfly to scrape web pages as text or markdown, as both formats are accessible to LLMs. As for LLM communication, we'll use LlamaIndex and LangChain.

Scrape Web Pages For LLMs With Scrapfly

It's common for web scraping tools to send HTTP requests to web pages in order to retrieve their data as HTML. However, when using web scraping as the RAG data source, we have to extract the web data in a format that LLMs understand, either as Text or Markdown.

For this, we'll use Scrapfly, a web scraping API that lets you choose the data extraction format from several options, including Text and Markdown. Moreover, Scrapfly allows for scraping at scale by providing:

Anti-scraping protection bypass - For bypassing anti-scraping protection mechanisms, such as Cloudflare.
Millions of residential proxy IPs in 50+ countries - For preventing IP address blocking and throttling while also allowing for scraping from almost any geographical location.
Easy to use Python and Typescript SDKs, as well as Scrapy integration.
And much more!

The Scrapfly service does the heavy lifting for you!

Here's how to use Scrapfly for LLM web scraping as Markdown using the Python SDK:

from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

scrapfly = ScrapflyClient(key="Your Scrapfly API key")

api_response: ScrapeApiResponse = scrapfly.scrape(
    ScrapeConfig(
        # target website URL
        url="https://web-scraping.dev/login",
        # bypass anti scraping protection
        asp=True,
        # set the proxy location to a specific country
        country="US",
        # specify the proxy pool
        proxy_pool="public_residential_pool",
        # enable JavaScript rendering (use a cloud browser)
        render_js=True,
        # specify the web scraping format
        format="markdown"
    )
)

# get the results
data = api_response.scrape_result['content']
print(data)
"""
[web-scraping.dev](https://web-scraping.dev/)

  * Docs 
    * [API](https://web-scraping.dev/docs)
    * [Graphql](https://web-scraping.dev/api/graphql)
  * [Products](https://web-scraping.dev/products)
  * [Reviews](https://web-scraping.dev/reviews)
  * [Testimonials](https://web-scraping.dev/testimonials)

  * [login](https://web-scraping.dev/login)
  ....
"""

For the rest of this guide, we'll be using Scrapfly to extract the data required for RAG system building. To follow along, sign up to get your Scrapfly API key.
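
The snippets in this guide hardcode API keys as placeholder strings for brevity. A common alternative is reading them from environment variables, sketched below. The `require_env` helper and the `SCRAPFLY_API_KEY` variable name are assumptions made for this example, not conventions of the Scrapfly SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running the script")
    return value


# Only here so the sketch runs standalone; remove this line in real code
os.environ.setdefault("SCRAPFLY_API_KEY", "demo-key-for-illustration")

# SCRAPFLY_API_KEY is a name chosen for this sketch
scrapfly_api_key = require_env("SCRAPFLY_API_KEY")
print(f"Loaded a key of length {len(scrapfly_api_key)}")
```

The returned value can then be passed wherever the examples use the "Your Scrapfly API key" placeholder.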

LlamaIndex

LlamaIndex is an open-source framework for connecting datasets into large language models. It provides the necessary components required for building context-augmented LLMs.

The context augmentation allows a model to be aware of the provided datasets, allowing for various use cases, including:

Retrieval-augmented generation (RAG) models.
Document understanding, summarization, and extraction.
Automated agents with reasoning and decision-making capabilities.
Multi-modal applications with both text and image understanding.

To build RAG models with LlamaIndex, we'll use it as the interface between web scraping and the LLM. For this, we'll utilize Scrapfly's LlamaIndex web scraping integration, which retrieves web page data as markdown documents that LLMs can consume.

Setup

First, let's install the required Python packages:

llama-index: The LlamaIndex Python SDK. We'll use it to build the RAG model on top of an LLM.
llama-index-readers-web: The LlamaIndex web loaders, which contain Scrapfly's document loader.
scrapfly-sdk: Scrapfly Python SDK. It's required by the Scrapfly document loader.

The above packages can be installed using the following pip command:

pip install llama-index llama-index-readers-web scrapfly-sdk

Using LlamaIndex ScrapflyReader

Let's start by using LlamaIndex web scraping to retrieve a web page to feed to the LLM. For this, we'll use the LlamaIndex ScrapflyReader:

from llama_index.readers.web import ScrapflyReader

# Initiate ScrapflyReader with your Scrapfly API key
scrapfly_reader = ScrapflyReader(
    api_key="Your Scrapfly API key",
    ignore_scrape_failures=True, # Ignore unprocessable web pages and log their exceptions
)

scrapfly_scrape_config = {
    "asp": True, # Bypass scraping blocking and antibot solutions, like Cloudflare
    "render_js": True, # Enable JavaScript rendering with a cloud headless browser
    "proxy_pool": "public_residential_pool", # Select a proxy pool (datacenter or residential)
    "country": "us", # Select a proxy location
    "auto_scroll": True, # Auto scroll the page
    "js": "", # Execute custom JavaScript code by the headless browser
}

# Load documents from URLs as markdown
documents = scrapfly_reader.load_data(
    urls=["https://web-scraping.dev/products"], # List of URLs to scrape
    scrape_config=scrapfly_scrape_config, # Pass the scrape config
    scrape_format="markdown", # The scrape result format, either `markdown`(default) or `text`
)

print(documents)

The above code is fairly straightforward. Let's break down its workflow:

The ScrapflyReader gets initialized using the Scrapfly API key.
A scrapfly_scrape_config object is created. It represents the Scrapfly API parameters to use with each scrape request.
The load_data method takes a list of URLs, scrapes each one as markdown, and converts the results into documents.

Now that the documents are ready, let's proceed with the RAG model creation by augmenting an LLM with the scraped data.

LlamaIndex RAG Model

LlamaIndex integrates with most of the popular LLM providers. These include cloud LLMs, such as OpenAI, Mistral, and Gemini, as well as local LLMs served through tools such as Ollama. However, using cloud LLMs requires having a subscription to the selected provider. Hence, running local models with Ollama can be a great alternative.

In this guide on using web scraping for retrieval-augmented generation, we'll use OpenAI as the LLM, which is the default LLM for LlamaIndex SDK. For instructions on using other LLMs, refer to the official LlamaIndex examples documentation.

Here's how to use web scraping for RAG models using OpenAI. First, get your OpenAI key and use the following code:

import os

from llama_index.readers.web import ScrapflyReader
from llama_index.core import VectorStoreIndex

scrapfly_reader = ScrapflyReader(
    api_key="Your Scrapfly API key",
    ignore_scrape_failures=True,
)

# Load documents from URLs as markdown
documents = scrapfly_reader.load_data(
    urls=["https://web-scraping.dev/products"]
)

# Set the OpenAI key as an environment variable
os.environ['OPENAI_API_KEY'] = "Your OpenAI Key"

# Create an index store for the documents
index = VectorStoreIndex.from_documents(documents)

# Create the RAG engine using the index store
query_engine = index.as_query_engine()

# Submit a query
response = query_engine.query("What is the flavor of the dark energy potion?")
print(response)
"The flavor of the dark energy potion is bold cherry cola."

Here, we start by creating a VectorStoreIndex, a component required by the RAG model. It splits the documents into a set of chunks, computes an embedding for each chunk, and keeps them in memory. Then, we create a query_engine over the index, which uses the LLM to answer queries.

The above query prompt example briefly illustrates how to use retrieval augmented generation with web scraping. We asked a question regarding the scraped data and got the correct result!
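
To build some intuition for what a vector index does, here's a toy in-memory version. It is not LlamaIndex's implementation: it assumes bag-of-words counts in place of learned embeddings, and the `ToyVectorIndex` class and the sample chunks are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words term-frequency vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorIndex:
    """Keeps (chunk, vector) pairs in memory and ranks them by similarity."""

    def __init__(self, chunks: list[str]):
        self.entries = [(chunk, embed(chunk)) for chunk in chunks]

    def query(self, question: str, top_k: int = 1) -> list[str]:
        question_vector = embed(question)
        ranked = sorted(self.entries, key=lambda e: cosine(question_vector, e[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:top_k]]


# Hypothetical chunks of a scraped product page
index = ToyVectorIndex([
    "Dark energy potion with a bold cherry cola flavor",
    "Box of chocolate candy with orange and cherry flavors",
    "Hiking boots for outdoor adventures",
])
print(index.query("What is the flavor of the dark energy potion?"))
# ['Dark energy potion with a bold cherry cola flavor']
```

A real index swaps `embed` for a neural embedding model, so semantically similar text matches even when it doesn't share exact words.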

That being said, RAG for web scraping can be utilized for more advanced data processing tasks. For example, let's attempt to convert the web page data into a clean JSON dataset using a query prompt:

index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()

response = query_engine.query("Add the product data into a JSON dataset as an array of objects")
print(response)

From the query response, we can observe that the RAG model took care of the data parsing, processing, and cleaning:

[
    {
        "name": "Box of Chocolate Candy",
        "url": "https://web-scraping.dev/product/1",
        "description": "Indulge your sweet tooth with our Box of Chocolate Candy...",
        "price": 24.99
    },
    ....
]
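
Since the response is generated text, it isn't guaranteed to be valid JSON or to follow the same schema every time. Below is a minimal sketch of validating such a response before using it. The `response_text` value is a hypothetical string modeled on the output above, and `parse_products` and `REQUIRED_KEYS` are names made up for this example.

```python
import json

# Hypothetical model response, modeled on the output shown above
response_text = """
[
    {
        "name": "Box of Chocolate Candy",
        "url": "https://web-scraping.dev/product/1",
        "description": "Indulge your sweet tooth with our Box of Chocolate Candy...",
        "price": 24.99
    }
]
"""

REQUIRED_KEYS = {"name", "url", "description", "price"}


def parse_products(text: str) -> list[dict]:
    """Parse the model output and keep only well-formed product objects."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return []  # the model replied with something that isn't JSON
    if not isinstance(data, list):
        return []
    return [item for item in data if isinstance(item, dict) and REQUIRED_KEYS.issubset(item)]


products = parse_products(response_text)
print(products[0]["name"], products[0]["price"])
# Box of Chocolate Candy 24.99
```

With a real query engine, the text to parse would come from `str(response)`.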

LangChain

LangChain is another popular framework for communicating with LLMs. It provides several components for working with and processing natural language for several use cases, including:

Building LLM-powered applications
Chatbots for context-augmented conversations
Agents with action-taking capabilities
Retrieval augmented generation (RAG) applications

To combine LLMs and web scraping for LangChain RAG models, we will utilize Scrapfly's LangChain web scraping integration. It exposes the Scrapfly API capabilities, including retrieving web pages' data as Markdown and Text.

Setup

Let's start with the installation process. We'll install the core LangChain Python packages, as well as additional utility packages:

langchain: The core LangChain Python SDK.
langchainhub: LangChain hub to pull the RAG prompt template.
langchain-community: A package containing third-party LangChain integration tools, including the ScrapflyLoader.
langchain-chroma: LangChain's Chroma class for creating vector stores.
langchain-openai: OpenAI integration, which we'll use as the LLM.
langchain-text-splitters: A utility tool for splitting documents' text into chunks.
scrapfly-sdk: Scrapfly Python SDK. It's required by the LangChain ScrapflyLoader.

Install the above packages using the following pip command:

pip install langchain langchainhub langchain-community langchain-chroma langchain-openai langchain-text-splitters scrapfly-sdk

Using LangChain ScrapflyLoader

The first step in building LangChain RAG models is extracting the data to augment the LLM's context. For this, we'll use the ScrapflyLoader to scrape a web page as markdown:

from langchain_community.document_loaders import ScrapflyLoader

scrapfly_scrape_config = {
    "asp": True, # Bypass scraping blocking and antibot solutions, like Cloudflare
    "render_js": True, # Enable JavaScript rendering with a cloud headless browser
    "proxy_pool": "public_residential_pool", # Select a proxy pool (datacenter or residential)
    "country": "us", # Select a proxy location
    "auto_scroll": True, # Auto scroll the page
    "js": "", # Execute custom JavaScript code by the headless browser
}

scrapfly_loader = ScrapflyLoader(
    urls=["https://web-scraping.dev/products"],
    api_key="Your Scrapfly API key",
    continue_on_failure=True, # Ignore unprocessable web pages and log their exceptions
    scrape_config=scrapfly_scrape_config, # Pass the scrape_config object
    scrape_format="markdown", # The scrape result format, either `markdown` (default) or `text`
)

# Load documents from URLs as markdown
documents = scrapfly_loader.load()

print(documents)

Here, we create a scrapfly_scrape_config object with the desired Scrapfly API parameters to use with the scrape requests. Then, we pass it to the ScrapflyLoader along with the web page URLs to scrape.

The next step is to load the scraped markdown documents into an LLM for the LangChain RAG application building.

LangChain RAG Model

LangChain has native integrations with tens of LLM providers through both cloud and local setups. In this example of a RAG application built with web scraping and LangChain, we'll be using OpenAI as the LLM of choice.

The first step is creating an OpenAI key from the account dashboard, which requires an OpenAI subscription. A great alternative is using local LLM frameworks, such as Ollama. Refer to the documentation example for the usage instructions.

Here's how to utilize web scraping with LangChain to create a RAG application with OpenAI as an LLM:

import os

from langchain import hub
from langchain_chroma import Chroma
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import ScrapflyLoader

scrapfly_loader = ScrapflyLoader(
    urls=["https://web-scraping.dev/products"],
    api_key="Your Scrapfly API key",
    continue_on_failure=True,
)

# Load the web page data into markdown documents
documents = scrapfly_loader.load()

# Set the OpenAI key as an environment variable
os.environ["OPENAI_API_KEY"] = "Your OpenAI key"

# Create a chunk splitter with 1000 chars each and 200 chars to overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

# Split the documents into chunks
splits = text_splitter.split_documents(documents)

# Create a vector store
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())

# Create a retriever object to support document searches
retriever = vectorstore.as_retriever()

In the above code, we start by retrieving the web pages as markdown documents using ScrapflyLoader. After the documents are retrieved, they get processed through a few steps to create a searchable vector store:

We initialize a text_splitter to split the documents into chunks. A large chunk makes fitting documents into the limited model context harder. The chunk overlap prevents important words from being separated from their full context during the process.
We create a vectorstore with the divided chunks, using OpenAI as the embedding model.
We then create a retriever object to fetch the relevant documents based on the submitted prompt.
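
To see how chunk size and overlap interact, here's a simplified sketch that cuts text into fixed-size character windows. Note that this is not the algorithm used by RecursiveCharacterTextSplitter, which also tries to split on separators such as paragraphs and sentences. The `split_with_overlap` function is a name chosen for this illustration.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(text, chunk_size=10, chunk_overlap=4)
print(chunks)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk starts with the last four characters of the previous one, so a phrase that falls on a boundary still appears whole in at least one chunk.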

Next, we'll use the vector store retriever with OpenAI to build the RAG chain model:

#....
retriever = vectorstore.as_retriever()

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Use OpenAI as the LLM model
model = ChatOpenAI()

# Use rag-prompt as the prompt template https://smith.langchain.com/hub/rlm/rag-prompt
prompt = hub.pull("rlm/rag-prompt")

# Create a QA retriever chain to pass the documents with each prompt
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

# Submit a prompt query
response = rag_chain.invoke("What are the chocolate candy box flavors?")
print(response)
"The chocolate candy box flavors include zesty orange and sweet cherry."

Let's break down the above code:

Define a format_docs function to join the retrieved documents into a single context string.
Use OpenAI's chat model as the LLM.
Pull the rag-prompt template from the LangChain hub to instruct the model. Refer to the prompt templating docs for creating custom templates.
Create the rag_chain as a pipeline to process incoming prompt queries.

From the prompt response, we can see that the LangChain RAG model can effectively understand and query the extracted data!
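
For readers new to the pipe syntax, the data flow of the chain can be approximated with plain function calls. The sketch below is only an analogy, assuming stand-ins for every component: `Doc`, `fake_retriever`, `fill_prompt`, and `fake_model` are hypothetical, and the prompt wording is not the actual rag-prompt template.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Stand-in for a retrieved document; only page_content is modeled."""
    page_content: str


def fake_retriever(question: str) -> list[Doc]:
    # Hypothetical retriever that always returns the same two chunks
    return [Doc("Box of Chocolate Candy"), Doc("Flavors: zesty orange and sweet cherry")]


def format_docs(docs: list[Doc]) -> str:
    return "\n\n".join(doc.page_content for doc in docs)


def fill_prompt(inputs: dict) -> str:
    # Made-up template wording for this sketch
    return f"Context:\n{inputs['context']}\n\nQuestion: {inputs['question']}\nAnswer:"


def fake_model(prompt: str) -> str:
    # Stub that reports the prompt size instead of calling a real LLM
    return f"(stub answer based on a {len(prompt)}-character prompt)"


def rag_chain(question: str) -> str:
    # Step 1: fan the question out into the two prompt inputs
    inputs = {"context": format_docs(fake_retriever(question)), "question": question}
    # Steps 2 and 3: fill the template, then call the model
    return fake_model(fill_prompt(inputs))


print(rag_chain("What are the chocolate candy box flavors?"))
```

The dictionary at the start of the chain plays the same role as `inputs` here: the question goes to the retriever to produce the context, and is also passed through unchanged.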

FAQ

To wrap up this guide on building a RAG system for web scraping, let's have a look at some frequently asked questions.

Why use web scraping for RAG applications?

Using web scraping for RAG applications can empower various use cases based on the data domain, including:

Private or domain-specific data for enhanced business utilities.
Opinionated text data used for research purposes, which is found on public social media platforms, such as Twitter and Reddit.

What is the difference between RAG and LLM?

LLM refers to a large language model: a neural network model trained on a vast amount of text data, making it able to understand human text. Popular LLM examples are ChatGPT and Gemini. On the other hand, RAG refers to retrieval-augmented generation. It enhances an existing LLM with custom data retrieved at query time, making the model aware of the provided datasets without re-training it.

Can LLMs understand HTML?

The short answer is no. LLMs are trained to comprehend linear text data, but HTML follows a tree-based structure, which is challenging for LLMs to interpret and understand. Hence, using web scraping for LLMs requires the extracted data to be parsed. Such a solution is provided by Scrapfly's format feature, enabling scraping any web page as text or markdown.
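
To illustrate what such parsing involves, here's a crude sketch that strips HTML down to its visible text using only Python's standard library. It's far simpler than a production HTML-to-markdown conversion and is not how Scrapfly's format feature is implemented. The `TextExtractor` class and the sample HTML are made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text and skip the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.parts.append(text)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML string, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


html = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Products</h1><p>Box of <b>Chocolate</b> Candy</p></body></html>"
)
print(html_to_text(html))
```

Note how the inline `<b>` tag splits one sentence across lines. Handling such details, along with links, lists, and tables, is what makes a dedicated conversion feature worthwhile.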

Summary

In this guide, we have explained what LLMs and RAG applications are and how they compare to each other: LLMs are the text models themselves, which get fed with custom data to build the RAG application.

Then, we went through a step-by-step guide to combining LLMs with web scraping to build RAG systems using both LlamaIndex and LangChain. In a nutshell, the required steps are:

Scrape the web page as text or markdown documents.
Load the documents into a vector store.
Use the generated vector store as a retriever to augment the LLM's context.
