Introduction#
Before explaining what RAG is, let's first look at the shortcomings of the LLMs we are currently using:
- Providing false information when they don't actually have an answer (the output can even be nonsensical).
- Offering outdated or generic information when users need specific, current responses (because the model has a training cutoff date).
- Generating responses from non-authoritative sources (since the training data comes from the internet, it may contain inaccuracies).
- Confusing terminology: different training sources use the same terms to discuss different things, leading to inaccurate responses.
Another, more practical reason is that we have private data that the models on the market have certainly never been trained on, so querying our own dataset is simply not possible with them.
Thus, RAG (Retrieval-Augmented Generation) emerged — it modifies the interaction with large language models (LLMs) to reference a specified set of documents when responding to user queries, prioritizing this information over extracting information from its vast static training data. This way, LLMs can utilize domain-specific and/or up-to-date information. Use cases include allowing chatbots to access internal company data or providing factual information solely from authoritative sources.
Process#
- First, we convert our knowledge data into vectors with an embedding model and store them in a vector database.
- Then, at query time, we convert the user's input into a vector as well and use it to search the vector database (Retrieval), matching the closest few pieces of data by similarity. The retrieved content is combined with the user's question in the prompt (Augmented), and a large model produces the answer (Generation). Together, these steps form the complete RAG flow.
Code Implementation#
As front-end developers, the language we know best is JavaScript, and fortunately the well-known LangChain framework has a JS version, which makes it easier for front-end developers to get into AI development. Without further ado, let's get started!
Simple Usage#
Next, we will demonstrate how to use RAG through code. For simplicity, we will use the JavaScript version of LangChain and the llama3.2:3b-instruct-fp16 model running locally via Ollama. You can replace it with other models as needed.
import { ChatOllama } from "@langchain/ollama";
import { StringOutputParser } from '@langchain/core/output_parsers'
const llm = new ChatOllama({
  model: 'llama3.2:3b-instruct-fp16',
  temperature: 0, // Controls the randomness of generated content
});
const res = await llm.invoke('Hello');
// Format the output text
const parser = new StringOutputParser();
const parsed = await parser.invoke(res);
console.log(parsed);
// Output: Hello! I am your AI assistant, welcome to our conversation. How can I assist you?
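Since both the chat model and the output parser are Runnables, they can also be composed with .pipe() instead of being invoked one after another; this is the same composition mechanism the RAG chain at the end relies on. A minimal sketch (the variable names here are just illustrative):
// Compose model and output parser into a single runnable, equivalent to the two invoke() calls above
const simpleChain = llm.pipe(new StringOutputParser());
const text = await simpleChain.invoke('Hello');
console.log(text);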
Loading Documents#
Since our goal is to enhance search over our own data, the first step is to load documents. LangChain provides loaders for many different document types (Document Loading), covering essentially every common format; if you're interested, you can check them out. This time we are loading a remote web page.
import { CheerioWebBaseLoader } from "@langchain/community/document_loaders/web/cheerio";
const loader = new CheerioWebBaseLoader("https://goodcheng.wang/web3", {
  selector: 'article' // Specify the area to load content from
});
const doc = await loader.load();
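If you want to check what the loader actually returned: load() gives back an array of Document objects, each with the extracted text in pageContent and metadata such as the source URL. A quick, optional way to inspect it (the preview length is arbitrary):
// loader.load() returns an array of Document objects
console.log(doc.length); // how many documents were loaded
console.log(doc[0].metadata); // e.g. { source: "https://goodcheng.wang/web3" }
console.log(doc[0].pageContent.slice(0, 200)); // preview the first 200 characters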
Splitting Documents#
Why split documents? Embedding the entire document directly seems simple, but it is actually impractical. The reasons are as follows:
- Memory and computational limits: Large-scale documents (like entire books, long reports, etc.) are usually very large. If you directly embed the entire document, the computational resources and memory consumption will be very high. An obvious issue is that the models we use do not support such large inputs.
- Context window limits: Most existing natural language processing (NLP) models have a certain context window size limit when processing text, meaning the model can only handle a certain number of tokens. If the text exceeds this window size, the excess will be truncated or ignored, leading to a loss of a lot of information. Therefore, splitting the document into smaller chunks, with each chunk's length fitting the model's context limits, ensures that the model can fully utilize the input context information when generating embeddings.
- Improving retrieval and matching accuracy: Splitting documents can also improve the accuracy of information retrieval. After embedding the entire document, when querying specific questions, the embedding may cover too much irrelevant information, leading to imprecise matches. In contrast, if the document is split and each segment is embedded, queries can retrieve only the most relevant paragraphs, resulting in more precise outcomes.
- and so on
Here, we split the text into chunks of 500 characters, with 100 characters of overlap between adjacent chunks. If you're interested, you can explore this yourself on the Text-splitter demo website.
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
const splitter = new RecursiveCharacterTextSplitter({
  chunkOverlap: 100, // Number of overlapping characters between adjacent chunks, preserving continuity
  chunkSize: 500, // Number of characters per chunk
});
const allSplits = await splitter.splitDocuments(doc);
console.log(allSplits[0]); // Print the first split chunk
Storing Content Embeddings in a Vector Database#
Next, we will embed the split document content and store it in a vector database. In this example, we use an in-memory vector store (MemoryVectorStore); in practical applications, it can be swapped for any other VectorStore implementation.
import { OllamaEmbeddings } from "@langchain/ollama";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
const embeddings = new OllamaEmbeddings({
  model: "mxbai-embed-large", // Default value
  baseUrl: "http://localhost:11434", // Default value
});
const vectorStore = await MemoryVectorStore.fromDocuments(allSplits, embeddings);
const vectorStoreRetriever = vectorStore.asRetriever({
  k: 5, // Return the 5 most similar chunks
  searchType: 'similarity', // Rank matches by vector similarity
});
const find = await vectorStoreRetriever.invoke("What is web3.0")
console.log(find)
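The retriever returns an array of Document objects ranked by similarity; to look at just the matched text rather than the full objects, you can map over pageContent (purely a sanity check, not needed for the chain below):
// Print only the text of each retrieved chunk
console.log(find.map((d) => d.pageContent));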
Integrating into RAG Chain#
Finally, we will integrate all the above steps into a complete processing chain. First, we will write a prompt template for the generated chain:
import { ChatPromptTemplate } from '@langchain/core/prompts'
const ragPrompt = ChatPromptTemplate.fromMessages([
  [
    'human',
    `You are an assistant for question-answering tasks. Use the following retrieved context to answer the question. If you don't know the answer, just say you don't know. Use at most three sentences, and keep the answer concise.
Question: {question}
Context: {context}
Answer:`
  ]
])
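Before wiring the template into the chain, you can preview the message the model will actually receive by formatting it with sample values (the placeholder strings here are just for illustration):
// Fill the template with example values to preview the final prompt
const previewMessages = await ragPrompt.formatMessages({
  question: 'What is web3.0?',
  context: 'Example context retrieved from the vector store...',
});
console.log(previewMessages[0].content);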
Then, link the process using RunnableSequence:
import { formatDocumentsAsString } from "langchain/util/document";
import { RunnablePassthrough, RunnableSequence } from "@langchain/core/runnables";
const runnableRagChain = RunnableSequence.from([
  {
    context: vectorStoreRetriever.pipe(formatDocumentsAsString), // Retrieved context, formatted as a single string
    question: new RunnablePassthrough(), // Pass the user question through unchanged
  },
  ragPrompt, // Fill the prompt template
  llm, // Large language model
  new StringOutputParser(), // Parse the generated result into plain text
]);
const res = await runnableRagChain.invoke('What are the representative websites of web2.0?');
console.log(res);
// Output: Facebook and Twitter are famous websites of Web2.0.
The results are quite good~
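Since the whole chain is itself a Runnable, you can also stream the answer chunk by chunk instead of waiting for the full response; a small optional sketch:
// Stream the generated answer as it is produced
const stream = await runnableRagChain.stream('What are the representative websites of web2.0?');
for await (const chunk of stream) {
  process.stdout.write(chunk);
}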
Conclusion#
Through RAG (Retrieval-Augmented Generation), we can improve the accuracy of responses from existing large language models, especially in scenarios requiring up-to-date or domain-specific data. This article demonstrated the entire process, from document loading and splitting to embedding, retrieval, and final generation, using LangChain's JavaScript implementation.