Part 2/3: Creating a Markdown Q&A ChatBot: Chunking Documentation Using LangChain and OpenAI; then creating Embeddings

In part one of this series, we downloaded the TemporalIO Java documentation and transformed it into markdown files. If you haven't checked it out yet, go read that post first. In this part, we will chunk the documentation using LangChain and then create embeddings with OpenAI to store in our Pinecone index. If you're only interested in the full code, check out my GitHub repo and give it a star!

Let's get this show on the road...

2a. Chunking Documentation Using LangChain and OpenAI

In this section, we will dive into splitting the documentation into chunks using LangChain's MarkdownHeaderTextSplitter. Because the documentation has been transformed into markdown files, we can decide how we want to split up the chunks. I decided to split on headers.

Step A: Importing Necessary Libraries

from langchain.text_splitter import MarkdownHeaderTextSplitter
import os
  • We import the MarkdownHeaderTextSplitter class from the langchain.text_splitter module. This class helps in splitting a markdown document based on its headers.

  • The os module is imported, which will be used to traverse directories and handle files.

Step B: Choosing what to split on

I decided to split the documents on # (Header 1) and ## (Header 2):

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2")
]
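
To see what the splitter produces, here's a minimal sketch (the sample markdown and the printed output are illustrative). Each chunk is a Document whose metadata records the headers it was split under:

from langchain.text_splitter import MarkdownHeaderTextSplitter

sample = """# Workflows

Intro text about workflows.

## Signals

Signals send data into a running workflow."""

splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
for chunk in splitter.split_text(sample):
    print(chunk.metadata, "->", chunk.page_content)

# {'Header 1': 'Workflows'} -> Intro text about workflows.
# {'Header 1': 'Workflows', 'Header 2': 'Signals'} -> Signals send data into a running workflow.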

Step C: Get all the Markdown files

def get_markdown_files():
    markdown_files = []
    for root, dirs, files in os.walk("text"):
        for file in files:
            if file.endswith(".md"):
                markdown_files.append(os.path.join(root, file))
    return markdown_files
  • We initialize an empty list, markdown_files, to store the paths of all markdown files.

  • Using os.walk("text"), we recursively traverse the "text" directory.

  • For every file we encounter, if its extension is ".md" (indicating it's a markdown file), we add its full path to the markdown_files list.

  • The function returns the populated markdown_files list.
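
As a quick usage check, you can call the function and inspect the result (the example paths are hypothetical):

markdown_files = get_markdown_files()
print(f"Found {len(markdown_files)} markdown files")
# e.g. ['text/docs/workflows.md', 'text/docs/activities.md', ...]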

Step D: Split the Markdown files

def split_markdown_files(markdown_files):
    markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
    split_docs = []
    for file in markdown_files:
        # Read the markdown file
        with open(file, "r") as f:
            markdown_text = f.read()
            md_header_splits = markdown_splitter.split_text(markdown_text)
            split_docs.append(md_header_splits)
    return split_docs
  • We initialize an instance of MarkdownHeaderTextSplitter named markdown_splitter with our defined headers.

  • split_docs is an empty list to store split versions of our documents.

  • We loop over each markdown file in the markdown_files list.

    • Within the loop, each markdown file is opened and its content is read into the markdown_text variable.

    • We then use markdown_splitter.split_text(markdown_text) to split the markdown content based on the specified headers.

    • The split content (md_header_splits) is appended to the split_docs list.

  • Finally, the function returns the split_docs list, containing split versions of our original markdown documents.
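
One thing worth noting: split_docs ends up as a list of lists, with one inner list of Document chunks per markdown file. A short sketch of how you might inspect it:

markdown_files = get_markdown_files()
split_docs = split_markdown_files(markdown_files)

# split_docs[i] is the list of chunks for the i-th file
first_chunk = split_docs[0][0]
print(first_chunk.page_content)  # the chunk's text, with header lines stripped
print(first_chunk.metadata)      # e.g. {'Header 1': '...', 'Header 2': '...'}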

2b. Create embeddings to store in Pinecone

In this section, we will discuss how to create embeddings and store them in our Pinecone vector store.

Step A: Import necessary libraries

import importlib
import openai
import pinecone
import os
from dotenv import load_dotenv
from langchain.vectorstores import Pinecone
from langchain.embeddings.openai import OpenAIEmbeddings
  • importlib provides dynamic imports. We need it because the chunking script's filename, chunk-docs.py, contains a hyphen and therefore can't be loaded with a regular import statement.

  • openai and pinecone are the Python SDKs for OpenAI and Pinecone, respectively.

  • os lets us read API keys and settings from environment variables via os.environ.

  • dotenv lets you load environment variables from a .env file.

  • The last two imports are specific functionalities from the langchain library related to storing vectors in Pinecone and embeddings using OpenAI.

Step B: Set API Keys from env variables

load_dotenv()
openai.api_key = os.environ["OPENAI_API_KEY"]
  • Loads environment variables from a .env file using load_dotenv().

  • Sets the OpenAI API key.
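
For reference, a .env file for this script only needs the variables the code reads, including the Pinecone settings used in the next step (all values below are placeholders):

OPENAI_API_KEY=sk-...
PINECONE_API_KEY=your-pinecone-api-key
PINECONE_ENV=your-pinecone-environment
PINECONE_INDEX_NAME=your-index-name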

Step C: Initialize and prepare Pinecone index

pinecone.init(
    api_key = os.environ["PINECONE_API_KEY"],
    environment = os.environ["PINECONE_ENV"]
)

index_name = os.environ["PINECONE_INDEX_NAME"]

if index_name not in pinecone.list_indexes():
    print("Index does not exist, creating it")
    pinecone.create_index(
      name=index_name,
      metric='cosine',
      dimension=1536  
    )

# Initialize OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

# import the chunk-docs.py file
chunk_docs = importlib.import_module('chunk-docs')

# Get the markdown_files and then the chunks
markdown_files = chunk_docs.get_markdown_files()
split_docs = chunk_docs.split_markdown_files(markdown_files)
  • This initializes Pinecone with the necessary API key and the environment, both sourced from environment variables.

  • We get the index_name from the environment variables.

  • We then check if the Pinecone index with that name already exists.

  • If not, a new index is created using pinecone.create_index() with the given name, the cosine similarity metric, and dimension 1536, which matches the size of the vectors produced by OpenAI's text-embedding-ada-002 model (the default model used by OpenAIEmbeddings).

  • We initialize an OpenAIEmbeddings instance, which will generate the embedding vectors.

  • We import chunk-docs.py via importlib (a normal import statement can't handle the hyphen in the filename) and call its functions to get the chunked documents. An optional sanity check on the index follows below.
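
As the optional sanity check mentioned above (using the same pinecone-client style API as the rest of this script), you can confirm the index is reachable and see how many vectors it currently holds:

index = pinecone.Index(index_name)
print(index.describe_index_stats())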

Step D: Process and store in Pinecone Index

for doc in split_docs:
    # Each "doc" is the list of chunks from one markdown file; from_documents
    # embeds the chunks with OpenAI and upserts the vectors into the index
    vector_store = Pinecone.from_documents(doc, embeddings, index_name=index_name)
  • The documents are passed to Pinecone.from_documents() along with the OpenAI embeddings and the index name.

  • This method embeds each chunk's text with OpenAI and upserts the resulting vectors, along with the chunk text and metadata, into the specified Pinecone index.
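
To verify the upload end to end, a quick similarity search against the index might look like this (the question string is just an example):

vector_store = Pinecone.from_existing_index(index_name, embeddings)
results = vector_store.similarity_search("How do I register a Workflow implementation?", k=3)
for doc in results:
    print(doc.metadata, "->", doc.page_content[:80])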

Conclusion

To summarize, the code we've written:

  1. Detects all markdown files within the "text" directory.

  2. Splits each of those markdown files into chunks based on the specified headers (in this case, # and ##).

  3. Converts those text chunks into embeddings using OpenAI.

  4. Stores the embeddings in a Pinecone index for later retrieval and similarity search.

Next, we will cover how to put this all together and create a Streamlit app where we can chat with the embedded documents! If you're interested in the full code, check out my GitHub repo and give it a star!