Part 1/3: Creating a Markdown Q&A ChatBot with LangChain, OpenAI, and Pinecone: A Step-by-Step Guide
From Markdown to Machine Learning: Designing an Interactive Q&A Bot with Streamlit, OpenAI, and Pinecone
Introduction
Navigating complex documentation can often feel like deciphering an intricate maze. But what if you could simply ask a question and get an instant, relevant answer? I embarked on a journey to make this a reality using TemporalIO's Java documentation as my playground. By combining Streamlit's interactive front end, OpenAI's embeddings, and Pinecone's robust vector store, with LangChain to tie it all together, I crafted an intuitive Q&A bot. This tool not only understands your queries but fetches the most relevant chunks of documentation in response. Dive in as we unravel the step-by-step creation of this solution, and discover how you can implement a similar tool over any set of markdown documents.
This is a three-part series where I will break down each step into its own blog post. If you are interested in just the final product, I've provided the link to the repo at the end of each post. As a general overview, the three main steps that go into creating this Streamlit app are as follows:
1. Downloading and Formatting TemporalIO's Documentation
2. Chunking the Documentation with LangChain and Creating Embeddings with OpenAI
3. Crafting the Streamlit App: The Q&A Bot
Enough talk, let's diveeeee in!
1. Downloading and Formatting TemporalIO's Documentation
In this section, we will dive into the Python code responsible for downloading and formatting TemporalIO's Java documentation. The code leverages several libraries, including `BeautifulSoup`, `selenium`, and `html2text`, to efficiently extract and transform the documentation into markdown files.
Step A: Setting Up the Required Libraries
Before diving into the code, you need to ensure you have the following Python libraries installed:
- requests
- BeautifulSoup
- selenium
- html2text

You can install these via pip with:
pip install requests beautifulsoup4 selenium html2text
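With those installed, the imports used throughout this post look like the following (assuming the standard import names for each package):

```python
import requests
import html2text
from bs4 import BeautifulSoup
from selenium import webdriver
```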
Step B: Define the Target Domain and URL
domain = "docs.temporal.io/dev-guide/java"
full_url = "https://docs.temporal.io/dev-guide/java/"
Here, we specify the `domain` and the `full_url`, which points to the main landing page of TemporalIO's Java documentation that we aim to scrape.
Step C: Initialize the Selenium Web Driver
options = webdriver.ChromeOptions()
options.add_argument("--headless")  # run Chrome without a GUI (options.headless is deprecated in Selenium 4)
driver = webdriver.Chrome(options=options)
driver.get(full_url)
The above code initializes a headless Chrome browser using `selenium`. A headless browser is like a regular browser, but without a graphical user interface, which makes it useful for automated tasks and scripts like web scraping.
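As a quick, optional sanity check, you can print the page title to confirm the headless browser actually loaded the page:

```python
# Confirm the landing page loaded by printing its title
print(driver.title)
```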
Step D: Define the Web Crawling Function
The crawl()
function is the heart of the operation. This function:
Initializes a
queue
to manage the URLs we want to crawl and aseen
set to keep track of URLs we've already visited.Creates directories (
text/
andprocessed
) for storing the markdown files and processed CSV files respectively.Pops a URL from the queue and navigates to the page using
selenium
.Waits for the desired content to load using
WebDriverWait
.Parses the loaded page using
BeautifulSoup
to extract the content within a specific div that contains the documentation text.Converts this HTML content to Markdown using
html2text
and saves it to a file.
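Putting those steps together, here is a minimal sketch of what `crawl()` can look like. It reuses the `driver` from Step C; the `article` tag used to locate the documentation content, the ten-second timeout, and the file-naming scheme are my assumptions for illustration, so adjust them to the actual page structure:

```python
import os
from collections import deque
from urllib.parse import urlparse

import html2text
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def crawl(url):
    # Queue of URLs left to visit, and a set of URLs we've already seen
    # (seen becomes useful once you enqueue newly discovered links; see the note below)
    queue = deque([url])
    seen = {url}

    # Output directories for the markdown files and, later, the processed CSVs
    os.makedirs("text/", exist_ok=True)
    os.makedirs("processed/", exist_ok=True)

    while queue:
        current_url = queue.popleft()
        driver.get(current_url)

        # Wait (up to 10 seconds) for the documentation container to render
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.TAG_NAME, "article"))
        )

        # Parse the rendered page and pull out the element holding the docs text
        soup = BeautifulSoup(driver.page_source, "html.parser")
        content = soup.find("article")
        if content is None:
            continue

        # Convert the HTML to Markdown and save it under text/
        markdown = html2text.html2text(str(content))
        filename = urlparse(current_url).path.strip("/").replace("/", "_") or "index"
        with open(f"text/{filename}.md", "w", encoding="utf-8") as f:
            f.write(markdown)
```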
Step E: Execute the Crawl
Finally, we call the `crawl()` function, initiating the process:
crawl(full_url)
Note: Ensure you have the ChromeDriver properly set up in your PATH for Selenium to work. You might also want to consider error-handling mechanisms or add recursive crawling to fetch documentation from multiple pages, if necessary.
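If you do want recursive crawling, one hypothetical extension is to collect in-domain links from each parsed page inside the `while` loop and push the unseen ones onto the queue; the substring check below reuses the `domain` variable from Step B:

```python
from urllib.parse import urljoin

# Inside the while loop, after parsing the page into `soup`:
for link in soup.find_all("a", href=True):
    absolute = urljoin(current_url, link["href"])
    # Follow only links within the documentation section, and only once
    if domain in absolute and absolute not in seen:
        seen.add(absolute)
        queue.append(absolute)
```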
Conclusion
We've covered how to crawl one of TemporalIO's Java documentation pages and turn it into a markdown file. Next, we'll cover how to chunk and embed these documents using OpenAI, LangChain, and Pinecone. If you're interested in the full code, check out my GitHub Repo and give it a star!