The unfiltered story of turning a vague idea into a working AI chatbot over 48 hours, including every wrong turn and debugging rabbit hole along the way.
Friday Night: The Idea
It started with a problem at work. Our internal wiki had grown to over 800 pages of documentation, and nobody could find anything. The search function was keyword-based and borderline useless. People pinged me on Slack asking questions that were answered somewhere in the docs, and I spent an hour every day being a human search engine.
I had been reading about Retrieval-Augmented Generation and the idea clicked. Build a chatbot that reads our documentation, understands the questions, retrieves the relevant sections, and generates a coherent answer. A RAG chatbot. I had a free weekend ahead and figured I could get a prototype working.
I was right about the prototype part. I was wrong about the “free weekend” part.
The plan was straightforward. Use LangChain for orchestration, ChromaDB as the vector store, OpenAI’s API for the language model and embeddings, and Streamlit for a quick web interface. I had seen enough tutorials to be dangerous. The Real Python RAG tutorial was my starting point, though I ended up deviating from it significantly by Saturday afternoon.
I set up a fresh Python virtual environment, installed the dependencies, and started coding at 9 PM on Friday. By midnight, I had a document loader that could ingest our wiki export. By 1 AM, I was too excited to sleep and too stubborn to stop.
Saturday Morning: The Document Problem
The first real challenge was not the AI. It was the documents.
Our wiki exported as a collection of Markdown files, which sounds clean until you actually look at them. Some files were 200 words. Others were 15,000-word monster pages covering entire system architectures. Headers were inconsistent. Some pages used H2 for top-level sections, others used H3. Code blocks were nested inside tables inside collapsible sections. It was a mess.
This is the part that every tutorial glosses over. They show you how to load a PDF and chunk it into 500-token pieces, and it works beautifully on their demo data. Real documents are chaotic. If you chunk a 15,000-word page into fixed-size pieces, you split sentences in the middle, separate questions from their answers, and create fragments that make no sense in isolation. When the retriever later pulls these fragments, the language model hallucinates because the context is incoherent.
I spent three hours on Saturday morning writing a custom chunking strategy. Instead of splitting by token count, I split by document structure: each H2 section became a chunk, with the page title and H2 heading prepended as metadata. Sections longer than 1,500 tokens got split further at paragraph boundaries. Short consecutive sections got merged. The logic was ugly, full of edge cases, and it made a massive difference in answer quality.
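A simplified sketch of that structure-aware chunking looks something like this. The function and constant names are placeholders, and token counts are approximated by word count rather than a real tokenizer, so treat it as the shape of the logic rather than the exact weekend code:

```python
import re

MAX_TOKENS = 1500   # sections longer than this get split at paragraph boundaries
MIN_TOKENS = 100    # short consecutive sections get merged into the previous chunk

def approx_tokens(text: str) -> int:
    # crude stand-in for a real tokenizer
    return len(text.split())

def chunk_markdown_page(page_title: str, markdown: str) -> list[dict]:
    # Split on H2 headings; the text before the first "## " becomes its own chunk.
    sections = re.split(r"(?m)^## ", markdown)
    chunks: list[dict] = []
    for section in sections:
        if not section.strip():
            continue
        heading, _, body = section.partition("\n")
        source = f"{page_title} > {heading.strip()}"
        if approx_tokens(body) > MAX_TOKENS:
            # Oversized section: split further at paragraph boundaries.
            buf: list[str] = []
            for para in body.split("\n\n"):
                buf.append(para)
                if approx_tokens("\n\n".join(buf)) >= MAX_TOKENS:
                    chunks.append({"text": "\n\n".join(buf), "source": source})
                    buf = []
            if buf:
                chunks.append({"text": "\n\n".join(buf), "source": source})
        elif chunks and approx_tokens(chunks[-1]["text"]) < MIN_TOKENS:
            # Previous chunk was tiny: merge this section into it.
            chunks[-1]["text"] += "\n\n" + body
        else:
            chunks.append({"text": body, "source": source})
    # Prepend the page title and heading so the embedding carries that context.
    for chunk in chunks:
        chunk["text"] = f"{chunk['source']}\n\n{chunk['text']}"
    return chunks
```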
The lesson I keep coming back to: RAG quality is 70% data preparation and 30% everything else. If your chunks are bad, no amount of prompt engineering or model selection will save you.
Saturday Afternoon: Embeddings and the First Answers
With clean chunks ready, I created embeddings using OpenAI’s text-embedding-3-small model and stored them in ChromaDB. The process was straightforward. Load chunks, generate embeddings, store in the vector database. ChromaDB runs as an in-process library, no server setup required, which was perfect for a weekend prototype.
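The ingestion step, assuming a `chunks` list shaped like the output of the chunking sketch above, boils down to a few ChromaDB calls. Collection name and metadata fields are illustrative:

```python
import os
import chromadb
from chromadb.utils import embedding_functions

client = chromadb.PersistentClient(path="./wiki_index")
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-3-small",
)
collection = client.get_or_create_collection("wiki", embedding_function=openai_ef)

# `chunks` is the list of {"text": ..., "source": ...} dicts from the chunking step.
collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    documents=[chunk["text"] for chunk in chunks],
    metadatas=[{"source": chunk["source"]} for chunk in chunks],
)
```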
I wired up a basic retrieval chain: take the user’s question, embed it, find the five most similar chunks in ChromaDB, pass those chunks plus the question to GPT-4o-mini, and return the generated answer. The first test query was “How do I set up the staging environment?” and the chatbot returned a perfect, step-by-step answer pulled directly from our deployment docs.
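Stripped of the LangChain wrapper, the core retrieve-then-generate loop looks roughly like this. Prompt wording and names are illustrative, and `collection` is the ChromaDB collection from the ingestion step:

```python
from openai import OpenAI

llm = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer(question: str) -> str:
    # ChromaDB embeds the query with the collection's embedding function
    # and returns the five nearest chunks.
    results = collection.query(query_texts=[question], n_results=5)
    context = "\n\n---\n\n".join(results["documents"][0])

    response = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer the question using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```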
I genuinely felt a jolt of excitement. This thing actually worked.
Then I asked it “What is the refund policy?” and it confidently made up a refund policy that did not exist anywhere in our documentation. Classic hallucination. The retriever found vaguely related chunks about customer support workflows, and the model filled in the gaps with plausible-sounding fiction.
This is where the real debugging started.
Saturday Evening: Fighting Hallucinations
The hallucination problem consumed most of Saturday evening. The chatbot was great when the answer existed in the docs and terrible when it did not. It never said “I don’t know.” It always made something up.
The fix came in layers. First, I added a similarity score threshold. ChromaDB returns a distance score with each result, and I filtered out any chunk with a cosine distance above 0.4. If none of the retrieved chunks were sufficiently similar to the question, the chatbot would get zero context, which was a signal to abstain.
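As a sketch, the filter is just a comparison against the returned distances. The 0.4 cutoff is what worked for this corpus; it will shift with the embedding model and the data:

```python
DISTANCE_THRESHOLD = 0.4

def retrieve(question: str, k: int = 5) -> list[str]:
    results = collection.query(
        query_texts=[question],
        n_results=k,
        include=["documents", "distances"],
    )
    docs = results["documents"][0]
    distances = results["distances"][0]
    # Keep only chunks close enough to the question; an empty list is the
    # signal for the model to say it could not find the answer.
    return [doc for doc, dist in zip(docs, distances) if dist <= DISTANCE_THRESHOLD]
```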
Second, I rewrote the system prompt. The original was generic: “You are a helpful assistant that answers questions based on the provided context.” The revised version was specific and forceful: “Answer ONLY based on the provided context. If the context does not contain enough information to answer the question, say exactly: I could not find that information in our documentation. Do not guess. Do not infer. Do not use knowledge from outside the provided context.”
Third, I added source attribution. Every answer now included the document title and section heading where the information came from, formatted as clickable links back to the wiki. This served two purposes: users could verify the answer, and the model was less likely to hallucinate when it had to cite specific sources.
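One way to wire that up is to label each chunk with its source before it goes into the prompt, and render the same labels as links under the answer. This is an illustrative sketch; the wiki URL scheme here is hypothetical:

```python
WIKI_BASE_URL = "https://wiki.example.internal"  # hypothetical wiki URL scheme

def build_context(docs: list[str], sources: list[str]) -> str:
    # Label every chunk so the model can cite "Page > Section" in its answer.
    blocks = [f"[Source: {source}]\n{doc}" for doc, source in zip(docs, sources)]
    return "\n\n---\n\n".join(blocks)

def render_source_links(sources: list[str]) -> str:
    # Turn "Deployment Guide > Staging setup" into a clickable link back to the page.
    lines = []
    for source in sources:
        page = source.split(" > ")[0]
        lines.append(f"- [{source}]({WIKI_BASE_URL}/{page.replace(' ', '-')})")
    return "\n".join(lines)
```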
The combination of threshold filtering, strict prompting, and source attribution reduced hallucinations from roughly one in three answers to fewer than one in twenty. Not perfect, but usable for an internal tool. The Hugging Face RAG guide confirmed that this layered approach is the current best practice.
By 11 PM on Saturday, the chatbot answered questions reliably, admitted ignorance gracefully, and cited its sources. I went to sleep feeling like a genius.
Sunday: The Interface and the Demo
Sunday morning was about making the chatbot presentable. Streamlit’s chat components, specifically st.chat_message and st.chat_input, gave me a familiar messaging interface in about 40 lines of Python. I added conversation history so the chatbot remembered earlier messages in the same session, a loading spinner during generation, and a sidebar showing the retrieved source documents for transparency.
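The whole chat page is not much more than this sketch, assuming an `answer()` function like the one shown earlier. Titles and labels are illustrative, and the source sidebar is omitted for brevity:

```python
import streamlit as st

st.title("Wiki Chatbot")

if "messages" not in st.session_state:
    st.session_state.messages = []

# Replay the conversation so far.
for msg in st.session_state.messages:
    with st.chat_message(msg["role"]):
        st.markdown(msg["content"])

if question := st.chat_input("Ask a question about the docs"):
    st.session_state.messages.append({"role": "user", "content": question})
    with st.chat_message("user"):
        st.markdown(question)
    with st.chat_message("assistant"):
        with st.spinner("Searching the docs..."):
            reply = answer(question)
        st.markdown(reply)
    st.session_state.messages.append({"role": "assistant", "content": reply})
```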
The streaming implementation took longer than expected. Streamlit’s st.write_stream works with generator functions, but getting LangChain’s streaming callbacks to play nicely with Streamlit’s execution model required some fiddling. The key insight was using a callback handler that writes tokens to a queue, with the Streamlit frontend consuming from that queue. Responses now appeared word by word instead of all at once, which made the chatbot feel dramatically more responsive even though the total generation time was identical.
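The pattern, in sketch form: a LangChain callback handler pushes tokens onto a queue, a generator drains the queue, and `st.write_stream` consumes the generator. This shows the shape of the bridge rather than the exact handler from the app, and it assumes the underlying LLM was created with streaming enabled:

```python
import queue
import threading
from langchain_core.callbacks import BaseCallbackHandler

class QueueCallbackHandler(BaseCallbackHandler):
    """Pushes each generated token onto a queue as it arrives."""

    def __init__(self, q: queue.Queue):
        self.q = q

    def on_llm_new_token(self, token: str, **kwargs) -> None:
        self.q.put(token)

    def on_llm_end(self, response, **kwargs) -> None:
        self.q.put(None)  # sentinel: generation finished

def stream_answer(chain, question: str):
    q: queue.Queue = queue.Queue()
    handler = QueueCallbackHandler(q)
    # Run the chain in a background thread so this generator can yield tokens
    # to Streamlit as they arrive.
    thread = threading.Thread(
        target=chain.invoke,
        args=({"query": question},),
        kwargs={"config": {"callbacks": [handler]}},
    )
    thread.start()
    while (token := q.get()) is not None:
        yield token
    thread.join()

# In the Streamlit app:
# reply = st.write_stream(stream_answer(qa_chain, question))
```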
| Component | Tool Used | Time Spent | Difficulty |
|---|---|---|---|
| Document ingestion | Custom Python + LangChain loaders | 3 hours | High – messy source data |
| Embeddings | OpenAI text-embedding-3-small | 30 min | Low – straightforward API |
| Vector store | ChromaDB (in-process) | 30 min | Low – zero config |
| Retrieval chain | LangChain RetrievalQA | 2 hours | Medium – tuning parameters |
| Hallucination fixes | Prompt eng + score thresholds | 2.5 hours | High – iterative testing |
| Web interface | Streamlit chat components | 2 hours | Medium – streaming tricky |
| Testing and polish | Manual QA, 50 test questions | 1.5 hours | Medium – edge cases |
Sunday afternoon was testing. I wrote a list of 50 questions that covered common support requests, edge cases, and questions the docs did not answer. The chatbot correctly answered 41 out of 50, gracefully declined 6 that were out of scope, and gave partially wrong answers on 3 where the relevant documentation was ambiguous. An 82% success rate on a weekend prototype felt more than acceptable.
I deployed the Streamlit app to an internal server, shared the link with my team at 4 PM on Sunday, and waited. By Monday morning, twelve people had used it. The first Slack message I got was: “This is going to save me so much time.” The second was: “It gave me a wrong answer about the API rate limits.” Both reactions were exactly what I expected.
What I Would Do Differently
Looking back after running this chatbot in production for several weeks, there are changes I would make if I started over.
Use a hybrid retrieval strategy from the start. Pure vector similarity search misses exact keyword matches that matter in technical documentation. If someone asks for “error code E-4021,” embedding similarity might not surface the right page. Adding BM25 keyword search alongside vector search, a pattern called hybrid retrieval, catches these cases. ChromaDB now supports this natively.
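One way to bolt this on with LangChain is an `EnsembleRetriever` that merges a BM25 retriever with the existing Chroma retriever. The weights and `k` below are illustrative starting points, not tuned values:

```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

texts = [chunk["text"] for chunk in chunks]  # the same chunks that were embedded

bm25 = BM25Retriever.from_texts(texts)  # exact keyword matches, e.g. "E-4021"
bm25.k = 5

vector_store = Chroma(
    collection_name="wiki",
    persist_directory="./wiki_index",
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-small"),
)
semantic = vector_store.as_retriever(search_kwargs={"k": 5})

# Merge the two result lists with weighted reciprocal rank fusion.
hybrid = EnsembleRetriever(retrievers=[bm25, semantic], weights=[0.4, 0.6])
docs = hybrid.invoke("How do I fix error code E-4021?")
```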
Invest more time in evaluation. My 50-question test set was a good start but insufficient. I should have built an automated evaluation pipeline that runs on every code change, using an LLM-as-judge pattern to score answer quality against reference answers. Manual testing does not scale, and regressions sneak in when you change chunking or prompting.
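A minimal LLM-as-judge loop is only a few lines. The prompt wording, scoring scale, and pass threshold here are illustrative rather than recommended values:

```python
from openai import OpenAI

judge = OpenAI()

JUDGE_PROMPT = """You are grading a documentation chatbot.
Question: {question}
Reference answer: {reference}
Chatbot answer: {answer}
Score the chatbot answer from 1 (wrong) to 5 (fully correct and grounded in the reference).
Reply with a single integer."""

def judge_answer(question: str, reference: str, answer_text: str) -> int:
    response = judge.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, answer=answer_text)}],
    )
    return int(response.choices[0].message.content.strip())

# Run on every change, e.g. in CI:
# scores = [judge_answer(q, ref, answer(q)) for q, ref in test_set]
# assert sum(s >= 4 for s in scores) / len(scores) >= 0.8
```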
Consider local models earlier. OpenAI’s API is convenient, but it means every query sends internal documentation content to a third-party server. For an internal wiki that contains architecture details, credential references, and incident reports, that is a real concern. Running a local model through Ollama adds deployment complexity but eliminates the data privacy question entirely. For the embedding model specifically, switching to a local option like nomic-embed-text costs nothing after setup and removes the per-query API expense.
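Swapping in local models is mostly a matter of changing two constructors, for example via the langchain-ollama package. The model names below are illustrative and assume the models have already been pulled locally:

```python
from langchain_ollama import ChatOllama, OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text")  # local, no per-query cost
llm = ChatOllama(model="llama3.1", temperature=0)        # documentation never leaves the network

query_vector = embeddings.embed_query("How do I set up the staging environment?")
reply = llm.invoke("In one sentence, what is a staging environment?")
print(reply.content)
```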
Do not underestimate the conversation memory problem. My chatbot stores full conversation history in memory, which works for short exchanges but breaks down during long debugging sessions. After about 15 messages, the context window fills up and older messages get silently dropped. A proper solution uses ConversationSummaryMemory or a sliding window with summarization of older turns.
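A hand-rolled sliding window with summarization is one alternative to LangChain’s memory classes. A sketch, with the window size and summarization step left as placeholders:

```python
WINDOW = 10  # keep the last 10 messages verbatim

def compact_history(messages: list[dict], summarize) -> list[dict]:
    """Replace older turns with a one-message summary once the history grows."""
    if len(messages) <= WINDOW:
        return messages
    older, recent = messages[:-WINDOW], messages[-WINDOW:]
    summary = summarize(older)  # e.g. one LLM call: "Summarize this conversation so far"
    return [{"role": "system", "content": f"Summary of earlier turns: {summary}"}] + recent
```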
The total cost of the weekend project was about $4.50 in OpenAI API charges. The ongoing cost is roughly $30 per month for the embedding and completion API calls from the team’s usage. That is less than the cost of one hour of my time spent answering Slack questions, which this chatbot has reduced by about 80%.
Frequently Asked Questions
How much does it cost to build and run a chatbot like this?
The weekend prototype cost $4.50 in API charges. Ongoing monthly costs depend on usage volume. For a team of 20 people using the chatbot regularly, expect roughly $20-40 per month using GPT-4o-mini for generation and text-embedding-3-small for embeddings. Switching to a local model through Ollama eliminates per-query costs entirely after an initial hardware investment. ChromaDB is free and open source. Streamlit hosting is free for internal deployments. The biggest hidden cost is maintaining the knowledge base: keeping documents updated so the chatbot does not serve stale information.
Do you need machine learning experience to build something like this?
Not for a RAG chatbot like this one. You need solid Python skills and familiarity with APIs, but you are not training models or writing neural network code. LangChain and similar frameworks abstract away the machine learning entirely. The hardest parts of this project were data cleaning and prompt engineering, neither of which requires ML knowledge. That said, understanding how embeddings work and why vector similarity search sometimes fails will help you debug problems faster. A few hours reading about how transformers generate text will give you useful intuition even if you never train a model yourself.
How do you stop the chatbot from hallucinating?
Layer multiple defenses. Start with good chunking so the retriever returns coherent context. Set a similarity score threshold so low-relevance chunks get filtered out. Write a strict system prompt that explicitly instructs the model to say “I don’t know” when the context is insufficient. Add source attribution so the model has to cite where its answer came from. Finally, consider a re-ranking step that scores retrieved documents before passing them to the LLM. In my experience, strict prompting combined with score thresholds eliminates 80-90% of hallucinations. The remaining cases usually come from ambiguous documentation rather than model failure.