Warning: This post assumes you know what GPTs and LLMs are, and that you’ve perhaps used one, such as the GPT models from OpenAI or those on Hugging Face.
This isn’t a GPT and LLM article for the sake of it; I recently built a language chain for a software documentation question-and-answer bot. My head hurt from banging it on a table, but I had some reasonable results. The biggest insight, despite feeling smug that it worked, was that the way we write documentation today is set to change for good because of what GPT, ChatGPT and LLMs have already done to the way software engineers think about life. GPT has already impacted the software industry whether you like it or not. I’ve got all sorts of horrible views about how software will become like homeopathy: skills watered down in favour of highly celebrated mediocrity.
Building a Q&A Bot
GPT models have been pre-trained on lots of inputs, typically in the order of billions. Commercial vendors don’t share what they were trained on, but an LLM like GPT-4 from OpenAI can not only learn new skills, it can generate answers to questions it’s maybe never been asked before, all within the confines of a context. Scary, eh? There is a catch though. You cannot retrain the LLM, and many of these models are closed boxes; you can provide inputs and you get an output, but you can’t poke it with a hammer to change the innards permanently. Getting GPT to be a document bot therefore becomes a challenge of information management.
As a rough guide to building a doc bot: step one is data ingestion; find your document source and split it into semantically meaningful chunks. Step two is to embed those chunks and run a vector similarity search against the user’s query, returning a sorted list of the most relevant chunks. The final step is to provide the most relevant chunks along with the input query to GPT, and if you’ve done a good enough job, GPT will give you a friendly chat-style response. Simple! But no.
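The three steps above can be sketched in miniature. Everything here is a toy: the `embed` function is a bag-of-words stand-in for a real embedding model, and the final GPT call is left as a prompt string rather than an actual API call.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; a real bot would call an
    # embedding model (e.g. a vendor API) here instead.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Step 1: ingest and split the docs into chunks (here, by paragraph).
docs = ("Install with pip.\n\n"
        "Configure the API key in settings.\n\n"
        "Run the server with the serve command.")
chunks = [c.strip() for c in docs.split("\n\n")]

# Step 2: rank chunks by similarity to the query.
query = "how do I configure the api key"
ranked = sorted(chunks, key=lambda c: cosine(embed(query), embed(c)),
                reverse=True)

# Step 3: the top chunks plus the query are what you'd send to the LLM.
prompt = f"Answer using only this context:\n{ranked[0]}\n\nQuestion: {query}"
print(ranked[0])
```

With a real embedding model the similarity is semantic rather than word-overlap, but the shape of the pipeline is the same.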
LLMs have no memory, but they do have the notion of a context, which has a maximum size. For instance, GPT-3.5 has a context of about 4K tokens (a token is about 0.75 of a word). Within that context you can provide chat history, or information like chunks of text for the LLM to read. Given the LLM will (probably) already have knowledge of your product, provided it existed before the model was trained, the supplied chunks give it context for specific answers, and if you’re loose with the inference settings, responses may have some synthesis (a la, made-up tosh!).
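Because everything has to fit in that one window, prompt assembly becomes a packing problem: chat history, retrieved chunks and the query all share the budget, with some space held back for the reply. A minimal sketch, where the function names and the crude word-to-token heuristic are my own, not any vendor’s API:

```python
def build_prompt(history, chunks, query, max_tokens=4096, reserve=500):
    """Greedily pack retrieved chunks, then chat history, into the window."""
    def toks(s):
        # Rough heuristic from the post: a token is about 0.75 of a word.
        return int(len(s.split()) / 0.75)

    budget = max_tokens - reserve - toks(query)   # leave room for the answer
    parts = []
    for piece in chunks + history:                # chunks first: most relevant
        if toks(piece) <= budget:
            parts.append(piece)
            budget -= toks(piece)
    return "\n\n".join(parts + [query])

prompt = build_prompt(
    history=["User: hi there", "Bot: hello, ask me about the docs"],
    chunks=["Install the tool with pip.", "Set the API key in settings."],
    query="How do I install the tool?",
)
print(prompt)
```

A production bot would use a real tokenizer rather than a word count, but the trade-off is identical: every token of history or context is a token the answer can’t use.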
The glaring issue is the size of the context. If I have a document base of 289 pages split into 1,897 chunks of 1K tokens each, that’s a whole lot of search space, which is why a vector search to narrow things down is important. After the initial vector search, GPT has an easier time of things, and you can always make more than one interaction with a GPT, using it to amalgamate and sort answers.
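To see why the narrowing matters, here’s the back-of-envelope budget for that example, using the rough words-to-tokens rule from earlier:

```python
CONTEXT = 4096    # GPT-3.5 context window, in tokens
CHUNK = 1000      # tokens per chunk
N_CHUNKS = 1897   # chunks in the 289-page doc base

# The whole corpus is vastly bigger than one context window.
total_tokens = N_CHUNKS * CHUNK
print(total_tokens)           # ~463x the window size

# After the vector search, keep only the top-k chunks:
top_k = 3
prompt_tokens = top_k * CHUNK
answer_budget = CONTEXT - prompt_tokens   # left for the query and the reply
print(answer_budget)
```

That leftover ~1K tokens has to hold both the question and the model’s answer, which is why the chunk count and chunk size matter so much.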
The Importance of Text Splitting
Putting to one side the impressive nature of GPTs, the big takeaway for building things like documentation bots is that descriptions in paragraphs need to align to quantifiable chunk boundaries that carry some form of hermetically sealed meaning. That last part might not make much sense, but if an important description crosses multiple chunks, the vector search might not yield anything useful for the LLM to read. Make sense? For GPT-3.5-Turbo from OpenAI, that means chunks of about 1K tokens, because you want to provide more than one chunk of text for a given answer, typically 3 or 4. You also have to give the GPT space to return an answer, so thinking about writing documentation without fluff, repetition or long drawn-out explanations is important. As with everything in software, there are trade-offs! In this case, it’s controlling the cost of token consumption without losing efficacy. In summary: write succinct, clearly defined, densely packed, fluff-less chunks of text.
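One way to keep chunks “hermetically sealed” is to split on paragraph boundaries and only seal a chunk when adding the next paragraph would blow the token budget. A sketch, again leaning on the rough 0.75-words-per-token rule (real splitters use a proper tokenizer and often overlap chunks):

```python
def rough_tokens(text):
    # Rough heuristic from the post: a token is about 0.75 of a word.
    return int(len(text.split()) / 0.75)

def split_by_paragraph(text, max_tokens=1000):
    """Pack whole paragraphs into chunks of at most max_tokens."""
    chunks, current = [], []
    for para in text.split("\n\n"):
        candidate = "\n\n".join(current + [para])
        if current and rough_tokens(candidate) > max_tokens:
            # Next paragraph would overflow: seal the current chunk.
            chunks.append("\n\n".join(current))
            current = [para]
        else:
            current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks

doc = "\n\n".join(["word " * 300] * 5)   # five ~300-word paragraphs
chunks = split_by_paragraph(doc)
print([rough_tokens(c) for c in chunks])
```

Because paragraphs are never cut mid-stream, each chunk stays a self-contained unit of meaning, which is exactly what the vector search needs.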
With the imminent general availability of GPT-4 and its 8K and 32K context sizes, this issue is alleviated, but at a much (much) higher cost: over 10x. Beginner-level documentation is likely to suffer from this style of information management, because it’s always wordy with explanations. Time to improve your writing skills!
The pendulum always swings, however, and for documentation bots with large search spaces, hosting your own commercial or open-source model with pre-trained weights will probably save you money in the long run. Just because OpenAI’s GPT-4 is the cool kid on the block, it doesn’t mean there aren’t alternatives.
TL;DR: write documentation that aligns roughly to 750-word paragraphs, keep definitions and descriptions clear, and maintain internal links. If you’ve split your text correctly, an agent program can follow them and build up larger answers.
They took our jobs!
No, you can sleep at night. Simple tasks will surely be dealt with by LLMs, but you can carry on with your primary set of worries! Watch this great talk by Sébastien Bubeck to put your mind at ease. Here’s the accompanying paper if you’re that way inclined: https://arxiv.org/abs/2303.12712