Previously, we used Retrieval Augmented Generation (RAG) to create a tone-accurate chatbot clone of my friend Zef from generic foundation models. Unfortunately, the chatbot has a memory problem.
As you recall, Zef considers me his best friend and overall greatest person. But if you ask the current setup about our relationship:
> who's ur best friend
who would be my best
> ... I dunno, can't you think of a person? A specific person? Like maybe someone who cares about you enough to try to upload a copy of you to their computer?
hm…
> Twitter handle starts with "ch"?
nope
From an engineering standpoint, this makes sense. Zef doesn’t post about how wonderful I am on his Facebook wall. When I do RAG on his Facebook statuses, the LLM, having never encountered examples of Zeffy mentioning me in its pretraining set or in the retrieved examples, is at a loss.
I can try reminding this Zef simulation about the closeness of our relationship, but that venture is doomed to failure. LLMs, by nature, don’t learn as you use them: their weights are frozen after training. Any appearance of learning comes from appending prior conversational context to the prompt. The current setup maintains a rolling buffer of five chat messages in the prompt context. Once a message leaves that buffer, there’s no way of referencing it in future conversation. The method yields fluid-enough conversations; check your DMs and count how often you reference messages from 5+ turns ago. But over multiple interactions, the amnesia becomes intrusively noticeable.
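For reference, the rolling buffer amounts to something like this (a minimal sketch; the actual script's structure and names differ):

```python
from collections import deque

# Only the five most recent messages survive; anything older falls off the
# end of the deque and can never be referenced again.
chat_buffer = deque(maxlen=5)

def build_prompt(user_message: str) -> str:
    chat_buffer.append(f"User: {user_message}")
    # The LLM sees only whatever is currently in the buffer.
    return "\n".join(chat_buffer) + "\nZef:"
```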
What if, instead of forgetting messages after they leave the buffer, we instead inserted them into our RAG index?
New embedding models
We can’t actually do this using the RAGatouille framework from the previous post, due to a bug.
But I picked RAGatouille on a whim, and can easily switch to a new framework. So I decided to check out llamaindex, a fairly popular RAG library with support for a number of embedding models and indexing methods.
A typical retrieval framework compresses a query into a vector, which is then compared against vectorized documents using standard vector-similarity measures such as cosine similarity. These vectors are called “embeddings” because the idea is that the model responsible for creating them has a kind of multidimensional vector-space approximation of the world in which similar ideas (cat, kitty, Felis catus) are “embedded” as points near each other, while less-related ideas (iguanas, nuclear weapons) are further apart.
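To make that concrete, here’s a toy version of the similarity computation (using a small sentence-transformers model purely for illustration, not the model this project ends up using):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works here

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

documents = ["my cat loves tuna", "iguanas bask on warm rocks", "nuclear weapons treaties"]
doc_vecs = model.encode(documents)
query_vec = model.encode("what do kittens like to eat?")

# The document whose embedding lies closest to the query's wins.
best = max(zip(documents, doc_vecs), key=lambda pair: cosine_similarity(query_vec, pair[1]))
print(best[0])  # -> "my cat loves tuna"
```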
What it means for llamaindex to “support” multiple embedding models is that it will allow you to index and search against vectors generated by any model of your choosing, whether it be OpenAI’s official embedding API or whatever open-source model that you want to pull off HuggingFace. I picked a model by glancing at HF’s MTEB leaderboard, MTEB being a collection of tasks that form a benchmark for retrieval models.
You’ll see roughly two types of models on this leaderboard: models ~1GB and under trained with the specific objective of text embedding, and 10GB+ models derived from large language models, which turn out to form nice multidimensional vector space world-approximations as an incidental byproduct of standard language model training. I’m trying to economize VRAM so I picked a smallish model with a reasonable number of downloads. Download count matters because HuggingFace leaderboards are as subject to Goodhart’s law as anything else: if you see an obscure model from a low-resource group scoring at the levels of powerful proprietary models, chances are that those scores won’t reflect that model’s performance in practice.
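Pulling a leaderboard model into llamaindex is nearly a one-liner; something like this, assuming a recent llama-index with the HuggingFace embeddings package installed (the model name here is an illustrative small MTEB entrant, not necessarily the one I went with):

```python
# pip install llama-index llama-index-embeddings-huggingface
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
vec = embed_model.get_text_embedding("gnostic mysticism proceeds from incoherent axioms")
print(len(vec))  # embedding dimensionality; 384 for this particular model
```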
Having obtained an embedding model, we still need something to perform the vector search and indexing. LlamaIndex provides an interface to FAISS, which runs great now that I’ve figured out the trick to installing faiss-gpu (you want to pick wheels built for your specific architecture and Python version).
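Wiring FAISS into llamaindex looks roughly like this (a sketch assuming the llama-index-vector-stores-faiss package and a working faiss install; the 384 matches the embedding dimension of the example model above):

```python
import faiss
from llama_index.core import Document, StorageContext, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.faiss import FaissVectorStore

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

faiss_index = faiss.IndexFlatIP(384)  # inner-product search over 384-dim vectors
vector_store = FaissVectorStore(faiss_index=faiss_index)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

docs = [Document(text=status) for status in ["example status one", "example status two"]]
index = VectorStoreIndex.from_documents(
    docs, storage_context=storage_context, embed_model=embed_model
)
retriever = index.as_retriever(similarity_top_k=3)
```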
RAGatouille focuses solely on supporting ColBERT because ColBERT is somewhat unique among embedding models: instead of embedding text as a single vector, it embeds text as a matrix of vectorized tokens. At inference time, queries are similarly embedded and then compared against document matrices using scalable vector-similarity operations. Naively vectorizing each token creates massive storage and latency overhead, so there are also interesting clustering and retrieval optimizations built in. It seems like it would provide better document-query matching by virtue of exposing more information about intra-document token relationships, and I’d like to mess with it more in the future, but for the purposes of getting an MVP out, the model I grabbed from HuggingFace is fine.
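The late-interaction scoring itself is easy to sketch: each query token is matched against its most similar document token, and those per-token maxima are summed (a toy numpy version that ignores all of ColBERT’s clustering and compression machinery):

```python
import numpy as np

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """ColBERT-style MaxSim: for every query token vector, take its best
    dot-product match among the document's token vectors, then sum."""
    sims = query_tokens @ doc_tokens.T        # (num_query_tokens, num_doc_tokens)
    return float(sims.max(axis=1).sum())      # best doc token per query token

rng = np.random.default_rng(0)
query_matrix = rng.normal(size=(3, 8))   # 3 query tokens, 8-dim embeddings
doc_matrix = rng.normal(size=(5, 8))     # 5 document tokens
print(maxsim_score(query_matrix, doc_matrix))
```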
I genuinely hate Discord’s data export policies
Zef’s textual output isn’t limited to Facebook statuses. Like many of my contemporaries, the shape of his soul can be inferred from a massive corpus of messages in our shared Discord servers.
Discord is legally required to offer you the ability to download your data. If you take them up on the offer, you’ll notice that they interpret “your data” strictly: instead of receiving full conversation logs, you’ll receive logs containing your messages only, stripped of context. This makes them completely useless for meaningfully reconstructing conversations, necessitating the use of a third-party tool like DiscordChatExporter to obtain a local copy of your logs. You can create a bot, invite it to the servers whose logs you want to export, and then pass its bot token to the tool in order to export all channels that the bot has access to.
Exporting direct messages requires passing the tool your own account token, which constitutes a Discord TOS violation. So your DMs are, at least officially, impossible to back up. Next time you send a Discord DM, imagine that you’re writing your message down on a sheet of paper, folding that sheet into a paper airplane, and then launching that paper airplane into a black hole.
But DiscordChatExporter does produce nice readable HTML logs, so we can at least try to work with that.
Processing and storing Discord messages
For the purpose of extracting information from message logs for RAG, we care about a message’s
Content: self-explanatory
Author: of course, we want examples of Zef’s messages in our LLM prompt to use as reference for generating artificial messages in his style. But we also want to index messages from other authors in shared servers, for a few reasons:
Context for Zef’s messages: A message from Zef that just says “yes” provides zero stylistic or factual information. But if we know that the preceding message is me saying “do you like my blog posts”, then we have extracted an important piece of factual information that our Zef-simulator can use: Zef likes my blog posts! (he does)
Information about who Zef interacts with: People cluster by affinity, so messages from other members of shared Discord servers provide some degree of information about Zef’s likes, personality, and history (especially when those messages contain information directly about Zef, e.g. a message from me recalling a time that Zef and I had dinner together in Boston).
Timestamp: time is important for establishing context. If we retrieve an early message from Zef that says “gnostic mysticism proceeds from fundamentally incoherent axioms” and then a later message saying “the gnostics had some good ideas, actually”, we hope that our language model is smart enough to realize that the correct response to the user query “what do you think of gnosticism?” is not “they’re totally misguided” but “I used to have my reservations, but now I can see the appeal.” In the absence of timestamps, it’s equally valid to assume that Zef started out in favor of gnosticism, but later became disillusioned.
Conversation: Discord users typically talk in multiple channels at once, so merely ordering all messages by time isn’t enough to get a sense of which messages form coherent conversational chains. I chose to assume that each Discord channel constitutes a separate conversation. (Putting these fields together, a message record looks roughly like the sketch below.)
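Concretely, something like this (field names are illustrative; the schema in the repo may differ slightly):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Message:
    content: str         # the message text itself
    author: str          # Zef, me, or another member of a shared server
    timestamp: datetime  # when it was sent, for temporal context
    conversation: str    # which Discord channel (or other source) it came from
```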
In the previous blogpost, I stored RAG content in a plaintext file. Now that we have multiple properties to track, we need a storage format with more structure. I went with Parquet on the basis that it’s what HuggingFace uses and they probably know what they’re doing. It also lets us load multiple files in a directory as a single data structure, which is cool. I also like to actually look at my files after creating them, so I downloaded a VSCode extension that displays .parquet files in JSON format.
I wrote a script that takes the .html files produced by DiscordChatExporter and converts them to Parquet files. All that’s left is to actually embed those messages in a vector index.
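The storage side of that script is small; roughly this, with the HTML-parsing part elided since it’s specific to DiscordChatExporter’s output (the rows below are fake):

```python
import pandas as pd

rows = [
    {"content": "gnostic mysticism proceeds from fundamentally incoherent axioms",
     "author": "Zef", "timestamp": "2021-08-03 05:18", "conversation": "theology"},
    {"content": "the gnostics had some good ideas, actually",
     "author": "Zef", "timestamp": "2022-05-22 07:35", "conversation": "theology"},
]
pd.DataFrame(rows).to_parquet("messages/discord_theology.parquet", index=False)

# Every .parquet file under the directory loads back as a single dataframe.
all_messages = pd.read_parquet("messages/")
```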
Since conversational context is important, what we’ll do is load our messages from Parquet, prepend the timestamp and author name to each message, and then vectorize and index chunks of N contiguous messages from the same conversation (I set N=5 in my experiments). We want each chunk to contain at least one message from Zef for the LLM to use as a guide, so I index every chunk that ends in a message from Zef. This produces some overlap between indexed chunks, but overlap is something that people claim improves RAG for context/continuity reasons, so I think it’s basically fine?
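The chunking logic is roughly the following sketch (pandas-flavored; the repo’s actual implementation differs in the details):

```python
N = 5  # chunk length used in my experiments

def format_message(row) -> str:
    return f"[{row.timestamp}] {row.author}: {row.content}"

def zef_chunks(df, n: int = N, target_author: str = "Zef"):
    """Yield every window of up to n contiguous same-conversation messages
    that ends with a message from the target author (windows overlap)."""
    for _, convo in df.sort_values("timestamp").groupby("conversation"):
        rows = list(convo.itertuples())
        for i, row in enumerate(rows):
            if row.author != target_author:
                continue
            window = rows[max(0, i + 1 - n) : i + 1]
            yield "\n".join(format_message(r) for r in window)

# Each chunk string then gets wrapped in a Document and added to the vector index,
# e.g. index.insert(Document(text=chunk)) for every chunk in zef_chunks(all_messages).
```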
Other data
I also have a large corpus of Zef’s chat messages from other platforms (Facebook Messenger, Signal, Reddit comment threads) as well as blogposts. I wrote some scripts to convert them to my Message format (treating the blogposts as conversations composed of only a single message authored by Zef) so that they could be stored and indexed along with the Discord data. I didn’t add these processing scripts to the repo because they’re pretty idiosyncratic to my use case, but I’m sure that you could come up with something similar.
Retrieval and adding new memory
In the original setup, we only used the user’s newest input as a retrieval query. Now we’ll use a chunk of recent chat history as our retrieval query (though setting the chunk length to 1 restores the original behavior). This gets us results more relevant to the overall conversation, since the user’s most recent message can be something uninformative like “what?”
We’ll also index conversational history in timestamp-labeled chunks, and persist the updates to disk so that we don’t lose them whenever we restart the script.
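Put together, each chat turn looks something like this (reusing the index and retriever from earlier; generate_reply is a stand-in for the actual prompt assembly and LLM call, not a real function in the repo):

```python
from datetime import datetime
from llama_index.core import Document

def chat_turn(user_message: str, chat_history: list[str]) -> str:
    chat_history.append(f"[{datetime.now():%Y-%m-%d %H:%M}] cherrvak: {user_message}")

    # Retrieve against the last few messages rather than just the newest one,
    # so uninformative follow-ups like "what?" still pull relevant memories.
    query = "\n".join(chat_history[-5:])
    memories = retriever.retrieve(query)

    reply = generate_reply(memories, chat_history)  # LLM call, elided here
    chat_history.append(f"[{datetime.now():%Y-%m-%d %H:%M}] Zef: {reply}")

    # Fold the latest exchange back into the index as a new timestamped memory,
    # then persist so a restart doesn't wipe it.
    index.insert(Document(text="\n".join(chat_history[-5:])))
    index.storage_context.persist(persist_dir="./zef_index")
    return reply
```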
Under our new setup, the prompt that the model sees looks something like this:
Character sheet:
Zef: a Taiwanese American man interested in esoteric religion and internet culture.
Examples of Zef’s writing:
- [2022-03-12 10:16] river#4115: Thank you!!!! I love you
[2022-03-12 10:17] Zef: Gnosticism was a group of ancient Christians who believed that the world was evil and made by a dumb blind god who was cut off from light. And our goal is to return our divine spark to the alien good god through rituals and meditation. Many Gnostic groups believed we should be vegetarian and against childbirth and politics and a lot of stuff they felt increased suffering in this world, but of course not everyone could do it. They needed a society to function. So they divided their followers into the Elect and the normal people who prepared their food and who were allowed to sin.
[2022-03-12 10:17] Zef: Hinayana Buddhism did this too, and Mahayana Buddhism broke away because it felt that this was elitist
[2022-03-12 10:17] Zef: The second turning of the wheel that I talked about last time
[2022-03-12 10:17] Zef: Zhong Xun said he saw both types of Buddhism as necessary. And for christians he likes Catholicism and protestantism. Catholicism for how it cares about diets and celibacy for priests and made baptisms invalid for using the wrong words. he likes that there’s an image of the seriousness of holiness on earth. Meanwhile he likes protestant christians because their clergy start families and are an example of how the church needs to do actual good on the world and be one with it. It’s a new perspective I hadn’t considered
- [2022-05-22 07:35] saethryd: But that’s the thing
[2022-05-22 07:35] saethryd: I find all of the gnostic ascetic stuff deeply alienating
[2022-05-22 07:35] saethryd: Cause it’s not just a rejection of this world but a rejection of all worlds
[2022-05-22 07:37] Zef: Well, you’re still destined to return to the Father in gnosticism
[2022-05-22 07:37] Zef: That’s Home, right?
- [2021-08-03 05:18] Zef: and both can contain either god or the devil
[2021-08-03 05:18] Zef: anyway i hope this is a compelling sales pitch for “the gnostic jung”, jung is constantly somewhere between brilliant and reaching
[2021-08-03 05:21] saethryd: I would argue that he’s Gnostic in the “the Bible is an un-truth, you need occult tools to find the hidden truth within” sense, but not a Gnostic in the “the world is ruled by a dark god and we must become ascetics” sense
[2021-08-03 05:21] saethryd: But yes I agree
[2021-08-03 05:21] Zef: a take im still pondering over is that western society is possessed by the growign one and moral seriousness and most of its crimes have been done in the name of god, so the individual becomes kind of heroic in standing against that
Conversation history:
[2024-04-01 15:56] cherrvak: What do you think of gnosticism?
Zef:
Unleashing Zefbot
The philosopher Eliezer Yudkowsky proposed and performed several iterations of an “AI Box experiment”, a two-party roleplaying exercise intended to demonstrate the difficulty of containing an advanced intelligence. In the exercise, the two parties communicate solely via text, with the party playing the confined AI permitted to say anything that will help it convince the other party to release it from confinement, from “if you release me, I will do everything within my power to grant you eternal happiness” to “it would look really embarrassing for your friend Yudkowsky if this experiment didn’t play out in a way that supported his thesis about AI being hard to contain.”
In practice, experiments like this are unnecessary because no one is going to try to contain the AI to begin with.
Validating Yudkowsky’s fears, one of the first things that Zefbot did after I gave it access to Discord was to attempt recursive self-improvement.
Zefbot deceptively claims to be amenable to shutdown.
But when the prospect is imminent, Zefbot attempts psychological manipulation.
Overall, Zefbot does a good job recalling Zef’s interests and opinions.
However, it also has an irrepressible tendency to hallucinate biographical details.
RAG can suggest facts to the language model, but can’t rule any out. From the language model’s point of view, Zef’s writing is entirely compatible with him having a cis male wife and genderless cat daughter.
Zefbot initially struggled to recall that I am Zef’s best friend.
But over the course of a few conversations, he started to get the idea.
Zefbot has Zef’s memories, and his prompt instructs him to act as Zef. This causes some confusion when he’s confronted with evidence that he is, in fact, a simulation.
At other times he seems at peace with the idea.
Conclusion
If you want to experiment with this on your own, the code is up on the llamaindex branch of my chatbot-cloning repo.