What am I seeing here (in just one post):
- Others sharing their own experiments
- People asking for more details on OP’s project, OP answering questions about their approach
- Other repositories (AutoRAG) inspired by the OP’s problem space
- Others filling in gaps that OP identified as limitations of their work
- OP being pointed to other repositories, papers, and ideas that might make their project better.
This post does ask for feedback, but it is also much like spc and pe in that the OP is sharing trials and errors for a very specific but nascent task: doing RAG on an Obsidian vault and technical documents.
OP shared:
- Documents work best when they can be parsed, like markdown, HTML, or docx
- Splitting by logical blocks like headers improved quality, but it makes chunk sizes uneven (see the chunking sketch after this list)
- Metadata should be included in the chunks, like the document name and higher-level logical blocks, with special chunking for code blocks
- Add metadata to the vector store as well, like chunk size, document path, etc.
- You can tune chunk size by testing several sizes and selecting the one that achieves the best score.
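Pulling those points together, a minimal sketch of header-based chunking with per-chunk metadata might look like the following; the function name, metadata fields, and header format here are illustrative assumptions, not OP’s actual code:

```python
import re

def chunk_markdown(text: str, doc_name: str, doc_path: str) -> list[dict]:
    """Split a markdown document on headers and attach metadata to each chunk."""
    # Split on markdown headers (lines starting with 1-6 '#'), keeping each header
    # together with the text that follows it.
    pieces = re.split(r"(?m)^(?=#{1,6}\s)", text)
    chunks = []
    for piece in pieces:
        body = piece.strip()
        if not body:
            continue
        header = body.splitlines()[0].lstrip("#").strip() if body.startswith("#") else ""
        chunks.append({
            # Prepend document name and section so the metadata is embedded with the text.
            "text": f"Document: {doc_name}\nSection: {header}\n\n{body}",
            "metadata": {
                "document": doc_name,
                "path": doc_path,
                "section": header,
                "chunk_size": len(body),  # sizes vary when splitting on headers
            },
        })
    return chunks
```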
OP mentions “Let me know if the approach above makes sense or if you have suggestions for improvement. I would be curious to know what other tricks people used to improve the quality of their RAG systems.”
In the comments, people indeed share their experiences with implementing RAG systems. One says they weren’t impressed by the super basic approaches and instead tested embedding question-and-answer pairs. Others note this adds to the cost of building the index, but that it helps a lot: they simply ask GPT-4 to turn the information in documents into many Q&A pairs and do similarity search on the questions. This works on a more static knowledge base.
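A rough sketch of that Q&A-pair indexing idea, assuming the OpenAI Python client; the prompt wording, model names, and JSON format are assumptions on my part, not what the commenter actually ran:

```python
import json
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def build_qa_index(documents: list[str]) -> list[dict]:
    """Turn each document into Q&A pairs and embed only the questions."""
    pairs = []
    for doc in documents:
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{
                "role": "user",
                "content": "Generate question/answer pairs covering the facts in this "
                           'document. Reply as a JSON list of {"q": ..., "a": ...}:\n\n' + doc,
            }],
        )
        # In practice you would validate/repair the model's JSON before loading it.
        pairs.extend(json.loads(resp.choices[0].message.content))
    questions = [p["q"] for p in pairs]
    embs = client.embeddings.create(model="text-embedding-3-small", input=questions)
    for p, e in zip(pairs, embs.data):
        p["embedding"] = np.array(e.embedding)
    return pairs

def answer(query: str, index: list[dict]) -> str:
    """Similarity search against the question embeddings, return the paired answer."""
    q = np.array(client.embeddings.create(
        model="text-embedding-3-small", input=[query]).data[0].embedding)
    best = max(index, key=lambda p: float(
        q @ p["embedding"] / (np.linalg.norm(q) * np.linalg.norm(p["embedding"]))))
    return best["a"]
```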
People also share papers on novel RAG methods like HyDE, and OP gets the chance to ask others who have already tried the strategy, gaining anecdotal evidence of how well it works, like the following:
I have! I get way better results with Hyde. On complex questions, I also get good results when I break the user’s query into X independent questions that need to be answered, and hit each independently.
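For context, HyDE embeds a hypothetical answer rather than the raw query and searches with that. Combined with the query-decomposition trick from the comment above, a sketch (model names and prompts are assumptions) could look like:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def hyde_search(query: str, chunk_embeddings: np.ndarray,
                chunks: list[str], k: int = 5) -> list[str]:
    """HyDE: embed a hypothetical answer instead of the query, then run the
    usual similarity search over the chunk embeddings."""
    hypothetical = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": f"Write a short passage that would answer: {query}"}],
    ).choices[0].message.content
    q = np.array(client.embeddings.create(
        model="text-embedding-3-small", input=[hypothetical]).data[0].embedding)
    sims = chunk_embeddings @ q / (
        np.linalg.norm(chunk_embeddings, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]

def decompose(query: str) -> list[str]:
    """The commenter's other trick: break a complex query into independent sub-questions."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": "Split this into independent questions, one per line:\n" + query}],
    )
    return [q.strip() for q in resp.choices[0].message.content.splitlines() if q.strip()]
```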
Others simply express their gratitude:
“Thanks for the after action report! I’ve been hoping to solve problems like this myself, and it’s really nice seeing the way you handled it.”
The OP also helps clarify certain parts of their RAG system. For example, someone asks why they use a reranker if they already retrieve the documents with the highest cosine similarity; OP explains that if the LLM has a context window that can only fit 3 of the 5 semantically similar documents, the reranker picks the most relevant ones at that last stage.
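That retrieve-then-rerank step, sketched with sentence-transformers; the specific cross-encoder model is a common choice I am assuming here, not necessarily what OP uses:

```python
from sentence_transformers import CrossEncoder

# Any cross-encoder reranker works here; this MS MARCO model is a common default.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_to_fit(query: str, candidates: list[str], context_budget: int = 3) -> list[str]:
    """`candidates` are, say, the top-5 chunks by cosine similarity; the reranker
    keeps only the `context_budget` chunks that actually fit in the LLM's context."""
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:context_budget]]
```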
One notable development within the comments is a shared concern that many people are experimenting with how to make RAG systems better, but nobody (perhaps an overstatement) is gathering results in an objective way and building a leaderboard. People talk about building a list of Q&A pairs from source documents and comparing different solutions by how closely their answers match the desired outcomes. They then discuss how Q&A processing is the hardest thing to solve, since LLMs can easily hallucinate what’s in a document depending on how well they manage context.
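A toy version of the evaluation loop being proposed: generate Q&A pairs from the source documents, run each RAG variant on the questions, and score how closely its answers match the references. The string-matching scorer below is a placeholder assumption; real comparisons would use an LLM judge or task-specific metrics.

```python
from difflib import SequenceMatcher
from typing import Callable

def evaluate_rag(rag_answer: Callable[[str], str],
                 qa_pairs: list[tuple[str, str]]) -> float:
    """Average similarity between the system's answers and the reference answers."""
    scores = []
    for question, reference in qa_pairs:
        prediction = rag_answer(question)
        scores.append(SequenceMatcher(None, prediction.lower(), reference.lower()).ratio())
    return sum(scores) / len(scores)

# A leaderboard is then just this score computed per RAG variant on the same qa_pairs:
# results = {name: evaluate_rag(fn, qa_pairs) for name, fn in variants.items()}
```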
It’s notable that a benchmark for comparing RAG is brought up. Posts like this (which I assume there are many more of) are places where people can subjectively share and compare different methodologies, but there is still a demand for something more objective and more scalable. In fact, another person comments that this is how an open source community makes progress: there are dozens experimenting with RAG, and when people learn from early trials and errors and improve on them, more people can learn in turn, and the positive loop continues. They attribute the advances that have already occurred in how people think about RAG to this positive loop. On a similar note:
Interesting take. I definitely agree with a more organized approach, though I like the creativity this chaos brings. If there were “default” approaches, imo we’d all be less inclined to try out new things. On the other hand, search has been around for 20+ years but hasn’t been “solved” once and for all. To me, this is search and then some. Why would you expect it to be “solved” in such a short period of time?
Noticed two very interesting user-recruitment comments:
Completely agree with you… I think there’s an opportunity to benchmark and create an easier set of tools to optimize around. We’d love to pick your brain on it as that’s what we’re building towards but are eager to get user’s input to what we’re building. Here’s a link to our userinterview panel where we’re running a $50 incentive to chat with us for 30 minutes: https://www.userinterviews.com/projects/5pvfCYnAnA/apply
u/greevous00 u/snexus_d or others on this thread — we’re a stealth funded working on improving and streamlining RAG implementation, QA benchmarking, flexibility, etc. and would love to chat with you on your experience. If you’re willing to, we’ve got a userinterview panel with a $50 incentive right now: https://www.userinterviews.com/projects/5pvfCYnAnA/apply
Some members share very recent work that actually implements benchmarks of RAG.
Wow, this post is actually what inspired AutoRAG! Quote:
“Hi! When I read your comment, it hit me so hard and made my mind to make this by myself. It is called AutoRAG. We made configuration YAML file for setting up RAG pipeline experiment, and automatically evaluate and find optimal RAG pipeline. With this, you can select an ideal base model, and benchmarking easily without struggling Langchain or LlamaIndex code again and again. Plus, since it is single YAML file, sharing RAG pipeline with each other is so much easier. I really hope AutoRAG can be one of the solutions to solve issues in RAG you mentioned… Please check this out and leave some feedbacks or comments. Thanks!”
Others leave anecdotes of their own experiments:
For code-bases specifically, following these practices really helps:
- Include a ctags “tags” file in the RAG db
- Include a description of framework and application file/folder conventions in the RAG db
- Include as many software design docs as possible in the RAG db
- Vector embeddings should include a header with the source filename and chunk number
- Overlap vectored chunks by 10-20% of size
- Two-pass query: make sure to include RAG source filename references in the first query output, then run the same query again with the previous response in the query context.
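Two of those suggestions, 10-20% chunk overlap and a filename/chunk-number header on every embedded chunk, could look roughly like this; the chunk size and header format are assumptions:

```python
def chunk_source_file(path: str, text: str,
                      chunk_size: int = 1000, overlap_frac: float = 0.15) -> list[str]:
    """Split a source file into chunks overlapping by ~15%, each prefixed with a
    header naming the source file and chunk number so the embedding carries provenance."""
    step = int(chunk_size * (1 - overlap_frac))  # consecutive chunks share ~15% of text
    return [
        f"# source: {path} (chunk {n})\n{text[start:start + chunk_size]}"
        for n, start in enumerate(range(0, len(text), step))
    ]
```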
And some troubleshoot implementing what OP did to their own application:
Yeah I just incorporated one, and omg that changes a lot. I have been basically just giving the similarity search output with top 3-5. But I retrieved 10 chunks and incorporated a cross encoder. The change is so significant. I am not exactly satisfied by the way this crossencoder ranks, but it is better than what the similarity search does. Can you suggest me a cross encoder?
Others recommend libraries for the gap OP was not able to fill, i.e. embedding PDF documents that aren’t structured like plain-text ones.
Others pose new questions, like whether OP has any take on how the project might be extended to handle follow-up questions whose answers were not in the ranked documents retrieved for the previous query: how would it search for the missing data?