They are seeking to solve the notoriously hard problem of parsing and searching PDFs (tangential to yet another RAG system). This also makes it eligible for spc.

They dynamically generate chunks when a search happens, sending headers and sub-headers along with the chunk or chunks that were relevant. This mitigates several common chunking limitations: splitting mid-paragraph or mid-sentence, semantic chunking failing on lists, and the fact that LLMs do better with structured data.
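The post does not show code, but the idea of prepending the header hierarchy to a retrieved chunk at search time can be sketched roughly like this (the data model and names here are hypothetical):

```python
# Hypothetical section tree: each section knows its title and parent.
sections = {
    1: {"title": "Item 7. Management's Discussion", "parent": None},
    2: {"title": "Results of Operations", "parent": 1},
}

def assemble_context(chunk, sections):
    """Prefix a retrieved chunk with its header and sub-headers by
    walking up the section hierarchy at search time."""
    headers = []
    section_id = chunk["section"]
    while section_id is not None:
        section = sections[section_id]
        headers.append(section["title"])
        section_id = section["parent"]
    # Top-level header first, innermost sub-header last, then the chunk.
    return "\n".join(list(reversed(headers)) + [chunk["text"]])

chunk = {"section": 2, "text": "Revenue increased 12% year over year."}
```

The LLM then sees the chunk in its structural context rather than as an orphaned paragraph.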

This is a product so they do not have a repo, but they say, “The product is still in beta so if you’re actively trying to solve this, or a similar problem, we’re letting people use it for free, in exchange for feedback.”

One notable comment suggests a completely different, agentic approach: give the LLM a set of tools to interact with a PDF, such as semantic and keyword search tools that let it enter the data corpus at good locations. The commenter then spells out the entire approach, showing how you would specify the tools the LLM can use.
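The commenter's actual spec is not reproduced here, but tool definitions for such an agentic loop typically look like this (tool names and parameters below are assumptions, written in the JSON-schema style used by most LLM function-calling APIs):

```python
# Hypothetical tool definitions for an agentic PDF-reading loop.
pdf_tools = [
    {
        "name": "semantic_search",
        "description": "Return passages semantically similar to the query.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "top_k": {"type": "integer", "default": 5},
            },
            "required": ["query"],
        },
    },
    {
        "name": "keyword_search",
        "description": "Exact keyword search over the extracted PDF text.",
        "parameters": {
            "type": "object",
            "properties": {"term": {"type": "string"}},
            "required": ["term"],
        },
    },
    {
        "name": "read_pages",
        "description": "Return the raw text of a contiguous page range.",
        "parameters": {
            "type": "object",
            "properties": {
                "start": {"type": "integer"},
                "end": {"type": "integer"},
            },
            "required": ["start", "end"],
        },
    },
]
```

The model calls these tools in a loop, using search results to decide which pages to read in full, instead of relying on pre-chunked retrieval.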

One comment asks how they dealt with multi-level headers, merged cells, and subcategories, all of which are common in real-world tables. Here is a fun reply:

Primary dev here :). We try to detect individual cells first, then figure out which rows and columns they belong to. One of my favorite examples from an SEC filing: https://preview.redd.it/6137hdn6r09d1.png?width=1830&format=png&auto=webp&s=fc0e84bef706694ee334fa35f52b887542d5cb6f

You can see a few things:

1) One of our remaining bugs is that superscript text messes up row detection (check out the left-most columns).

2) The fiscal year parent header is above its children. We’re incorrectly seeing 3 columns in those headers, correctly seeing 2 columns in the table body, and correctly seeing spanning rows for “In millions,…” and “Change(3)”.

We ran into the exact same problem with spanning columns and rows for CSV / markdown outputs. We’re outputting tables as a flat list of cells. Each cell has an array of `rows` and an array of `cols`. I basically modeled it after HTML tables since they can represent just about anything.

Our full JSON is pretty verbose, so we put a somewhat simplified value above. It includes bounding boxes for each cell, plus our row/column analysis. The analysis is based on the bounding boxes, so if someone disagrees with our algorithm they can just write their own. Here’s the JSON for the table above: https://pastebin.com/ehSSTHum
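The flat-cell representation the dev describes can be sketched as follows. The quote only confirms the `rows` and `cols` arrays; the `text` field, the sample values, and the grid-expansion helper are illustrative:

```python
# Each cell lists every row and column index it spans, like HTML
# rowspan/colspan. A parent header spanning two columns just lists both.
cells = [
    {"text": "Fiscal year", "rows": [0], "cols": [1, 2]},  # spanning header
    {"text": "2023",        "rows": [1], "cols": [1]},
    {"text": "2022",        "rows": [1], "cols": [2]},
    {"text": "Revenue",     "rows": [2], "cols": [0]},
    {"text": "60,922",      "rows": [2], "cols": [1]},
    {"text": "26,974",      "rows": [2], "cols": [2]},
]

def to_grid(cells):
    """Expand the flat cell list into a dense 2-D grid; a spanning
    cell is copied into every (row, col) position it covers."""
    n_rows = 1 + max(r for c in cells for r in c["rows"])
    n_cols = 1 + max(k for c in cells for k in c["cols"])
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in cells:
        for r in cell["rows"]:
            for k in cell["cols"]:
                grid[r][k] = cell["text"]
    return grid
```

As the dev notes, this mirrors HTML tables: the flat list can represent merged cells and multi-level headers that a plain CSV or markdown grid cannot.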

Another noteworthy comment from the OP:

Dev here. We’re sharing an overview of the approach that’s worked for us. Our solution isn’t open source. We may sell it, and we may open source it (TBD), but the goal of the post is to share things that have worked for us and things that haven’t (specifically, that whitespace analysis and non-neural-net approaches can go really far, and that a hierarchy is better than flat content for search).

I’m planning on writing up a post that goes more into depth about our junk detection technique, but it was *way* too much content for a single post. Basically, if you create character ngram vectors for each block of text, you can use clustering with L1/Manhattan distance to super cheaply find nearly identical text within your document. If you have content blocks, you can run the whole thing with Pandas + Sklearn’s count vectorizer (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) and DBScan (https://scikit-learn.org/stable/modules/clustering.html#dbscan) in a few lines of code.

It’s really interesting to see why L1 distance works better than Cosine similarity (what folks default to with embeddings), and why sparse embeddings are really useful for this task. But, like I said, way too much for a single post.
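The "few lines of code" the dev alludes to might look like this. This is a minimal sketch of the stated recipe (CountVectorizer character ngrams + DBSCAN with Manhattan distance); the sample blocks and the `eps` value are illustrative, not the dev's actual settings:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import DBSCAN

# Repeated headers/footers are classic "junk" that varies by one digit.
blocks = [
    "Page 12 | ACME Corp Annual Report",
    "Page 13 | ACME Corp Annual Report",
    "Page 14 | ACME Corp Annual Report",
    "Revenue increased 12% year over year, driven by cloud growth.",
]

# Character 3-gram counts: near-identical strings differ in only a few
# ngrams, so their L1 distance is tiny.
vectors = CountVectorizer(analyzer="char", ngram_range=(3, 3)).fit_transform(blocks)

# eps is measured in differing ngram counts; small eps keeps only
# near-duplicates together. Label -1 marks unique (non-junk) blocks.
labels = DBSCAN(eps=10, metric="manhattan", min_samples=2).fit_predict(vectors)
```

Here the three page headers cluster together while the content sentence is left as noise, which is exactly the cheap near-duplicate detection the dev describes; L1 on sparse count vectors rewards exact ngram overlap rather than the directional similarity cosine measures.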

This is an interesting instance where they provide a detailed overview of their RAG approach without sharing the code, and are, in some ways, recruiting people interested in trying it out (free, in exchange for feedback).

The dev closes by inviting testers directly: “Could you send me a dm with your use case & issues? I’d be happy to set up a call this week!”