Search pipeline
Full-text first. Vectors only rerank. If it’s not there, we say so.
luplo’s search is deliberately unromantic. It is a four-stage pipeline
where Postgres tsquery does the retrieving, the glossary does the
rewriting, and vectors — when present — only re-order the candidates
that tsquery already found. If retrieval finds nothing, the answer is
nothing — not a synthesized guess.
Query dialect
The search surface accepts a tiny web-search-style dialect. The grammar is intentionally small; nothing else parses.
| Syntax | Meaning | Example |
|---|---|---|
| | required term (AND-joined with siblings) | |
| | adjacent-word match | |
| | disjunction (the literal word OR) | |
| | negated term | |
| | negated phrase | |
Clauses are AND-joined by default. An OR run folds its neighbours
into a disjunction group. Glossary expansion is applied only to
required and OR-group terms — phrases and negated tokens pass
through literally (expanding a negation would silently re-include the
concept the user excluded).
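The clause rules above can be sketched as a small parser. This is a minimal illustration, not luplo's actual implementation: the `ParsedQuery` model, the `parse` function, and the token shapes (double-quoted phrases, a leading hyphen for negation) are assumptions made for the example.

```python
import re
from dataclasses import dataclass, field

# Hypothetical clause model; the real luplo parser may be shaped differently.
@dataclass
class ParsedQuery:
    required: list = field(default_factory=list)   # single required terms
    phrases: list = field(default_factory=list)    # required phrases (kept literal)
    or_groups: list = field(default_factory=list)  # each entry is one OR run
    negated: list = field(default_factory=list)    # negated terms or phrases (kept literal)

# Assumed token shape: optional leading "-", then a quoted phrase or a bare word.
TOKEN = re.compile(r'(-?)(?:"([^"]*)"|(\S+))')

def parse(query: str) -> ParsedQuery:
    result = ParsedQuery()
    group = []         # terms of the OR run currently being built
    join_next = False  # True immediately after the literal word OR

    def flush():
        # A run of one term is a plain required term; longer runs are OR groups.
        if len(group) > 1:
            result.or_groups.append(list(group))
        elif group:
            result.required.append(group[0])
        group.clear()

    for neg, phrase, word in TOKEN.findall(query):
        text = phrase or word
        if not text:
            continue
        if not neg and not phrase and text == "OR":
            join_next = bool(group)  # an OR with no left neighbour is ignored
            continue
        if neg:
            flush()
            result.negated.append(text)
        elif phrase:
            flush()
            result.phrases.append(text)
        else:
            if not join_next:
                flush()
            group.append(text)
        join_next = False
    flush()
    return result

q = parse("A OR B -C -D")
print(q.or_groups, q.negated)
```

In this sketch only `required` and `or_groups` would be handed to glossary expansion; `phrases` and `negated` are kept apart so they can pass through literally.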
What is not supported
- Parentheses for grouping. Nested tsquery-native syntax like `(!A & B) & !C` does not parse. Use De Morgan’s laws to rewrite: `(!A & B) & !C` becomes `B -A -C`; `(A | B) & !(C | D)` becomes `A OR B -C -D`.
- Operators as literals. `&`, `|`, `!`, `(`, `)` are passed through as part of the word and almost always return zero hits under the `simple` dictionary. Avoid typing them unless you actually want the character in the lexeme.
- Regex / fuzzy matching. Out of scope; see the philosophy doc’s honesty-over-coverage commitment.
Worked example
The filter “women who are not men and not in their 50s”:
# Literal tsquery you might type: (!남자 & 여자) & !50대
# luplo dialect equivalent:
여자 -남자 -50대
Both describe the same set. The dialect version parses; the literal
tsquery version does not. If the negated-conjunction case (!(A & B),
equivalent to !A | !B, which cannot be written as AND-joined negations)
starts coming up, open an issue: parens support is a declared v0.7
candidate, not a forever “no”.
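The rewrites can be checked mechanically with a truth table. The sketch below treats each term as a boolean ("this document contains the term") and compares the parenthesised form with its flat dialect form; the `equivalent` helper is a hypothetical name introduced for illustration.

```python
from itertools import product

def equivalent(f, g, arity):
    """True if f and g agree on every combination of term presence."""
    return all(f(*v) == g(*v) for v in product([False, True], repeat=arity))

# (!A & B) & !C   versus   B -A -C
print(equivalent(lambda a, b, c: ((not a) and b) and (not c),
                 lambda a, b, c: b and (not a) and (not c), 3))

# (A | B) & !(C | D)   versus   A OR B -C -D
print(equivalent(lambda a, b, c, d: (a or b) and not (c or d),
                 lambda a, b, c, d: (a or b) and (not c) and (not d), 4))

# !(A & B) is NOT the same as -A -B: it also admits documents containing only one of A or B.
print(equivalent(lambda a, b: not (a and b),
                 lambda a, b: (not a) and (not b), 2))
```

The first two checks hold and the third fails, which is why the negated conjunction has no equivalent without grouping.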
The four stages
user query
│
▼
┌───────────────────┐
│ 1. Context router │ project_id auto-filled from .luplo,
│ │ optional system filter
└────────┬──────────┘
▼
┌───────────────────┐
│ 2. Glossary │ normalise → strict alias expansion:
│ expansion │ "auth" → (auth | authentication | sign-in)
└────────┬──────────┘
▼
┌───────────────────┐
│ 3. tsquery │ Postgres full-text search over
│ retrieval │ items.ts (GIN index scan)
└────────┬──────────┘
▼
┌───────────────────┐
│ 4. Vector rerank │ OPTIONAL. pgvector cosine similarity
│ (optional) │ reorders the tsquery candidate set.
└────────┬──────────┘
▼
results
(with explicit match reasons: which aliases matched which fields)
1. Context routing
Every search carries a project_id. In the CLI it comes from .luplo;
in MCP it comes from the tool argument. The router can also apply a
system filter so that “auth” queries do not drag in items from the
“rendering” system. Nothing magical — just scoped WHERE clauses.
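A scoped WHERE clause of this kind could be assembled as follows. This is a sketch under assumptions: the `system_id` column name, the `scope_clause` helper, and the psycopg-style `%(name)s` placeholders are illustrative and not taken from luplo's schema or code.

```python
def scope_clause(project_id, system_ids=None):
    """Build a parameterised WHERE fragment that scopes a search.

    project_id is always required; system_ids is the optional system filter.
    Column names here are assumptions for the example.
    """
    clauses = ["project_id = %(project_id)s"]
    params = {"project_id": project_id}
    if system_ids:
        # ANY(...) matches when the column equals any element of the bound array.
        clauses.append("system_id = ANY(%(system_ids)s)")
        params["system_ids"] = list(system_ids)
    return " AND ".join(clauses), params

where, params = scope_clause("proj-1", ["sys-auth"])
print(where)
```

Values are bound as parameters rather than interpolated into the SQL text, so the scope cannot be widened by a crafted query string.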
2. Glossary expansion
The glossary is the mechanism that lets "auth" find items indexed
under "authentication" or "sign-in" without requiring the caller to
know every alias. The pipeline is strict-first, with three layers:
1. Deterministic normalization. Lowercase, whitespace collapse, Korean morpheme splitting. Zero false positives: this step is purely mechanical.
2. Strict LLM matching. Only translation-grade synonyms (`sign-in` ↔ `login`) make it into a glossary group. The prompt is tuned so that NONE is a better answer than a wrong grouping.
3. Human curation queue. Candidates the LLM is unsure about go into `glossary_terms` with `status='pending'`, visible via `lp glossary pending` and `luplo_page_sync`.
The query is rewritten into a tsquery expression that ORs the aliases within each matched group:
"auth rate limit"
→ (auth | authentication | sign-in) & ("rate limit" | throttle | quota)
Every alias that fires is recorded with the result, so downstream tooling can show why a match was returned.
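The rewrite step can be sketched as a lookup over alias groups. The in-memory `GROUPS` list and the `expand` function are hypothetical stand-ins for the glossary tables, the input is assumed to be already split into terms, and the output mirrors the illustrative notation above rather than a string guaranteed to be valid `to_tsquery` input.

```python
# Hypothetical in-memory glossary; luplo stores groups in the database.
GROUPS = [
    ["auth", "authentication", "sign-in"],
    ["rate limit", "throttle", "quota"],
]

def normalise(term):
    # Deterministic layer only: lowercase and collapse whitespace.
    return " ".join(term.lower().split())

def fmt(alias):
    # Quote multi-word aliases so they read as a single unit.
    return f'"{alias}"' if " " in alias else alias

def expand(terms):
    """Return (expression, fired) where fired maps each matched term to its group."""
    parts, fired = [], {}
    for term in terms:
        key = normalise(term)
        group = next((g for g in GROUPS if key in g), None)
        if group is None:
            parts.append(fmt(key))  # no group: the term passes through unchanged
        else:
            parts.append("(" + " | ".join(fmt(a) for a in group) + ")")
            fired[key] = list(group)  # recorded so results can explain the match
    return " & ".join(parts), fired

expr, fired = expand(["auth", "rate limit"])
print(expr)
```

The `fired` mapping is what makes the match reasons reportable: each result can name the group that brought it in.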
is_protected: when clustering should stay away
The LLM flags terms as is_protected=true whenever they look like
identifiers that must never be pulled into a synonym cluster:
- Upper-case acronyms (`JWT`, `API`, `OTP`, `TLS`, `RPC`)
- Programming identifiers (`snake_case`, `camelCase`)
- Proper nouns not in the general dictionary: project codenames like `Prometheus`, `Sentinel`, `Gatekeeper`
Protected terms participate in exact match and deterministic
normalization, but the strict LLM step refuses to cluster them with
anything else. This is the guardrail that keeps OTP from becoming an
alias of opt or Sentinel from being merged with guard.
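In luplo the flag is set by the LLM, but the first two categories are regular enough to illustrate with patterns. The sketch below is a deterministic approximation, not the actual mechanism: the regexes, the `looks_protected` function, and the explicit `codenames` set are all assumptions for the example.

```python
import re

# Illustrative patterns only; the real flag comes from the LLM step.
ACRONYM = re.compile(r"^[A-Z][A-Z0-9]+$")            # JWT, OTP, TLS
SNAKE = re.compile(r"^[a-z0-9]+(?:_[a-z0-9]+)+$")    # snake_case
CAMEL = re.compile(r"^[a-z]+(?:[A-Z][a-z0-9]*)+$")   # camelCase

def looks_protected(term, codenames=frozenset()):
    """Heuristic: should this term be kept out of synonym clustering?"""
    return bool(
        ACRONYM.match(term)
        or SNAKE.match(term)
        or CAMEL.match(term)
        or term in codenames  # proper nouns need a list; no pattern can spot them
    )

print(looks_protected("OTP"), looks_protected("opt"))
print(looks_protected("Sentinel", codenames={"Sentinel", "Gatekeeper"}))
```

Case is the whole signal here, which is why such a check would have to run on the raw term before lowercasing.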
3. tsquery retrieval
Retrieval runs against the generated items.ts column — a concatenation
of title, body, rationale, alternatives, tags, and (when present) a few
context fields — indexed by GIN.
The result limit at this stage is intentionally generous (default:
fetch limit × 4 candidates) so the vector reranker, when enabled, has
room to re-order.
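A retrieval query with the over-fetch applied might look like the sketch below. `to_tsquery`, the `@@` match operator, and `ts_rank` are standard Postgres full-text features; the `id` column, the ordering by `ts_rank`, the `retrieval_query` helper, and the psycopg-style placeholders are assumptions for illustration.

```python
OVERFETCH = 4  # candidates fetched per requested result, per the default above

def retrieval_query(tsquery, project_id, limit):
    """Return (sql, params) for the candidate-fetching step."""
    sql = (
        "SELECT id, ts_rank(ts, q) AS rank "
        "FROM items, to_tsquery('simple', %(tsquery)s) AS q "
        "WHERE project_id = %(project_id)s AND ts @@ q "
        "ORDER BY rank DESC "
        "LIMIT %(candidates)s"
    )
    params = {
        "tsquery": tsquery,
        "project_id": project_id,
        "candidates": limit * OVERFETCH,  # leave headroom for the reranker
    }
    return sql, params

sql, params = retrieval_query("auth & !legacy", "proj-1", limit=10)
print(params["candidates"])
```

With no reranker installed, the caller would simply truncate the candidates back to `limit`.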
4. Vector reranking (optional)
When the vector-local extra is installed and pgvector is available,
each candidate’s embedding is compared to the query embedding via
cosine similarity. The top limit results after reranking are returned.
Vector search never originates candidates. If tsquery returns zero rows, vector rerank has nothing to do, and the search returns empty. This is the honesty rule — see Philosophy — encoded in code.
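The rerank-only rule can be sketched in plain Python. This is an illustration under assumptions: the `rerank` function and its input shape are hypothetical, and placing candidates without an embedding last is a choice made for the example rather than documented behaviour.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rerank(query_vec, candidates, limit):
    """Reorder tsquery candidates by similarity; never add new ones.

    candidates: list of (item_id, embedding or None), in tsquery order.
    """
    if not candidates:
        return []  # nothing retrieved means nothing returned

    def score(candidate):
        embedding = candidate[1]
        # Assumption: rows without an embedding sort after all scored rows.
        return cosine(query_vec, embedding) if embedding is not None else -1.0

    # sorted() is stable, so ties keep their tsquery order.
    ranked = sorted(candidates, key=score, reverse=True)
    return [item_id for item_id, _ in ranked[:limit]]

print(rerank([1.0, 0.0], [("a", [0.0, 1.0]), ("b", [1.0, 0.0]), ("c", None)], limit=2))
```

The function only ever returns ids it was given, so an empty candidate list can only produce an empty result.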
Embedding backends
Three drop-in backends exist:
| Backend | Dimensions | When to use |
|---|---|---|
| | n/a | Don’t want the ML dependency. Search is still glossary + tsquery, just no rerank. |
| | 1024 | |
| | 1024 | Call an external embedding service. For deployments where the worker can afford it but clients can’t. |
Switching backends does not re-embed history automatically — new writes get embeddings under the new backend; older rows keep whatever they had (including NULL).
Why this shape, not RAG
luplo’s domain is engineering decisions: rationale, alternatives, policy constraints. Semantic proximity is useful, but traceable retrieval is the job. When a future maintainer asks “why did we decide X?” the answer must come with a receipt:
Exact terms that matched
Glossary groups that fired
Item ids and supersedes chain
A pure vector search fabricates relevance from embedding distance and cannot produce that receipt. luplo keeps vectors in a ranking role so retrieval always has a defensible reason to show you what it showed.
How to tune it
- Missing matches? Add aliases via `lp glossary pending` → approve.
- False positives in the glossary? Reject them; they’ll land in `glossary_rejections` and never be suggested again.
- Need closer-to-semantic ranking? Install `vector-local`. Existing writes will rerank from the next worker pass onward.
- Want to restrict to a system? `lp items search "jwt" --system auth` (CLI) or `system_ids=['<uuid>']` (MCP).
Next
Connecting an MCP client — hooking MCP clients to these tools.
Semantic impact categories — how item edits are categorised (related but distinct from search).