Search pipeline

Full-text first. Vectors only rerank. If it’s not there, we say so.

luplo’s search is deliberately unromantic. It is a four-stage pipeline where Postgres tsquery does the retrieving, the glossary does the rewriting, and vectors — when present — only re-order the candidates that tsquery already found. If retrieval finds nothing, the answer is nothing — not a synthesized guess.

Query dialect

The search surface accepts a tiny web-search-style dialect. The grammar is intentionally small; nothing else parses.

| Syntax | Meaning | Example |
| --- | --- | --- |
| `word` | required term (AND-joined with siblings) | `auth budget` |
| `"exact phrase"` | adjacent-word match | `"JWT rotation"` |
| `word OR word` | disjunction (the literal word `OR`, uppercase) | `auth OR password` |
| `-word` | negated term | `auth -session` |
| `-"exact phrase"` | negated phrase | `auth -"session cookie"` |

Clauses are AND-joined by default. A run of OR-joined words folds into a single disjunction group, which is then AND-joined with the remaining clauses. Glossary expansion is applied only to required and OR-group terms; phrases and negated tokens pass through literally (expanding a negation would silently re-include the concept the user excluded).
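The grammar is small enough to sketch as a tokenizer plus one pass over the tokens. The following is a minimal illustration, not luplo's parser: `ParsedQuery`, `parse`, and the field names are hypothetical, and it assumes that OR groups contain only bare words and that a dangling `OR` is kept as a plain term.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# One token: optional leading '-', then either a "quoted phrase" or a bare word.
TOKEN = re.compile(r'(-?)(?:"([^"]*)"|(\S+))')

Token = Tuple[bool, Optional[str], Optional[str]]  # (negated, phrase, word)


@dataclass
class ParsedQuery:
    required: List[str] = field(default_factory=list)         # AND-joined bare terms
    phrases: List[str] = field(default_factory=list)          # adjacent-word matches
    or_groups: List[List[str]] = field(default_factory=list)  # one list per disjunction
    negated: List[str] = field(default_factory=list)          # excluded terms
    negated_phrases: List[str] = field(default_factory=list)  # excluded phrases


def _is_plain_word(tok: Token) -> bool:
    negated, phrase, word = tok
    return not negated and phrase is None and word != "OR"


def _is_or(tok: Token) -> bool:
    return tok == (False, None, "OR")


def parse(query: str) -> ParsedQuery:
    tokens: List[Token] = [
        (m.group(1) == "-", m.group(2), m.group(3)) for m in TOKEN.finditer(query)
    ]
    out = ParsedQuery()
    i = 0
    while i < len(tokens):
        negated, phrase, word = tokens[i]
        if phrase is not None:
            (out.negated_phrases if negated else out.phrases).append(phrase)
        elif negated:
            out.negated.append(word)
        else:
            group = [word]
            # Fold "word OR word OR word" into a single disjunction group.
            while (
                _is_plain_word(tokens[i])
                and i + 2 < len(tokens)
                and _is_or(tokens[i + 1])
                and _is_plain_word(tokens[i + 2])
            ):
                group.append(tokens[i + 2][2])
                i += 2
            if len(group) > 1:
                out.or_groups.append(group)
            else:
                out.required.append(word)
        i += 1
    return out


parsed = parse('auth OR password -session "JWT rotation" budget')
```

Keeping the five buckets separate makes the expansion rule easy to enforce: only `required` and `or_groups` would be handed to the glossary, while the other three pass through untouched.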

What is not supported

Parentheses for grouping. Nested tsquery-native syntax like (!A & B) & !C does not parse. Flatten the expression, applying De Morgan’s laws wherever a negation covers a group: (!A & B) & !C becomes B -A -C; (A | B) & !(C | D) becomes A OR B -C -D.
Operators as literals. &, |, !, (, ) are passed through as part of the word and almost always return zero hits under the simple dictionary. Avoid typing them unless you actually want the character in the lexeme.
Regex / fuzzy matching. Out of scope — see the philosophy doc’s honesty-over-coverage commitment.

Worked example

The filter “women who are not men and not in their 50s” (여자 = woman, 남자 = man, 50대 = 50s):

# Literal tsquery you might type: (!남자 & 여자) & !50대
# luplo dialect equivalent:
여자 -남자 -50대

Both describe the same set. The dialect version parses; the literal tsquery version does not. If the negated-AND case (!(A & B), which De Morgan’s laws turn into !A | !B, so plain AND-joined negation cannot express it) starts coming up, open an issue. Parens support is a declared v0.7 candidate, not a forever “no”.
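The rewrites in this section can be checked mechanically with a truth table. The sketch below is only a sanity check on the boolean algebra, using a hypothetical helper `equivalent`; it says nothing about how luplo parses queries.

```python
from itertools import product


def equivalent(f, g, arity):
    """True if f and g agree on every combination of boolean inputs."""
    return all(f(*v) == g(*v) for v in product([False, True], repeat=arity))


# (!A & B) & !C   versus the dialect form   B -A -C
flattening_ok = equivalent(
    lambda a, b, c: ((not a) and b) and (not c),
    lambda a, b, c: b and (not a) and (not c),
    3,
)

# (A | B) & !(C | D)   versus   A OR B -C -D
de_morgan_ok = equivalent(
    lambda a, b, c, d: (a or b) and not (c or d),
    lambda a, b, c, d: (a or b) and (not c) and (not d),
    4,
)

# !(A & B) is NOT the same as -A -B (which means !A & !B) ...
negated_and_is_plain_negation = equivalent(
    lambda a, b: not (a and b),
    lambda a, b: (not a) and (not b),
    2,
)

# ... it is !A | !B, a disjunction of negations.
negated_and_is_or_of_negations = equivalent(
    lambda a, b: not (a and b),
    lambda a, b: (not a) or (not b),
    2,
)
```

The last two checks show why the unsupported case needs either parentheses or an OR over negated terms.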

The four stages

   user query
        │
        ▼
┌───────────────────┐
│ 1. Context router │   project_id auto-filled from .luplo,
│                   │   optional system filter
└────────┬──────────┘
         ▼
┌───────────────────┐
│ 2. Glossary       │   normalise → strict alias expansion:
│    expansion      │   "auth" → (auth | authentication | sign-in)
└────────┬──────────┘
         ▼
┌───────────────────┐
│ 3. tsquery        │   Postgres full-text search over
│    retrieval      │   items.ts  (GIN index scan)
└────────┬──────────┘
         ▼
┌───────────────────┐
│ 4. Vector rerank  │   OPTIONAL. pgvector cosine similarity
│    (optional)     │   reorders the tsquery candidate set.
└────────┬──────────┘
         ▼
      results
(with explicit match reasons: which aliases matched which fields)

1. Context routing

Every search carries a project_id. In the CLI it comes from .luplo; in MCP it comes from the tool argument. The router can also apply a system filter so that “auth” queries do not drag in items from the “rendering” system. Nothing magical — just scoped WHERE clauses.
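As an illustration of "scoped WHERE clauses", a router along these lines could build the filter as a parameterized fragment. This is a sketch under assumptions: the function name `scope_clause` and the column name `system_id` are hypothetical, and the `%s` placeholders follow the psycopg convention, which luplo may or may not use.

```python
from typing import List, Optional, Sequence, Tuple


def scope_clause(
    project_id: str, system_ids: Optional[Sequence[str]] = None
) -> Tuple[str, List[object]]:
    """Return a WHERE fragment and its parameters; values are never inlined."""
    clauses = ["project_id = %s"]
    params: List[object] = [project_id]
    if system_ids:
        # Hypothetical column; restricts results to the listed systems.
        clauses.append("system_id = ANY(%s)")
        params.append(list(system_ids))
    return " AND ".join(clauses), params


where_all, params_all = scope_clause("proj-1")
where_auth, params_auth = scope_clause("proj-1", ["sys-auth"])
```

The project filter is always present; the system filter is added only when the caller supplies one.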

2. Glossary expansion

The glossary is the mechanism that lets "auth" find items indexed under "authentication" or "sign-in" without requiring the caller to know every alias. The pipeline is strict-first, with three layers:

Deterministic normalization. Lowercase, whitespace collapse, Korean morpheme splitting. Zero false positives — this step is purely mechanical.
Strict LLM matching. Only translation-grade synonyms (sign-in ↔ login) make it into a glossary group. The prompt is tuned so that NONE is a better answer than a wrong grouping.
Human curation queue. Candidates the LLM is unsure about go into glossary_terms with status='pending', visible via lp glossary pending and luplo_page_sync.

The query is rewritten into a tsquery expression that ORs the aliases within each matched group:

"auth rate limit"
  → (auth | authentication | sign-in) & ("rate limit" | throttle | quota)

Every alias that fires is recorded with the result, so downstream tooling can show why a match was returned.
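The rewrite step can be sketched as a lookup over alias groups that also records which group fired. This is a minimal illustration with hypothetical names (`GROUPS`, `expand`), assuming the query has already been segmented into terms; it renders the display notation used in this document, which is not necessarily the exact string passed to Postgres.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical in-memory glossary; each list is one approved alias group.
GROUPS: List[List[str]] = [
    ["auth", "authentication", "sign-in"],
    ["rate limit", "throttle", "quota"],
]


def _fmt(alias: str) -> str:
    # Multi-word aliases are quoted so they read as a phrase.
    return f'"{alias}"' if " " in alias else alias


def expand(
    terms: Sequence[str], groups: Sequence[Sequence[str]]
) -> Tuple[str, Dict[str, List[str]]]:
    clauses: List[str] = []
    fired: Dict[str, List[str]] = {}  # term -> aliases that were OR-ed in
    for term in terms:
        key = term.lower()
        group = next((g for g in groups if key in g), None)
        if group is None:
            clauses.append(_fmt(key))  # no alias group: term stays as-is
            continue
        ordered = [key] + [a for a in group if a != key]
        clauses.append("(" + " | ".join(_fmt(a) for a in ordered) + ")")
        fired[key] = ordered
    return " & ".join(clauses), fired


expression, fired = expand(["auth", "rate limit"], GROUPS)
```

Returning `fired` alongside the expression is what lets a result carry its match reasons.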

`is_protected`: when clustering should stay away

The LLM flags terms as is_protected=true whenever they look like identifiers that must never be pulled into a synonym cluster:

Upper-case acronyms (JWT, API, OTP, TLS, RPC)
Programming identifiers (snake_case, camelCase)
Proper nouns not in the general dictionary — project codenames like Prometheus, Sentinel, Gatekeeper

Protected terms participate in exact match and deterministic normalization, but the strict LLM step refuses to cluster them with anything else. This is the guardrail that keeps OTP from becoming an alias of opt or Sentinel from being merged with guard.
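In luplo the flag is set by the LLM. Purely to make the three categories concrete, here is a deterministic approximation using regular expressions. The function `looks_protected` and the `codenames` set are hypothetical; a pattern cannot know which proper nouns are missing from the general dictionary, so codenames have to be supplied explicitly in this sketch.

```python
import re
from typing import AbstractSet

ACRONYM = re.compile(r"^[A-Z]{2,}[0-9]*$")            # JWT, API, OTP
SNAKE = re.compile(r"^[a-z0-9]+(?:_[a-z0-9]+)+$")     # snake_case
CAMEL = re.compile(r"^[a-z]+(?:[A-Z][a-z0-9]*)+$")    # camelCase


def looks_protected(term: str, codenames: AbstractSet[str] = frozenset()) -> bool:
    """Heuristic stand-in for the LLM's is_protected decision."""
    return bool(
        ACRONYM.match(term)
        or SNAKE.match(term)
        or CAMEL.match(term)
        or term in codenames
    )


checks = {
    "OTP": looks_protected("OTP"),
    "opt": looks_protected("opt"),
    "snake_case": looks_protected("snake_case"),
    "camelCase": looks_protected("camelCase"),
    "Sentinel": looks_protected("Sentinel", codenames={"Sentinel"}),
}
```

Note that the check is case-sensitive on purpose: `OTP` is protected while `opt` is an ordinary word.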

3. tsquery retrieval

Retrieval runs against the generated items.ts column — a concatenation of title, body, rationale, alternatives, tags, and (when present) a few context fields — indexed by GIN.

The result limit at this stage is intentionally generous (default: fetch limit × 4 candidates) so the vector reranker, when enabled, has room to re-order.
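A retrieval query with that oversampling factor might look like the sketch below. It is an assumption-laden illustration: the function `build_retrieval`, the `id` and `project_id` columns, and the use of `ts_rank` for ordering are hypothetical, while `items.ts`, the `simple` dictionary, and the factor of 4 come from this page.

```python
from typing import List, Tuple


def build_retrieval(
    tsquery: str, project_id: str, limit: int, oversample: int = 4
) -> Tuple[str, List[object]]:
    """Build a parameterized full-text query that fetches limit * oversample rows."""
    sql = (
        "SELECT id, ts_rank(ts, q) AS rank "
        "FROM items, to_tsquery('simple', %s) AS q "
        "WHERE ts @@ q AND project_id = %s "  # @@ is the match operator served by the GIN index
        "ORDER BY rank DESC "
        "LIMIT %s"
    )
    return sql, [tsquery, project_id, limit * oversample]


sql, params = build_retrieval("(auth | authentication) & !session", "proj-1", limit=10)
```

With `limit=10` the candidate set holds up to 40 rows, of which at most 10 are returned after the optional rerank.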

4. Vector reranking (optional)

When the vector-local extra is installed and pgvector is available, each candidate’s embedding is compared to the query embedding via cosine similarity. The top limit results after reranking are returned.

Vector search never originates candidates. If tsquery returns zero rows, vector rerank has nothing to do, and the search returns empty. This is the honesty rule — see Philosophy — encoded in code.
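The rule that vectors only re-order can be sketched in plain Python. The names `cosine` and `rerank` are hypothetical, and placing rows without an embedding last, in their original tsquery order, is an assumption of this sketch rather than documented behaviour.

```python
import math
from typing import List, Optional, Sequence, Tuple

Candidate = Tuple[str, Optional[Sequence[float]]]  # (item_id, embedding or None)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rerank(
    query_vec: Optional[Sequence[float]], candidates: List[Candidate], limit: int
) -> List[str]:
    """Re-order tsquery candidates; never adds an item that tsquery did not return."""
    if not candidates:
        return []  # nothing retrieved, so nothing to rank
    if query_vec is None:
        return [item_id for item_id, _ in candidates[:limit]]  # no backend: keep tsquery order
    scored = []
    for position, (item_id, embedding) in enumerate(candidates):
        score = cosine(query_vec, embedding) if embedding is not None else float("-inf")
        # Sort by similarity descending, then by original tsquery position.
        scored.append((-score, position, item_id))
    scored.sort()
    return [item_id for _, _, item_id in scored[:limit]]


ranked = rerank([1.0, 0.0], [("a", [0.0, 1.0]), ("b", [1.0, 0.0]), ("c", None)], limit=2)
empty = rerank([1.0, 0.0], [], limit=5)
```

Because the function only permutes and truncates its input, an empty candidate list can only ever produce an empty result.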

Embedding backends

Three drop-in backends exist:

| Backend | Dimensions | When to use |
| --- | --- | --- |
| `null` (default) | n/a | You don’t want the ML dependency. Search still runs glossary + tsquery, just without reranking. |
| `local` | 1024 | Install with `uv sync --extra vector-local`; runs sentence-transformers locally (~500MB). |
| `remote` | 1024 | Calls an external embedding service. For deployments where the worker can afford it but clients can’t. |

Switching backends does not re-embed history automatically — new writes get embeddings under the new backend; older rows keep whatever they had (including NULL).
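The consequence of that rule is a table with mixed rows: some with embeddings, some with NULL. The sketch below illustrates this with stand-in classes. `NullEmbedder`, `StubLocalEmbedder`, and `write_item` are hypothetical, and the stub produces an arbitrary 1024-value vector rather than calling any real model.

```python
from typing import Dict, List, Optional


class NullEmbedder:
    """Stand-in for the `null` backend: stores no embedding."""

    def embed(self, text: str) -> Optional[List[float]]:
        return None


class StubLocalEmbedder:
    """Placeholder for a 1024-dimension backend; not a real model."""

    dimensions = 1024

    def embed(self, text: str) -> Optional[List[float]]:
        vec = [0.0] * self.dimensions
        vec[0] = float(len(text))  # arbitrary deterministic value for the demo
        return vec


def write_item(store: Dict[str, dict], item_id: str, text: str, embedder) -> None:
    # Each write is embedded by whichever backend is active at write time.
    store[item_id] = {"text": text, "embedding": embedder.embed(text)}


store: Dict[str, dict] = {}
write_item(store, "old", "JWT rotation policy", NullEmbedder())
write_item(store, "new", "session cookie lifetime", StubLocalEmbedder())  # after the switch

rows_without_embedding = [k for k, row in store.items() if row["embedding"] is None]
```

Any reranker reading such a table has to tolerate rows whose embedding is missing.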

Why this shape, not RAG

luplo’s domain is engineering decisions: rationale, alternatives, policy constraints. Semantic proximity is useful, but traceable retrieval is the job. When a future maintainer asks “why did we decide X?” the answer must come with a receipt:

Exact terms that matched
Glossary groups that fired
Item ids and supersedes chain

A pure vector search fabricates relevance from embedding distance and cannot produce that receipt. luplo keeps vectors in a ranking role so retrieval always has a defensible reason to show you what it showed.
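One way to picture the receipt is as a small record attached to each result. The shape below is hypothetical (`MatchReceipt` and its field names are not taken from luplo); it only mirrors the three items listed above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MatchReceipt:
    item_id: str
    matched_terms: Tuple[str, ...]                # exact terms that matched
    glossary_groups: Tuple[Tuple[str, ...], ...]  # alias groups that fired
    supersedes_chain: Tuple[str, ...] = ()        # older item ids, newest first

    def explain(self) -> str:
        parts = [f"item {self.item_id} matched {', '.join(self.matched_terms)}"]
        if self.glossary_groups:
            groups = ", ".join("(" + " | ".join(g) + ")" for g in self.glossary_groups)
            parts.append(f"glossary groups fired: {groups}")
        if self.supersedes_chain:
            parts.append("supersedes " + " -> ".join(self.supersedes_chain))
        return ". ".join(parts)


receipt = MatchReceipt(
    item_id="item-42",
    matched_terms=("authentication",),
    glossary_groups=(("auth", "authentication", "sign-in"),),
    supersedes_chain=("item-17",),
)
summary = receipt.explain()
```

An embedding distance has no equivalent of `matched_terms` or `glossary_groups`, which is the gap the section describes.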

How to tune it

Missing matches? Add aliases via lp glossary pending → approve.
False positives in the glossary? Reject them — they’ll land in glossary_rejections and never be suggested again.
Need closer-to-semantic ranking? Install vector-local. Existing writes will rerank from the next worker pass onward.
Want to restrict to a system? lp items search "jwt" --system auth (CLI) or system_ids=['<uuid>'] (MCP).

Next

Connecting an MCP client — hooking MCP clients to these tools.
Semantic impact categories — how item edits are categorised (related but distinct from search).