Semantic SQL · Start simple

Start with semantic SQL.
Keep going without leaving SQL.

Filter by meaning, not keywords.

MEANS, CLASSIFY, and SUMMARIZE work like native SQL. Connect any database and query from DBeaver, DataGrip, Tableau, or psql. No Python required. No notebooks. No new tools.

MEANS · CLASSIFY · SUMMARIZE
PostgreSQL · MySQL · Snowflake · BigQuery · ClickHouse · S3 · MCP · the web
SELECT company, region,
       SENTIMENT(last_ticket) AS mood
FROM   support_tickets
WHERE  last_ticket MEANS
         'thinking about switching vendors'
ORDER  BY mood ASC;
company           region    mood
Northwind Health  East     -0.86
Aster Labs        West     -0.73
Harbor Retail     Central  -0.61
3 rows · 24ms · cached

Down the rabbit hole

Start with a single semantic filter. By end of week, you're running verified AI aggregations across your entire warehouse.

Day 1 · Connect & query
start.sql
-- Point at your database
CREATE CONNECTION prod
  TYPE postgres HOST 'db.company.com'
  DATABASE 'analytics';

-- Query it with meaning
SELECT *
FROM   support_tickets
WHERE  body MEANS
  'billing dispute';
→ 23 rows matching semantic intent
Day 3 · AI creates the dimensions
topics.sql
SELECT
  TOPICS(tweet, 4) AS topic,
  CLASSIFY(tweet,
    'political, not-political')
    AS political,
  COUNT(*) AS tweets,
  AVG(likes) AS avg_likes
FROM   twitter_archive
GROUP BY topic, political;
→ You didn't define the topics. The data organized itself.
Week 1 · Your warehouse becomes the model
model.sql
CREATE MODEL churn_risk
  FROM (SELECT tenure, monthly_charges,
              contract, has_dependents,
              is_churned
        FROM   customers
        WHERE  created_at < '2026-01-01')
  TARGET is_churned
  TYPE classifier;

-- predict_churn_risk is now a scalar SQL function
SELECT account_name,
  predict_churn_risk(tenure, monthly_charges,
                     contract, has_dependents)
    AS risk
FROM   customer_accounts;
→ TabPFN v2 foundation model. One forward pass, no fit loop. Training-data hash tracks drift.
wonderland
§ Pipelines

Pipelines in one query.

THEN chains stages that operate on the whole result table. Profile, pivot, and chart with no DataFrames, no notebooks, no imports.

pipeline.sql
SELECT region, product, SUM(revenue) AS revenue
FROM   sales
WHERE  created_at >= '2026-01-01'
GROUP BY region, product

THEN STATS                                 -- profile every column
THEN PIVOT 'revenue by region'           -- smart cross-tab
THEN TO_VEGALITE 'bar chart by product';
→ One query. Four stages. You didn't open a notebook.

Every stage chains

THEN ANALYZE · THEN STATS · THEN PIVOT · THEN TO_VEGALITE · THEN TO_PLOTLY · THEN TO_PROPERTY_GRAPH · THEN CONDENSE · THEN PY · THEN SPEAK · THEN ENRICH · THEN FILTER · THEN GROUP · THEN TOP · THEN SAMPLE · THEN DEDUPE · THEN STYLIZE

Schema-fingerprinted. Same table shape + same prompt → instant cache hit. The LLM only runs the first time any shape sees the query.
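The caching contract described above can be pictured as a key derived from the table's shape plus the prompt, never from the values. A hypothetical sketch (the column layout and hashing choices here are assumptions, not DataRabbit internals):

```python
import hashlib

def cache_key(columns, prompt):
    """Key a pipeline stage by table shape + prompt, never by values.
    `columns` is a list of (name, type) pairs; order is normalized so
    logically identical schemas hash the same."""
    shape = "|".join(f"{n}:{t}" for n, t in sorted(columns))
    return hashlib.sha256(f"{shape}::{prompt}".encode()).hexdigest()

k1 = cache_key([("body", "TEXT"), ("score", "FLOAT")], "pivot by region")
k2 = cache_key([("score", "FLOAT"), ("body", "TEXT")], "pivot by region")
k3 = cache_key([("body", "TEXT"), ("score", "FLOAT")], "profile every column")
# k1 == k2: same shape + same prompt hits the cache; k3 misses
```

Because the key never includes row values, re-running the same pipeline on tomorrow's data is a cache hit as long as the schema and prompt are unchanged.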

§ Catalog

150+ built-in operators.

Filtering, classification, forecasting, training, vision, speech, knowledge graphs — all callable from SQL.

ML & Forecast

CREATE MODEL · FORECAST · predict_* · score_*

Vision, Speech & Docs

IMAGE_MATCHES · IMAGE_EMBED · TRANSCRIBE · parse_document

Semantic filters & joins

MEANS · CLASSIFY · SENTIMENT · SEMANTIC JOIN · CONTRADICTS · SIMILAR_TO

Dimensional

TOPICS · THEMES · CLUSTER · CONSENSUS · VIBES

Semantic aggregates

BEST · GOLDEN_RECORD · RANK_AGG · OUTLIERS

Semantic control flow

SEMANTIC_CASE · SEMANTIC_SWITCH

Causal & statistical

BAYESIAN_LIFT · TREATMENT_EFFECT · EXPECTED_LIFETIME · KAPLAN_MEIER

Argumentation

STEELMAN · COUNTERARGUMENT · WEAKNESSES · FALLACY

Data quality

VALIDATE · DEDUPE · CORRECT · FILL · LOOKS_LIKE

Parsing

PARSE · SMART_JSON · PARSE_ADDRESS · PARSE_NAME · SOUNDS_LIKE

Knowledge graph

TRIPLES · RICH_TRIPLES · RELATIONS · TO_PROPERTY_GRAPH

Retrieval & vectors

EMBED · VECTOR_SEARCH · EMBED_COLUMN

§ Join by meaning

Filter, group, and now JOIN all by meaning.

Entity resolution. Dedupe across sources. Fuzzy customer reconciliation. The last leg of the semantic trilogy is a SQL join condition that understands what your columns actually mean.

Backed by the same cross-encoder reranker that powers OUTLIERS. Every join pair gets a calibrated relevance score; threshold is tunable per query. No nested-loop embedding hacks.

semantic-join.sql
-- Match support tickets to warehouse customers
-- when there's no account ID to join on
SELECT c.company, t.subject,
       SENTIMENT(t.body) AS mood
FROM   customers c
SEMANTIC JOIN support_tickets t
  ON c.description ~ t.reported_by_blurb
WHERE  t.body MEANS 'churn risk'
ORDER  BY mood ASC;
→ No exact-match keys. No string similarity heuristics. Rows matched by what they mean.
§ Extend

Build your own operators.

Define custom AI functions and reusable pipeline stages in SQL DDL — no Python, no deploy pipeline. CREATE it, then SELECT with it or reuse it in THEN chains.

Not a fixed catalog

The same system defines row operators, THEN pipeline stages, and action workflows. DataRabbit is a mutable operator surface, not a hard-coded list of AI functions.

row operators · THEN stages · action workflows
operator.sql
CREATE SEMANTIC OPERATOR
  compliance_check(text VARCHAR)
  RETURNS VARCHAR
  PROMPT 'Check if this text violates
    our compliance policy.
    Return: pass, warn, or fail.
    Text: {{ input.text }}';

Then use it immediately:

SELECT clause, compliance_check(clause)
FROM   contracts
WHERE  compliance_check(clause)
       MEANS 'fail';
§ Sees, hears, reads, maps

SQL that sees, hears, reads,
and maps connections.

Audio columns. Image columns. PDF columns. Unstructured text. All first-class citizens of the query plan.

Hear

TRANSCRIBE
calls.sql
SELECT call_id, agent,
  TRANSCRIBE(recording) AS text,
  SENTIMENT(
    TRANSCRIBE(recording)
  ) AS mood,
  CLASSIFY(TRANSCRIBE(recording),
      'refund, renewal, tech')
    AS topic
FROM   support_calls;

Whisper large-v3-turbo. Chunked inference. Hour-long recordings compose with every text operator.
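Chunked inference here just means slicing long audio into overlapping windows and transcribing each one independently, with the overlap used to stitch text at the seams. A sketch of the windowing arithmetic (the window and overlap sizes are assumptions, not the product's actual settings):

```python
def chunk_windows(duration_s, window_s=30.0, overlap_s=2.0):
    """Slice a long recording into overlapping (start, end) windows.
    Each window is transcribed on its own; the 2 s overlap lets the
    stitcher align text across the seam."""
    windows = []
    t = 0.0
    while t < duration_s:
        windows.append((t, min(t + window_s, duration_s)))
        t += window_s - overlap_s
    return windows

windows = chunk_windows(3600)  # an hour of call audio
# 129 windows of <= 30 s each: the model never sees more than one window
```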

See

IMAGE_MATCHES
catalog.sql
SELECT product_id,
  photo IMAGE_MATCHES
    'contains a person'
    AS has_model,
  photo IMAGE_MATCHES
    'damaged or returned'
    AS is_damaged
FROM   catalog
WHERE  photo IMAGE_MATCHES
         'red sports car';

SigLIP 2 shared image/text space. Threshold-tunable, inline WHERE/SELECT predicates, no sidecar service.

Read

parse_document
invoices.sql
-- Invoices → columns, no keying
SELECT d.vendor, d.total,
       d.issue_date, d.line_items
FROM   s3('s3://invoices/') files
CROSS JOIN LATERAL
  parse_document(files.contents) d
WHERE  d.total > 1000
ORDER  BY d.issue_date DESC;

Donut, LayoutLMv3, Nougat, GOT-OCR2 — document foundation models. PDFs are columns. Invoices, receipts, contracts, forms.

Map

TRIPLES · GRAPH
kg.sql
-- extract + materialize a graph
SELECT * FROM emails,
  LATERAL rich_triples_rows(body) t
THEN TO_PROPERTY_GRAPH('org');

-- then traverse it
WITH RECURSIVE walk AS (
  SELECT source_id, target_id,
    ARRAY[predicate] AS path
  FROM   org_edges
  WHERE  source_id = 'Jane'
  UNION ALL
  SELECT w.source_id, e.target_id,
    w.path || e.predicate
  FROM   walk w
  JOIN   org_edges e
    ON   e.source_id = w.target_id
  WHERE  cardinality(w.path) < 5  -- bound the walk
) SELECT * FROM walk;

RICH_TRIPLES extracts (subject, predicate, object, evidence). TO_PROPERTY_GRAPH materializes relational node/edge tables. No graph DB.

§ Causal & statistical

The stats you'd hire a data scientist for.

Bayesian experimentation. Causal inference from observational data. Survival analysis. All as SQL operators — not as a Python notebook your PM can't read.

Experiment

BAYESIAN_LIFT
experiments.sql
-- Weekly: experiments with >95% P(lift)
SELECT experiment_id, variant,
  bayesian_lift(control, treatment)
    AS p_lift,
  bayesian_effect(control, treatment)
    AS expected_effect,
  credible_interval(control,
      treatment, 0.95) AS ci_95
FROM   experiment_stats
WHERE  bayesian_lift(control, treatment)
        > 0.95;

PyMC + arviz. Posterior distributions, credible intervals. Your weekly experimentation report is one query.
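The computation behind a lift probability can be pictured without PyMC: with Beta(1, 1) priors on two conversion rates, P(lift) is the posterior probability that the treatment rate exceeds the control rate. An illustrative stdlib-only Monte Carlo stand-in, not the product's PyMC implementation (the counts and seed below are made up):

```python
import random

def bayesian_lift(control_conv, control_n, treat_conv, treat_n,
                  draws=20_000, seed=7):
    """P(treatment rate > control rate) under Beta(1, 1) priors,
    estimated by sampling the two conjugate Beta posteriors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        c = rng.betavariate(1 + control_conv, 1 + control_n - control_conv)
        t = rng.betavariate(1 + treat_conv, 1 + treat_n - treat_conv)
        wins += t > c
    return wins / draws

# made-up experiment: 4.0% vs 6.0% conversion, 1000 users per arm
p_lift = bayesian_lift(control_conv=40, control_n=1000,
                       treat_conv=60, treat_n=1000)
```

Conjugacy makes the posteriors exact Betas, so sampling replaces MCMC entirely for this simple two-rate case.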

Cause

TREATMENT_EFFECT
causal.sql
-- Did the discount CAUSE retention,
-- or were we picking loyal customers?
SELECT treatment_effect(
  outcome     => retained_90d,
  treatment   => got_discount,
  covariates  => [plan, tenure,
                  arr, industry]
) AS ate
FROM   customers
WHERE  eligible_for_study;

EconML / DoWhy doubly-robust estimation. Propensity scoring + outcome modeling, covariates as an array.

Survive

EXPECTED_LIFETIME
survival.sql
-- Remaining subscription life, by plan
SELECT plan,
  AVG(expected_lifetime(
        tenure_days, churned))
    AS avg_remaining_days,
  kaplan_meier(tenure_days,
               churned, 0.5)
    AS median_lifetime
FROM   customers
GROUP  BY plan
ORDER  BY avg_remaining_days DESC;

lifelines. Kaplan-Meier curves, Cox proportional hazards, Weibull AFT. "When will this churn?" as a GROUP BY.
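The median-lifetime estimate can be sketched in a few lines. This is an illustrative pure-Python Kaplan-Meier median, not the lifelines implementation the product wraps, and the toy cohort is made up: survival drops at each event time by (1 - d/n), where d is the number of churns at that time and n the subjects still at risk.

```python
def kaplan_meier_median(durations, events):
    """Kaplan-Meier median: first time survival S(t) falls to 0.5.
    `events[i]` is 1 if subject i churned, 0 if censored (still active)."""
    pairs = sorted(zip(durations, events))
    n_at_risk = len(pairs)
    s = 1.0
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        d = removed = 0
        while i < len(pairs) and pairs[i][0] == t:
            d += pairs[i][1]       # churns at time t
            removed += 1           # everyone leaving the risk set at t
            i += 1
        if d:
            s *= 1 - d / n_at_risk
            if s <= 0.5:
                return t
        n_at_risk -= removed
    return None  # survival never reached 50%

# toy cohort: tenure in days, 1 = churned, 0 = censored
median_days = kaplan_meier_median(
    [5, 8, 12, 12, 20, 25, 30, 30], [1, 1, 1, 0, 1, 0, 1, 1])
```

Censored subjects still shrink the risk set without triggering a survival drop, which is the whole point of Kaplan-Meier over a naive average of churn times.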

SQL × Python × AI

Inline Python expressions, pandas pipelines, and NumPy — right inside your SQL. Chain with AI operators for transforms no single language can do alone.

SQL alone

SELECT state, COUNT(*) AS count FROM sightings GROUP BY state

+ Python

... THEN PY 'df.nlargest(5, "count")'

+ AI

... THEN ANALYZE('why these states?')
py() — inline expressions
SELECT name, salary,
  py('round(value * 0.3, 2)', salary) AS tax,
  py('len(text.split())', bio) AS bio_words
FROM employees;
THEN PY — pandas pipeline
SELECT * FROM sales
THEN PY '
  df["margin"] = df["revenue"] - df["cost"]
  df[df["margin"] > 0].nlargest(10, "margin")
'
THEN SUMMARIZE(description);

Create reusable Python functions:

CREATE PYTHON FUNCTION median(vals DOUBLE)
  RETURNS DOUBLE SHAPE AGGREGATE
  AS $$ sorted(vals)[len(vals)//2] $$;  -- upper median for even-length groups

SELECT dept, median(salary) FROM emp GROUP BY dept;

The web is a database. Query it.

Scrape web pages, search the internet, parse RSS feeds, or connect any API — all from SQL. Results compose with semantic operators and JOIN with your own data. No ETL. No scripts. No new tools.

Built-in web operators

WEB · SCRAPE · RSS · WEB_SEARCH · WEB_MAP · WEB_EXTRACT

API federation via MCP

Connect any MCP server. Its tools become SQL functions.

GitHub · Slack · Linear · Notion · Google Drive · any MCP server

Autonomous web research

Describe what you need. The agent navigates and extracts into your schema.

SELECT * FROM WEB_AGENT(
  'YC W24 AI companies',
  '{company, website, stars}'
);
federated.sql
-- Scrape competitor pricing, analyze semantically
WITH competitor AS (
  SELECT * FROM WEB(
    'https://competitor.com/pricing',
    tier VARCHAR, price DECIMAL,
    features TEXT
  )
)
SELECT
  tier, price,
  features MEANS 'AI-powered' AS has_ai
FROM competitor
WHERE price < 100;
→ Scraped, parsed, and semantically filtered — in one query.
mcp.sql
-- Connect a service
CREATE MCP SERVER github
  COMMAND 'npx'
  ARGS '-y @anthropic/github-mcp';

-- Discover its tools with semantic search
SELECT tool_name, description
FROM   mcp_server_tools('github')
WHERE  description MEANS
         'find bugs';

-- Query it — with semantic SQL
SELECT title, assignee,
  CLASSIFY(body,
    'bug, feature, support') AS type,
  SENTIMENT(body) AS frustration
FROM   github.issues(
         'myorg/api', 'open')
WHERE  body MEANS
         'performance regression'
ORDER BY frustration ASC;
→ Connect → discover → query. Semantic operators compose across any data source.

Your data never moves

DataRabbit speaks the PostgreSQL wire protocol. Connect from psql, DBeaver, DataGrip, Tableau, or anything that talks to Postgres. Queries run where your data lives.

01

Bring your databases

PostgreSQL, MySQL, Snowflake, BigQuery, ClickHouse, S3, Parquet. Data never leaves your infrastructure.

02

Write semantic SQL

Use your existing SQL client. Add operators like MEANS, CLASSIFY, or SUMMARIZE alongside standard SQL.

03

LLM once, SQL forever

The system fingerprints data shapes — not individual values. A million phone numbers might have 10 formats. 10 LLM calls generate SQL expressions. The expressions run on every future row. No LLM needed.
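The "shapes, not values" idea is easy to picture: collapse each value to a format fingerprint, and a million rows reduce to a handful of distinct shapes. A hypothetical sketch (the real fingerprinting is surely richer than this):

```python
def shape_of(value):
    """Collapse a value to its format fingerprint: digits become 'D',
    letters become 'A', punctuation and spaces pass through."""
    return "".join(
        "D" if c.isdigit() else "A" if c.isalpha() else c for c in value)

numbers = ["(212) 555-0147", "(415) 555-0199", "212-555-0147",
           "+1 212 555 0147", "(646) 555-0123"]
shapes = {shape_of(n) for n in numbers}
# five values collapse to three shapes, so a normalizer expression
# is generated (and cached) once per shape, not once per row
```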

§ The zoo

The right model for each job.

Most operators aren't LLM calls. DataRabbit picks the fastest, smallest, deterministic-when-possible model for every task — and reaches for an LLM only when reasoning is actually the point.

TabPFN v2

tabular
CREATE MODEL · predict_* · score_*

One forward pass, no fit loop, no hyperparameter search. Training-data hash tracks drift.

Chronos-Bolt

time series
FORECAST

Amazon zero-shot forecasting. Quantile bands, composable with GROUP BY.

SigLIP 2

vision
IMAGE_MATCHES · IMAGE_SIMILARITY · IMAGE_EMBED

Shared image/text space, L2-normalized cosine.
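The match score behind a shared embedding space is plain cosine similarity between two L2-normalized vectors. A toy sketch with made-up 3-dimensional vectors (real SigLIP embeddings are high-dimensional, and the threshold semantics here are an assumption):

```python
import math

def cosine(u, v):
    """Cosine similarity of two embeddings: dot product over the
    product of L2 norms, always in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

image_vec = [0.9, 0.1, 0.2]    # stand-in for an image embedding
prompt_vec = [0.8, 0.2, 0.1]   # stand-in for the text-prompt embedding
score = cosine(image_vec, prompt_vec)
# a MATCHES-style predicate then reduces to: score >= threshold
```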

Whisper v3 Turbo

speech
TRANSCRIBE

Chunked inference. Hour-long audio out of the box.

Document AI stack

PDFs → columns
parse_document

Donut, LayoutLMv3, Nougat, GOT-OCR2. Zero-shot invoice, receipt, and contract parsing.

DeBERTa-v3 NLI

closed-set classify
CLASSIFY · SEMANTIC_CASE · SEMANTIC_SWITCH

~10ms/row, zero hallucination, pick-from-set by design.

bge-m3 + reranker

retrieval
EMBED · VECTOR_SEARCH · OUTLIERS

SOTA dense retrieval plus cross-encoder re-ranking.

Rebel-large

open IE
RELATIONS

(head, type, tail) triples without LLM hallucination.

Deterministic libs

parse + phonetics
SOUNDS_LIKE · PARSE_EMAIL · PARSE_PHONE · PARSE_ADDRESS · LOOKS_LIKE

jellyfish, phonenumbers, usaddress. Microseconds, zero cost.
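Phonetic matching of the SOUNDS_LIKE kind is deterministic string code, not inference. A minimal classic-Soundex sketch in the spirit of jellyfish (illustrative only; the product presumably calls jellyfish's implementations, not this one):

```python
def soundex(name):
    """Classic Soundex: keep the first letter, map the rest to digit
    classes, skip vowels (which break runs of the same code) and H/W
    (which don't), drop repeats, pad or trim to 4 characters."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    name = name.upper()
    out, prev = name[0], codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out += code
        if ch not in "HW":  # H and W are transparent: they keep `prev`
            prev = code
    return (out + "000")[:4]
```

Runs in microseconds per value, which is why no model is involved.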

PyMC + arviz

Bayesian inference
BAYESIAN_LIFT · BAYESIAN_EFFECT · CREDIBLE_INTERVAL

Posterior distributions, credible intervals. Experiment reports in one SELECT.

EconML + lifelines

causal + survival
TREATMENT_EFFECT · EXPECTED_LIFETIME · KAPLAN_MEIER · COX_HAZARD

Doubly-robust causal estimation, Kaplan-Meier, Cox proportional hazards. Serious stats, SQL-native.

LLMs (fast · quality)

when reasoning is the point
MEANS · CONDENSE · ASK · GOLDEN_RECORD · STEELMAN

Show up only when the task genuinely needs judgment.

Most operators skip the LLM entirely — so your inference lanes stay fast for the ones that actually need reasoning.

Built for production, not demos

LLM outputs are non-deterministic. DataRabbit has first-class primitives to make them reliable, auditable, and cache-friendly so a million rows reach for the LLM ten times, not a million.

Takes

parallel(A, B, C) → pick(best)

Run N model variations in parallel. An evaluator picks the best. No serial retry loops.

Shape compiler

1M rows → 10 shapes → 10 calls

Fingerprints data structures, generates SQL expressions, caches the code not the values. A million rows, ~10 LLM calls. Re-runs hit the cache — no new inference, no new latency.

Query memory

search("churn by region") → 3 hits

Every query is embedded and summarized. Search past analyses by meaning. 'Did anyone look at churn by region?' surfaces the answer — and the SQL.

Few-shot training

flag(correct) → auto-example

Flag any operator result as 'correct.' Future calls automatically include your validated examples. No ML pipeline. No fine-tuning. Just click 'this was right.'

Your team already knows SQL.
Now it knows AI.

Every tier is unlimited. Paid plans buy more parallel inference lanes and higher priority — never query counts. No credit card. If you already have a SQL client open, you're ready.

No credit counter. No token meter. Fire off queries freely — the shape compiler caches repeat work, so curiosity doesn't cost extra.

Works with PostgreSQL, MySQL, Snowflake, BigQuery, ClickHouse, S3, Parquet. Connects via pgwire — use psql, DBeaver, DataGrip, Tableau, or any Postgres-compatible client.