Filter by meaning, not keywords.
MEANS, CLASSIFY, and SUMMARIZE work like native SQL. Connect any database and query from DBeaver, DataGrip, Tableau, or psql. No Python required. No notebooks. No new tools.
SELECT company, region,
SENTIMENT(last_ticket) AS mood
FROM support_tickets
WHERE last_ticket MEANS
'thinking about switching vendors'
ORDER BY mood ASC;

| company | region | mood |
|---|---|---|
| Northwind Health | East | -0.86 |
| Aster Labs | West | -0.73 |
| Harbor Retail | Central | -0.61 |
Start with a single semantic filter. By end of week, you're running verified AI aggregations across your entire warehouse.
-- Point at your database
CREATE CONNECTION prod
TYPE postgres HOST 'db.company.com'
DATABASE 'analytics';
-- Query it with meaning
SELECT *
FROM support_tickets
WHERE body MEANS
'billing dispute';

SELECT
TOPICS(tweet, 4) AS topic,
CLASSIFY(tweet,
'political, not-political')
AS political,
COUNT(*) AS tweets,
AVG(likes) AS avg_likes
FROM twitter_archive
GROUP BY topic, political;

CREATE MODEL churn_risk
FROM (SELECT tenure, monthly_charges,
contract, has_dependents,
is_churned
FROM customers
WHERE created_at < '2026-01-01')
TARGET is_churned
TYPE classifier;
-- predict_churn_risk is now a scalar SQL function
SELECT account_name,
predict_churn_risk(tenure, monthly_charges,
contract, has_dependents)
AS risk
FROM customer_accounts;

THEN chains stages that operate on the whole result table. Profile, pivot, chart — no DataFrames, no notebooks, no imports.
SELECT region, product, SUM(revenue) AS revenue
FROM sales
WHERE created_at >= '2026-01-01'
GROUP BY region, product
THEN STATS -- profile every column
THEN PIVOT 'revenue by region' -- smart cross-tab
THEN TO_VEGALITE 'bar chart by product';

Every stage chains
Schema-fingerprinted. Same table shape + same prompt → instant cache hit. The LLM only runs the first time any shape sees the query.
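For example, the cache is transparent — there is nothing to configure; re-issuing a query of the same shape simply skips the model (illustrative reuse of the MEANS operator shown above):

```sql
-- First execution: the LLM plans the semantic filter
SELECT * FROM support_tickets
WHERE body MEANS 'refund request';

-- Later run, same table shape + same prompt:
-- served from the fingerprint cache, no LLM call
SELECT * FROM support_tickets
WHERE body MEANS 'refund request';
```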
Filtering, classification, forecasting, training, vision, speech, knowledge graphs — all callable from SQL.
Entity resolution. Dedupe across sources. Fuzzy customer reconciliation. The last leg of the semantic trilogy is a SQL join condition that understands what your columns actually mean.
Backed by the same cross-encoder reranker that powers OUTLIERS. Every join pair gets a calibrated relevance score; threshold is tunable per query. No nested-loop embedding hacks.
-- Match support tickets to warehouse customers
-- when there's no account ID to join on
SELECT c.company, t.subject,
SENTIMENT(t.body) AS mood
FROM customers c
SEMANTIC JOIN support_tickets t
ON c.description ~ t.reported_by_blurb
WHERE t.body MEANS 'churn risk'
ORDER BY mood ASC;

Define custom AI functions and reusable pipeline stages in SQL DDL — no Python, no deploy pipeline. CREATE it, then SELECT with it or reuse it in THEN chains.
Not a fixed catalog
The same system defines row operators, THEN pipeline stages, and action workflows. DataRabbit is a mutable operator surface, not a hard-coded list of AI functions.
CREATE SEMANTIC OPERATOR
compliance_check(text VARCHAR)
RETURNS VARCHAR
PROMPT 'Check if this text violates
our compliance policy.
Return: pass, warn, or fail.
Text: {{ input.text }}';

Then use it immediately:
SELECT clause, compliance_check(clause)
FROM contracts
WHERE compliance_check(clause)
MEANS 'fail';

Audio columns. Image columns. PDF columns. Unstructured text. All first-class citizens of the query plan.
SELECT call_id, agent,
TRANSCRIBE(recording) AS text,
SENTIMENT(
TRANSCRIBE(recording)
) AS mood,
CLASSIFY(TRANSCRIBE(recording),
'refund, renewal, tech')
AS topic
FROM support_calls;

Whisper large-v3-turbo. Chunked inference. Hour-long recordings compose with every text operator.
SELECT product_id,
photo IMAGE_MATCHES
'contains a person'
AS has_model,
photo IMAGE_MATCHES
'damaged or returned'
AS is_damaged
FROM catalog
WHERE photo IMAGE_MATCHES
'red sports car';

SigLIP 2 shared image/text space. Threshold-tunable, inline WHERE/SELECT predicates, no sidecar service.
-- Invoices → columns, no keying
SELECT d.vendor, d.total,
d.issue_date, d.line_items
FROM s3('s3://invoices/') files
CROSS JOIN LATERAL
parse_document(files.contents) d
WHERE d.total > 1000
ORDER BY d.issue_date DESC;

Donut, LayoutLMv3, Nougat, GOT-OCR2 — document foundation models. PDFs are columns. Invoices, receipts, contracts, forms.
-- extract + materialize a graph
SELECT * FROM emails,
LATERAL rich_triples_rows(body) t
THEN TO_PROPERTY_GRAPH('org');
-- then traverse it
WITH RECURSIVE walk AS (
SELECT source_id, target_id,
ARRAY[predicate] AS path
FROM org_edges
WHERE source_id = 'Jane'
UNION ALL …
) SELECT * FROM walk;

RICH_TRIPLES extracts (subject, predicate, object, evidence). TO_PROPERTY_GRAPH materializes relational node/edge tables. No graph DB.
Bayesian experimentation. Causal inference from observational data. Survival analysis. All as SQL operators — not as a Python notebook your PM can't read.
-- Weekly: experiments with >95% P(lift)
SELECT experiment_id, variant,
bayesian_lift(control, treatment)
AS p_lift,
bayesian_effect(control, treatment)
AS expected_effect,
credible_interval(control,
treatment, 0.95) AS ci_95
FROM experiment_stats
WHERE bayesian_lift(control, treatment)
> 0.95;

PyMC + ArviZ. Posterior distributions, credible intervals. Your weekly experimentation report is one query.
-- Did the discount CAUSE retention,
-- or were we picking loyal customers?
SELECT treatment_effect(
outcome => retained_90d,
treatment => got_discount,
covariates => [plan, tenure,
arr, industry]
) AS ate
FROM customers
WHERE eligible_for_study;

EconML / DoWhy doubly-robust estimation. Propensity scoring + outcome modeling, covariates as an array.
-- Remaining subscription life, by plan
SELECT plan,
AVG(expected_lifetime(
tenure_days, churned))
AS avg_remaining_days,
kaplan_meier(tenure_days,
churned, 0.5)
AS median_lifetime
FROM customers
GROUP BY plan
ORDER BY avg_remaining_days DESC;

lifelines. Kaplan-Meier curves, Cox proportional hazards, Weibull AFT. "When will this customer churn?" as a GROUP BY.
Inline Python expressions, pandas pipelines, and NumPy — right inside your SQL. Chain with AI operators for transforms no single language can do alone.
SQL alone
SELECT state, count(*) FROM sightings GROUP BY state

+ Python
... THEN PY 'df.nlargest(5, "count")'

+ AI
... THEN ANALYZE('why these states?')

SELECT name, salary,
py('round(value * 0.3, 2)', salary) AS tax,
py('len(text.split())', bio) AS bio_words
FROM employees;

SELECT * FROM sales
THEN PY '
df["margin"] = df["revenue"] - df["cost"]
df[df["margin"] > 0].nlargest(10, "margin")
'
THEN SUMMARIZE(description);

Create reusable Python functions:
CREATE PYTHON FUNCTION median(vals DOUBLE)
RETURNS DOUBLE SHAPE AGGREGATE
AS $$ sorted(vals)[len(vals)//2] $$;
SELECT dept, median(salary) FROM emp GROUP BY dept;

Scrape web pages, search the internet, parse RSS feeds, or connect any API — all from SQL. Results compose with semantic operators and JOIN with your own data. No ETL. No scripts. No new tools.
Connect any MCP server. Its tools become SQL functions.
Describe what you need. The agent navigates and extracts into your schema.
SELECT * FROM WEB_AGENT(
'YC W24 AI companies',
'{company, website, stars}'
);

-- Scrape competitor pricing, analyze semantically
WITH competitor AS (
SELECT * FROM WEB(
'https://competitor.com/pricing',
tier VARCHAR, price DECIMAL,
features TEXT
)
)
SELECT
tier, price,
features MEANS 'AI-powered' AS has_ai
FROM competitor
WHERE price < 100;

-- Connect a service
CREATE MCP SERVER github
COMMAND 'npx'
ARGS '-y @anthropic/github-mcp';
-- Discover its tools with semantic search
SELECT tool_name, description
FROM mcp_server_tools('github')
WHERE description MEANS
'find bugs';
-- Query it — with semantic SQL
SELECT title, assignee,
CLASSIFY(body,
'bug, feature, support') AS type,
SENTIMENT(body) AS frustration
FROM github.issues(
'myorg/api', 'open')
WHERE body MEANS
'performance regression'
ORDER BY frustration ASC;

DataRabbit speaks the PostgreSQL wire protocol. Connect from psql, DBeaver, DataGrip, Tableau, or anything that talks to Postgres. Queries run where your data lives.
PostgreSQL, MySQL, Snowflake, BigQuery, ClickHouse, S3, Parquet. Data never leaves your infrastructure.
Use your existing SQL client. Add operators like MEANS, CLASSIFY, or SUMMARIZE alongside standard SQL.
The system fingerprints data shapes — not individual values. A million phone numbers might have 10 formats. 10 LLM calls generate SQL expressions. The expressions run on every future row. No LLM needed.
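Concretely, the phone-number case might play out like this sketch — the CLEAN operator and the contacts table are illustrative placeholders, not documented API:

```sql
-- Run 1: the LLM inspects the ~10 distinct phone formats in the
-- column and emits one SQL normalization expression per format.
SELECT CLEAN('format as E.164 phone number', phone) AS phone_e164
FROM contacts;

-- Every later run over the same column shape reuses those
-- compiled expressions — pure SQL, zero LLM calls.
```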
Most operators aren't LLM calls. DataRabbit picks the fastest, smallest, deterministic-when-possible model for every task — and reaches for an LLM only when reasoning is actually the point.
One forward pass, no fit loop, no hyperparameter search. Training-data hash tracks drift.
Amazon zero-shot forecasting. Quantile bands, composable with GROUP BY.
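A forecasting call could look like the sketch below; the FORECAST name and signature are assumptions inferred from this description, not documented syntax:

```sql
-- Hypothetical signature: series column + horizon in days,
-- returning quantile bands per group
SELECT region,
       FORECAST(daily_revenue, 30) AS next_30_days
FROM revenue_by_day
GROUP BY region;
```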
Shared image/text space, L2-normalized cosine.
Chunked inference. Hour-long audio out of the box.
Donut, LayoutLMv3, Nougat, GOT-OCR2. Zero-shot invoice, receipt, and contract parsing.
~10ms/row, zero hallucination, pick-from-set by design.
SOTA dense retrieval plus cross-encoder re-ranking.
(head, type, tail) triples without LLM hallucination.
jellyfish, phonenumbers, usaddress. Microseconds, zero cost.
Posterior distributions, credible intervals. Experiment reports in one SELECT.
Doubly-robust causal estimation, Kaplan-Meier, Cox proportional hazards. Serious stats, SQL-native.
Show up only when the task genuinely needs judgment.
Most operators skip the LLM entirely — so your inference lanes stay fast for the ones that actually need reasoning.
LLM outputs are non-deterministic. DataRabbit has first-class primitives to make them reliable, auditable, and cache-friendly so a million rows reach for the LLM ten times, not a million.
Run N model variations in parallel. An evaluator picks the best. No serial retry loops.
Fingerprints data structures, generates SQL expressions, and caches the code, not the values. A million rows, ~10 LLM calls. Re-runs hit the cache — no new inference, no new latency.
Every query is embedded and summarized. Search past analyses by meaning. 'Did anyone look at churn by region?' surfaces the answer — and the SQL.
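Searching that memory might look like this — the query_history view and its column names are assumptions for illustration:

```sql
SELECT executed_at, summary, sql_text
FROM query_history
WHERE summary MEANS 'churn by region'
ORDER BY executed_at DESC;
```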
Flag any operator result as 'correct.' Future calls automatically include your validated examples. No ML pipeline. No fine-tuning. Just click 'this was right.'
Every tier is unlimited. Paid plans buy more parallel inference lanes and higher priority — never query counts. No credit card. If you already have a SQL client open, you're ready.
No credit counter. No token meter. Fire off queries freely — the shape compiler caches repeat work, so curiosity doesn't cost extra.
Works with PostgreSQL, MySQL, Snowflake, BigQuery, ClickHouse, S3, Parquet. Connects via pgwire — use psql, DBeaver, DataGrip, Tableau, or any Postgres-compatible client.