The manual gap between Airbyte/Fivetran and dbt that every data engineer fills by hand, every single time
VP of Sales says "I need HubSpot data in Looker by Friday." The junior DE enables Airbyte's HubSpot connector. It syncs 47 streams into BigQuery, creating tables like:
raw_hubspot.deals raw_hubspot.deals_properties_hs_deal_amount raw_hubspot.deals_associations_company_ids raw_hubspot.contacts raw_hubspot.contacts_properties_hs_email ... (42 more tables)
Now the DE needs to turn these into clean dbt staging models. This means:
deals_properties_hs_deal_amount to deal_amountSUM(amount) fails on strings)This takes 4-8 hours of manual work. Every time. For every new source.
This was the only scenario where severity dropped from 9 to 7 with tooling. In all other scenarios (schema drift, performance, migration), severity stayed at 8-9 regardless of tools. This means the problem is genuinely solvable with automation.
-- Manual: 4-8 hours per source
-- Step 1: Figure out which tables exist
SELECT table_name FROM
information_schema.tables
WHERE dataset_id = 'raw_hubspot';
-- (returns 47 tables, most unknown)
-- Step 2: Inspect each table schema
-- Step 3: Guess which columns matter
-- Step 4: Write staging model by hand
SELECT
id AS deal_id,
-- Wait, is it 'amount' or
-- 'properties_hs_deal_amount'?
-- And it's a STRING??
CAST(properties_hs_deal_amount
AS NUMERIC) AS deal_amount,
-- How do I get owner name...?
properties_hubspot_owner_id
AS owner_id -- shows "12345"
FROM raw_hubspot.deals
-- Auto-generated in 30 seconds
-- Source: raw_hubspot.deals
-- Relevant streams detected: 5 of 47
SELECT
id AS deal_id,
CAST(properties_hs_deal_amount
AS NUMERIC) AS deal_amount,
properties_dealstage AS deal_stage,
properties_closedate AS closed_at,
o.owner_name,
a.company_id
FROM raw_hubspot.deals d
LEFT JOIN raw_hubspot.owners o
ON d.properties_hubspot_owner_id
= o.owner_id
LEFT JOIN raw_hubspot
.deals_associations_company_ids a
ON d.id = a.deal_id
-- schema.yml also generated with:
-- column descriptions
-- not_null tests on deal_id
-- accepted_values on deal_stage
-- positive value test on deal_amount
Airbyte's job ends at "data landed in warehouse." It doesn't know what dbt needs.
dbt's job starts at "here's my source table." It doesn't know what Airbyte created.
The gap in between — understanding the raw schema, cleaning names, fixing types, adding joins, generating tests — is 100% manual. This is true for Fivetran too.
| Existing Tool | What It Does | What It Doesn't Do |
|---|---|---|
| Airbyte | Syncs data to warehouse | Doesn't generate dbt models |
| Fivetran dbt packages | Pre-built models for some sources | Only covers ~30 sources; doesn't handle custom properties |
| dbt codegen | Generates base model from source | No smart renaming, no type detection, no join inference |
| Source-to-Staging Bridge | Reads raw schema + sample data → generates complete staging model with smart naming, type casting, join inference, and tests |
New raw table detected in BigQuery/Snowflake landing schema (via information_schema polling or event trigger).
deals_properties_hs_deal_amount → deal_amount| Tier | Price | Features |
|---|---|---|
| Free (Agent Skill) | $0 | Generate staging models on-demand via CLI |
| Pro | $49/mo | Continuous monitoring for new tables + auto-PR |
| Team | $199/mo | Multi-source, naming convention enforcement, cost estimation |
| Enterprise | $499/mo | Custom connectors, PII detection, compliance templates |