
Overview

The Data Pipeline is the backbone of Trendteller’s analytics platform, transforming raw e-commerce data into analytics-ready tables using Dataform and Google BigQuery.

Technology Stack

Dataform

Data transformation and orchestration platform

BigQuery

Google’s serverless data warehouse

SQLX

SQL with templating and JavaScript

Git-based

Version-controlled transformations

Medallion Architecture

The pipeline implements a three-layer medallion architecture:
1. Bronze Layer

Raw data ingestion from Airbyte
  • Tables prefixed with bronze_
  • Maintains source format and structure
  • Includes metadata (ingestion timestamp, source brand)
  • Full historical data retained
bronze_bling_orders
bronze_vnda_products
bronze_tiny_customers
2. Silver Layer

Standardized and cleansed data
  • Tables prefixed with silver_
  • Common schema across all brands
  • Data quality rules applied
  • Deduplication and validation
silver_orders      -- All brands unified
silver_products    -- Standardized catalog
silver_customers   -- Consolidated customers
3. Gold Layer

Analytics-ready aggregations
  • Tables prefixed with gold_
  • Pre-aggregated metrics
  • Optimized for query performance
  • Business logic applied
gold_daily_sales
gold_product_performance
gold_customer_lifetime_value
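
As an illustration, a gold-layer aggregation such as gold_daily_sales could be defined along these lines. This is a minimal sketch: the grain and column names (sale_date, order_count, total_revenue) are assumptions, not the production definition.

config {
  type: "table",
  schema: "gold",
  tags: ["sales", "daily"]
}

-- Daily revenue and order counts per brand (illustrative columns)
SELECT
  DATE(o.order_date) AS sale_date,
  o.brand_id,
  COUNT(DISTINCT o.order_id) AS order_count,
  SUM(o.total_amount) AS total_revenue
FROM ${ref("silver_orders")} o
GROUP BY sale_date, o.brand_id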

Data Models

Core Tables

Core tables are grouped into three domains: Transactional, Master Data, and Analytics. The transactional tables are:

Orders - All order transactions across brands
  • order_id, brand_id, customer_id
  • total_amount, status, created_at
  • items[], payments[], shipping
Order Items - Individual line items
  • item_id, order_id, product_id
  • quantity, unit_price, discount
Payments - Payment transactions
  • payment_id, order_id, method
  • amount, status, processed_at
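
To make the relationships between these tables concrete, the sketch below reconciles order totals against line items; silver_order_items and the exact column names are assumptions based on the fields listed above.

-- Reconcile order totals against line items (illustrative table and column names)
SELECT
  o.order_id,
  o.brand_id,
  o.total_amount,
  SUM(i.quantity * i.unit_price - i.discount) AS items_total
FROM silver_orders o
JOIN silver_order_items i
  ON i.order_id = o.order_id
GROUP BY o.order_id, o.brand_id, o.total_amount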

Brand Integrations

Trendteller supports 11 brands across 9 e-commerce platforms:
  • Bling: ERP integration (3 brands). Data: orders, products, customers, inventory, invoices. Sync: incremental (every 6 hours)
  • VNDA: Fashion platform (2 brands). Data: orders, products, categories, variants. Sync: real-time webhooks + daily batch
  • Shoppub: Marketplace integration (2 brands)
  • Tiny: ERP integration (1 brand)
  • Microvix: Retail management (1 brand)
  • Braavo: E-commerce (1 brand)
  • JetERP: Enterprise system (1 brand)

Transformation Logic

SQLX Templates

Dataform uses SQLX for SQL transformations with JavaScript templating:
config {
  type: "table",
  schema: "silver",
  tags: ["orders", "daily"]
}

SELECT
  o.order_id,
  o.brand_id,
  b.brand_name,
  o.customer_id,
  o.total_amount,
  o.order_date,
  ${helpers.calculateDiscount('o.total_amount', 'o.discount_percent')} as discount_amount,
  CURRENT_TIMESTAMP() as processed_at
FROM ${ref("bronze_orders")} o
LEFT JOIN ${ref("dim_brands")} b
  ON o.brand_id = b.brand_id
WHERE o.order_date >= CURRENT_DATE() - 90
The ${ref()} function ensures table dependencies are tracked and executed in the correct order.
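
The helpers.calculateDiscount call in the example comes from a JavaScript include. The real implementation is not shown in this document; a minimal sketch of what such a helper could look like (includes/helpers.js is a hypothetical path) is:

// includes/helpers.js (hypothetical) -- returns a SQL expression string
// that Dataform splices into the compiled query
function calculateDiscount(amountColumn, percentColumn) {
  return `ROUND(${amountColumn} * (${percentColumn} / 100), 2)`;
}

module.exports = { calculateDiscount };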

Data Quality Checks

Built-in assertions ensure data quality:
config {
  type: "assertion",
  tags: ["quality_check"]
}

-- Ensure no negative order amounts
SELECT *
FROM ${ref("silver_orders")}
WHERE total_amount < 0
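
In addition to standalone assertion files, assertions can be declared directly in a table's config block so they run whenever the table is built. A hedged sketch for silver_orders (the column choices are assumptions):

config {
  type: "table",
  schema: "silver",
  assertions: {
    uniqueKey: ["order_id"],
    nonNull: ["order_id", "brand_id", "total_amount"],
    rowConditions: ["total_amount >= 0"]
  }
}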

Pipeline Orchestration

Execution Schedule

1. Incremental Updates

Every 6 hours, incremental transformations process new data (see the incremental model sketch after this list):
  • Bronze → Silver (standardization)
  • Silver → Gold (aggregation)
2. Daily Full Refresh

Once daily at 2 AM (BRT), a full refresh runs for:
  • Historical aggregations
  • Cross-brand metrics
  • Customer lifetime value
3. Manual Triggers

On-demand execution via:
  • Dataform CLI
  • GitHub Actions
  • Dataform Web UI
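
The incremental updates in step 1 rely on Dataform's incremental table type, which appends only new rows on scheduled runs. A minimal sketch, reusing the table and column names from the earlier examples (the exact watermark filter is an assumption):

config {
  type: "incremental",
  schema: "silver",
  tags: ["orders", "daily"]
}

SELECT
  o.order_id,
  o.brand_id,
  o.total_amount,
  o.order_date,
  CURRENT_TIMESTAMP() AS processed_at
FROM ${ref("bronze_orders")} o
-- On incremental runs, only pick up rows newer than what the target table already holds
${when(incremental(), `WHERE o.order_date > (SELECT MAX(order_date) FROM ${self()})`)}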

Dependency Management

Dataform automatically resolves table dependencies:
bronze_orders → silver_orders → gold_daily_sales
                              ↘ gold_customer_segments

Data Warehouse Configuration

BigQuery Setup

Project: togo-425319
Location: southamerica-east1 (São Paulo)
Environment: Production
Datasets:
  • bronze - Raw data from sources
  • silver - Standardized tables
  • gold - Analytics-ready views
  • metadata - Pipeline metadata
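
For orientation, queries against these datasets use the fully qualified project.dataset.table name; the columns below are hypothetical and only illustrate the layout:

-- Last week of daily sales from the gold dataset (illustrative columns)
SELECT sale_date, brand_id, total_revenue
FROM `togo-425319.gold.gold_daily_sales`
WHERE sale_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
ORDER BY sale_date DESC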

Performance Optimization

Partitioning

Tables partitioned by date for faster queries

Clustering

Key columns clustered for query optimization

Materialization

Incremental models reduce compute costs

Caching

Query results cached for repeated access
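
Partitioning, clustering, and incremental materialization can all be expressed in a table's config block. The sketch below shows one way this might look; the column choices are assumptions based on the fields listed earlier. BigQuery's result caching needs no configuration here, since it applies automatically to repeated identical queries.

config {
  type: "incremental",
  schema: "gold",
  bigquery: {
    partitionBy: "sale_date",
    clusterBy: ["brand_id", "customer_id"]
  }
}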

Monitoring & Observability

Pipeline Metrics

  • Execution duration and success rate
  • Data volume processed (rows/GB)
  • Data freshness by table
  • Error rates and failure reasons

Alerting

Automated alerts for:
  • Pipeline execution failures
  • Data quality assertion failures
  • Stale data (exceeds freshness threshold)
  • Resource quota warnings
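
As one way to implement the stale-data alert, an assertion can flag a table whose latest processed_at falls outside its freshness window. The six-hour threshold below mirrors the incremental schedule; the actual per-table thresholds are assumptions.

config {
  type: "assertion",
  tags: ["quality_check", "freshness"]
}

-- Fails (returns a row) when silver_orders has not been refreshed in the last 6 hours
SELECT
  MAX(processed_at) AS last_processed_at
FROM ${ref("silver_orders")}
HAVING MAX(processed_at) < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 6 HOUR)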

Development Workflow

Local Development

cd dataform
npm install

# Compile SQLX to SQL
node compile.js

# Validate transformations
dataform compile

# Dry run (no data modification)
dataform run --dry-run

# Execute locally (requires credentials)
dataform run --tags daily

CI/CD Pipeline

# On pull request
- Compile and validate SQL
- Run data quality tests
- Generate documentation

# On merge to main
- Deploy to production
- Execute transformations
- Update documentation

Next Steps