Data Profiling

Data profiling is the foundation of the Data Governance solution. It analyzes connected databases to produce statistical profiles of tables and columns, identifying data quality issues, completeness gaps, and structural anomalies.

What Data Profiling Captures

Column-Level Metrics

For each column in a profiled table, the system computes:

Metric	Description
Completeness	Percentage of non-null values
Uniqueness	Ratio of distinct values to total rows
Data type distribution	Actual types vs declared column type
Min / Max / Mean	Statistical bounds for numeric columns
Standard deviation	Spread of numeric values
Pattern analysis	Common formats (email, phone, date patterns)
Top N values	Most frequent values and their counts
Outlier detection	Values more than 2 standard deviations from the mean
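As a rough illustration of how several of these metrics relate to each other, the following sketch computes them for an in-memory list of values. It is not the platform's implementation: the function name `profile_column`, its parameters, the choice of population standard deviation, and the decision to leave nulls out of the distinct count are all assumptions made for the example.

```python
import statistics
from collections import Counter


def profile_column(values, top_n=3, z_threshold=2.0):
    """Compute a few column-level metrics for a list of values (None = null).

    Hypothetical helper for illustration only.
    """
    total = len(values)
    non_null = [v for v in values if v is not None]
    numeric = [
        v for v in non_null
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]

    profile = {
        # Completeness: share of non-null values
        "completeness": len(non_null) / total if total else 0.0,
        # Uniqueness: distinct non-null values over total rows (assumption)
        "uniqueness": len(set(non_null)) / total if total else 0.0,
        # Top N values with their counts
        "top_values": Counter(non_null).most_common(top_n),
    }

    if len(numeric) >= 2:
        mean = statistics.mean(numeric)
        stdev = statistics.pstdev(numeric)  # population standard deviation
        profile.update(min=min(numeric), max=max(numeric), mean=mean, stdev=stdev)
        # Outliers: values more than z_threshold standard deviations from the mean
        profile["outliers"] = [
            v for v in numeric if stdev and abs(v - mean) > z_threshold * stdev
        ]
    return profile
```

For example, `profile_column([10, 12, 11, 13, 12, None, 11, 12, 10, 100])` reports a completeness of 0.9 and flags `100` as the only outlier.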

Table-Level Metrics

For each table, the system computes:

Metric	Description
Row count	Total number of records
Column count	Number of columns
Completeness score	Average column completeness
Freshness	Last update timestamp
Primary key analysis	Key uniqueness and completeness
Foreign key validation	Referential integrity checks
Duplicate detection	Potential duplicate rows
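The table-level metrics can be sketched in the same spirit. The example below works on rows held in memory as dictionaries and treats only exact matches across every column as duplicates. The function name `summarize_table` and the output keys are hypothetical; the platform may define duplicates and key checks differently.

```python
from collections import Counter


def summarize_table(rows, primary_key):
    """Summarize a table given as a list of dicts with identical keys.

    Hypothetical helper for illustration only.
    """
    columns = list(rows[0].keys()) if rows else []
    row_count = len(rows)

    # Per-column completeness, then the table score as their average
    completeness = {
        col: sum(r[col] is not None for r in rows) / row_count
        for col in columns
    }
    score = sum(completeness.values()) / len(columns) if columns else 0.0

    # Exact duplicates: rows with identical values in every column
    row_counts = Counter(tuple(r[col] for col in columns) for r in rows)
    duplicate_rows = sum(n - 1 for n in row_counts.values() if n > 1)

    # Primary key analysis: the key should be fully populated and unique
    keys = [r[primary_key] for r in rows]
    non_null_keys = [k for k in keys if k is not None]

    return {
        "row_count": row_count,
        "column_count": len(columns),
        "completeness_score": score,
        "duplicate_rows": duplicate_rows,
        "pk_complete": len(non_null_keys) == row_count,
        "pk_unique": len(set(non_null_keys)) == len(non_null_keys),
    }
```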

Profiling Workflow

Data profiling is orchestrated as a Prefect workflow:

Select data assets

An administrator or data steward selects the tables and schemas to profile via the Data Governance API.

Profiling workflow starts

The Prefect flow launches with ConcurrentTaskRunner for parallel table processing. Each table is profiled as an independent task.

Column analysis

For each table, the workflow queries the database to compute column-level statistics. For large tables, the queries use sampling to limit the load on the source.

Results stored

Profiling results are stored in the Data Governance database (datareadiness) with timestamps for historical tracking.

Scorecard generated

A data quality scorecard summarizes the profiling results across all tables, highlighting issues that need attention.
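The final step can be pictured as a small aggregation over per-table results. The sketch below assumes each table profile has the shape shown under Profiling Results (`table`, `columns`, `quality_score`). The function name `build_scorecard`, the 0.9 completeness threshold, and the output keys are hypothetical.

```python
def build_scorecard(profiles, min_completeness=0.9):
    """Summarize a list of table profiles and list columns needing attention.

    Hypothetical helper; the threshold is an illustrative default.
    """
    issues = []
    for profile in profiles:
        for column in profile["columns"]:
            if column["completeness"] < min_completeness:
                issues.append(
                    {
                        "table": profile["table"],
                        "column": column["name"],
                        "completeness": column["completeness"],
                    }
                )

    # Least complete columns first, so the worst gaps appear at the top
    issues.sort(key=lambda issue: issue["completeness"])

    scores = [p["quality_score"] for p in profiles]
    return {
        "tables_profiled": len(profiles),
        "average_quality_score": sum(scores) / len(scores) if scores else None,
        "issues": issues,
    }
```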

Concurrency Control

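Profiling workflows limit concurrency at two levels so that large databases are processed efficiently without overwhelming the source. The flow's task runner caps how many tables are profiled in parallel, and a semaphore inside each table task caps how many column analyses run at once: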

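import asyncio

from prefect import flow, task
from prefect.task_runners import ConcurrentTaskRunner

@flow(task_runner=ConcurrentTaskRunner(max_workers=10))
async def profile_database(tables: list[str]):
    # Process up to 10 tables concurrently
    tasks = [profile_table(table) for table in tables]
    await asyncio.gather(*tasks)

@task
async def profile_table(table: str):
    # Within each table, run up to 20 column analyses concurrently
    semaphore = asyncio.Semaphore(20)

    async def analyze_with_limit(column: str):
        async with semaphore:
            await analyze_column(column)

    # list_columns is a placeholder for however the column names are looked up
    columns = await list_columns(table)
    await asyncio.gather(*(analyze_with_limit(column) for column in columns))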

Level	Mechanism	Configuration
Flow-level	`ConcurrentTaskRunner(max_workers=N)`	`config.yaml`
Task-level	`asyncio.Semaphore(N)`	`config.yaml`

Adjust the max_workers and semaphore limits based on your source database capacity. Start with conservative values and increase based on observed database load.
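When choosing limits, it helps to remember that the two levels multiply. If each table task holds its own semaphore, the source can see up to `max_workers` × semaphore limit queries at once (10 × 20 = 200 with the values above). The following sketch reproduces the two-level pattern with plain `asyncio` and no Prefect, and records the peak concurrency actually reached. The name `run_profile`, the small limits, and the `asyncio.sleep` stand-in for a query are all assumptions for the demonstration.

```python
import asyncio


async def run_profile(tables, max_tables=2, max_columns=3):
    """Simulate two-level concurrency; tables maps table name -> column names."""
    table_limit = asyncio.Semaphore(max_tables)
    stats = {"tables": 0, "peak_tables": 0, "columns": 0, "peak_columns": 0}

    async def analyze_column(column):
        stats["columns"] += 1
        stats["peak_columns"] = max(stats["peak_columns"], stats["columns"])
        await asyncio.sleep(0.01)  # stand-in for a profiling query
        stats["columns"] -= 1

    async def profile_table(columns):
        async with table_limit:  # outer limit: tables in flight
            stats["tables"] += 1
            stats["peak_tables"] = max(stats["peak_tables"], stats["tables"])
            column_limit = asyncio.Semaphore(max_columns)  # inner, per-table limit

            async def limited(column):
                async with column_limit:
                    await analyze_column(column)

            await asyncio.gather(*(limited(c) for c in columns))
            stats["tables"] -= 1

    await asyncio.gather(*(profile_table(cols) for cols in tables.values()))
    return stats


if __name__ == "__main__":
    demo = {f"table_{i}": [f"col_{j}" for j in range(10)] for i in range(5)}
    print(asyncio.run(run_profile(demo)))
```

With these limits, the peak number of concurrent column analyses never exceeds `max_tables * max_columns`.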

Integration with Data Connections

Profiling uses data connections registered in the platform’s Assets service:

The data steward selects a registered data connection
The profiling workflow retrieves connection credentials from the Secrets API at runtime
A read-only database session is established for profiling queries
All profiling queries use the read-only user configured during data source setup

Profiling queries can be resource-intensive on large tables. Schedule profiling runs during off-peak hours or configure sampling thresholds for tables with millions of rows.
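One way to think about a sampling threshold is as a rule that maps a table's row count to the fraction of rows to read. The sketch below is only an illustration: the function name `sample_fraction`, the one-million-row threshold, and the 100,000-row target are assumptions, not platform defaults.

```python
def sample_fraction(row_count, threshold=1_000_000, target_rows=100_000):
    """Return the fraction of rows to sample; 1.0 means scan the full table.

    Hypothetical rule: tables at or below the threshold are read in full,
    larger tables are sampled down to roughly target_rows rows.
    """
    if row_count <= threshold:
        return 1.0
    return target_rows / row_count


if __name__ == "__main__":
    for rows in (150_000, 5_000_000, 80_000_000):
        print(rows, f"{sample_fraction(rows):.4%}")
```

How the fraction is applied depends on the source database. PostgreSQL, for instance, accepts a percentage in a `TABLESAMPLE SYSTEM (...)` clause, while other engines use different syntax.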

Profiling Results

Profiling results are accessible via the Data Governance API:

# Get profiling results for a table
curl -X GET \
  "https://<host>:8001/data-readiness/data-assets/<asset-id>/profile" \
  -H "Authorization: Bearer $TOKEN" \
  -H "X-Project-ID: <project-id>"

An example response:

{
  "table": "customers",
  "profiled_at": "2026-04-03T10:00:00Z",
  "row_count": 150000,
  "columns": [
    {
      "name": "email",
      "completeness": 0.98,
      "uniqueness": 0.97,
      "pattern": "email",
      "top_values": ["gmail.com", "outlook.com", "yahoo.com"]
    },
    {
      "name": "phone",
      "completeness": 0.72,
      "uniqueness": 0.95,
      "pattern": "phone_us",
      "null_count": 42000
    }
  ],
  "quality_score": 0.85
}
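The same request can be made from Python using only the standard library. The sketch below mirrors the curl example and then ranks columns by completeness. The helper names (`build_profile_request`, `fetch_profile`, `least_complete_columns`) are hypothetical, and the host, asset ID, token, and project ID must be supplied by the caller.

```python
import json
import urllib.request


def build_profile_request(host, asset_id, token, project_id):
    """Build the same GET request as the curl example (nothing is sent here)."""
    url = f"https://{host}:8001/data-readiness/data-assets/{asset_id}/profile"
    return urllib.request.Request(
        url,
        headers={
            "Authorization": f"Bearer {token}",
            "X-Project-ID": project_id,
        },
    )


def fetch_profile(request):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def least_complete_columns(profile, limit=5):
    """Return (name, completeness) pairs, least complete first."""
    ranked = sorted(profile["columns"], key=lambda c: c["completeness"])
    return [(c["name"], c["completeness"]) for c in ranked[:limit]]
```

Applied to the example response above, `least_complete_columns` lists `phone` (0.72) ahead of `email` (0.98).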

Scheduling Profiles

Profiling can be scheduled for recurring execution via the platform’s scheduling system:

Daily profiles: Track data quality trends over time
Weekly profiles: Standard cadence for most datasets
On-demand: Triggered manually for new data sources or after schema changes

Next Steps

Data Enrichment

Enrich metadata after profiling with LLM-powered descriptions.

Workflows

Learn about the Prefect workflow orchestration system.

Data Source Setup

Connect a database to start profiling.

Data Classification

Understand how profiling data is classified and protected.
