Data Profiling

Data profiling is the foundation of the Data Governance solution. It analyzes connected databases to produce statistical profiles of tables and columns, identifying data quality issues, completeness gaps, and structural anomalies.

What Data Profiling Captures

For each column in a profiled table, the system computes:

| Metric | Description |
| --- | --- |
| Completeness | Percentage of non-null values |
| Uniqueness | Ratio of distinct values to total rows |
| Data type distribution | Actual types vs declared column type |
| Min / Max / Mean | Statistical bounds for numeric columns |
| Standard deviation | Spread of numeric values |
| Pattern analysis | Common formats (email, phone, date patterns) |
| Top N values | Most frequent values and their counts |
| Outlier detection | Values beyond 2 standard deviations |
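
To make the metrics above concrete, here is a minimal in-memory sketch of how several of them could be computed for a single column. It is an illustration only: the function name `profile_column`, its parameters, and the use of `None` to represent nulls are assumptions, and the platform computes these statistics with database queries rather than in Python.

```python
import statistics
from collections import Counter


def profile_column(values, top_n=3, z_threshold=2.0):
    """Compute basic profile metrics for one column (None marks a null)."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    profile = {
        # Share of rows that hold a value
        "completeness": len(non_null) / total if total else 0.0,
        # Distinct non-null values relative to total rows
        "uniqueness": len(set(non_null)) / total if total else 0.0,
        # Most frequent values with their counts
        "top_values": Counter(non_null).most_common(top_n),
    }
    numeric = [
        v for v in non_null
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    if len(numeric) >= 2:
        mean = statistics.mean(numeric)
        stdev = statistics.stdev(numeric)  # sample standard deviation
        profile.update(
            min=min(numeric),
            max=max(numeric),
            mean=mean,
            stdev=stdev,
            # Values more than z_threshold standard deviations from the mean
            outliers=[v for v in numeric if abs(v - mean) > z_threshold * stdev],
        )
    return profile


result = profile_column([10, 10, 11, 9, 10, 12, 8, 10, 100, None])
print(result["completeness"], result["uniqueness"], result["outliers"])
```

In this sample, one null out of ten rows gives a completeness of 0.9, and the value 100 is the only one flagged as an outlier.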

Profiling Workflow

Data profiling is orchestrated as a Prefect workflow:
  1. Select data assets: An administrator or data steward selects the tables and schemas to profile via the Data Governance API.
  2. Profiling workflow starts: The Prefect flow launches with ConcurrentTaskRunner for parallel table processing. Each table is profiled as an independent task.
  3. Column analysis: For each table, the workflow queries the database to compute column-level statistics. Queries are optimized using sampling for large tables.
  4. Results stored: Profiling results are stored in the Data Governance database (datareadiness) with timestamps for historical tracking.
  5. Scorecard generated: A data quality scorecard summarizes the profiling results across all tables, highlighting issues that need attention.
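
The final step, rolling column profiles up into a scorecard, could look roughly like the sketch below. The scoring rule (mean completeness per table), the `min_completeness` threshold, and the `build_scorecard` name are all assumptions made for illustration; this page does not describe how the platform actually computes its scores.

```python
from statistics import mean


def build_scorecard(profiles, min_completeness=0.9):
    """Summarize per-table column profiles into a simple scorecard.

    `profiles` maps a table name to a list of {"name", "completeness"} dicts.
    The table score is the mean completeness, an illustrative choice only.
    """
    scorecard = {"tables": {}, "issues": []}
    for table, columns in profiles.items():
        scores = [c["completeness"] for c in columns]
        scorecard["tables"][table] = round(mean(scores), 2) if scores else None
        for c in columns:
            if c["completeness"] < min_completeness:
                scorecard["issues"].append((table, c["name"], c["completeness"]))
    # Worst columns first so the largest gaps appear at the top
    scorecard["issues"].sort(key=lambda issue: issue[2])
    return scorecard


card = build_scorecard({
    "customers": [
        {"name": "email", "completeness": 0.98},
        {"name": "phone", "completeness": 0.72},
    ],
    "orders": [{"name": "id", "completeness": 1.0}],
})
print(card["issues"])
```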

Concurrency Control

Profiling workflows limit concurrency at two levels, across tables and across the columns within each table, so that large databases are processed efficiently without overwhelming the source:
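import asyncio

from prefect import flow, task
from prefect.task_runners import ConcurrentTaskRunner

@flow(task_runner=ConcurrentTaskRunner(max_workers=10))
async def profile_database(tables: list[str]):
    # Start one profile_table task per table and wait for all of them
    await asyncio.gather(*(profile_table(table) for table in tables))

@task
async def profile_table(table: str):
    # Within each table, run up to 20 column analyses concurrently
    semaphore = asyncio.Semaphore(20)

    async def analyze_with_limit(column: str):
        async with semaphore:
            await analyze_column(column)

    # get_columns is a hypothetical helper that returns the table's column names
    columns = await get_columns(table)
    await asyncio.gather(*(analyze_with_limit(column) for column in columns))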

| Level | Mechanism | Configuration |
| --- | --- | --- |
| Flow-level | ConcurrentTaskRunner(max_workers=N) | config.yaml |
| Task-level | asyncio.Semaphore(N) | config.yaml |

Adjust the max_workers and semaphore limits based on your source database capacity. Start with conservative values and increase based on observed database load.
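
The two-level pattern can be tried out without Prefect or a database. The sketch below uses plain `asyncio` semaphores for both levels and records the peak number of tables in flight, so the effect of a limit can be observed directly. The names `run_profiling`, `max_tables`, and `max_columns` are hypothetical, and `analyze_column` is a stand-in that only sleeps.

```python
import asyncio


async def analyze_column(table, column):
    """Stand-in for a real column-level profiling query."""
    await asyncio.sleep(0.01)
    return (table, column)


async def profile_table(table, columns, table_slots, max_columns, stats):
    async with table_slots:  # outer cap: tables in flight
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        column_slots = asyncio.Semaphore(max_columns)  # inner cap, per table

        async def limited(column):
            async with column_slots:
                return await analyze_column(table, column)

        try:
            return await asyncio.gather(*(limited(c) for c in columns))
        finally:
            stats["active"] -= 1


async def run_profiling(tables, max_tables=10, max_columns=20):
    """Profile every table; return results and the peak number of tables in flight."""
    table_slots = asyncio.Semaphore(max_tables)
    stats = {"active": 0, "peak": 0}
    results = await asyncio.gather(
        *(profile_table(t, cols, table_slots, max_columns, stats)
          for t, cols in tables.items())
    )
    return results, stats["peak"]


demo_tables = {f"t{i}": ["a", "b", "c"] for i in range(6)}
demo_results, demo_peak = asyncio.run(
    run_profiling(demo_tables, max_tables=2, max_columns=2)
)
print(demo_peak)
```

If both caps apply as described, limits of 10 tables and 20 columns per table allow up to 10 × 20 = 200 column analyses in flight at once, which is a useful number to compare against the source database's connection capacity.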

Integration with Data Connections

Profiling uses data connections registered in the platform’s Assets service:
  1. The data steward selects a registered data connection
  2. The profiling workflow retrieves connection credentials from the Secrets API at runtime
  3. A read-only database session is established for profiling queries
  4. All profiling queries use the read-only user configured during data source setup
Profiling queries can be resource-intensive on large tables. Schedule profiling runs during off-peak hours or configure sampling thresholds for tables with millions of rows.

Profiling Results

Profiling results are accessible via the Data Governance API:
# Get profiling results for a table
curl -X GET \
  "https://<host>:8001/data-readiness/data-assets/<asset-id>/profile" \
  -H "Authorization: Bearer $TOKEN" \
  -H "X-Project-ID: <project-id>"
An example response:
{
  "table": "customers",
  "profiled_at": "2026-04-03T10:00:00Z",
  "row_count": 150000,
  "columns": [
    {
      "name": "email",
      "completeness": 0.98,
      "uniqueness": 0.97,
      "pattern": "email",
      "top_values": ["gmail.com", "outlook.com", "yahoo.com"]
    },
    {
      "name": "phone",
      "completeness": 0.72,
      "uniqueness": 0.95,
      "pattern": "phone_us",
      "null_count": 42000
    }
  ],
  "quality_score": 0.85
}
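
A client might post-process a response like this to surface the columns that need attention. The sketch below parses a trimmed copy of the example payload and lists columns whose completeness falls below a threshold. The 0.9 threshold and the `flag_incomplete_columns` helper are assumptions for illustration, not part of the API.

```python
import json


def flag_incomplete_columns(profile, threshold=0.9):
    """Return (column, completeness) pairs below the threshold, worst first."""
    flagged = [
        (c["name"], c["completeness"])
        for c in profile["columns"]
        if c["completeness"] < threshold
    ]
    return sorted(flagged, key=lambda item: item[1])


# Trimmed copy of the example response above
response_body = """
{
  "table": "customers",
  "row_count": 150000,
  "columns": [
    {"name": "email", "completeness": 0.98, "uniqueness": 0.97},
    {"name": "phone", "completeness": 0.72, "uniqueness": 0.95}
  ],
  "quality_score": 0.85
}
"""

profile = json.loads(response_body)
flagged = flag_incomplete_columns(profile)
print(flagged)
```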

Scheduling Profiles

Profiling can be scheduled for recurring execution via the platform’s scheduling system:
  • Daily profiles: Track data quality trends over time
  • Weekly profiles: Standard cadence for most datasets
  • On-demand: Triggered manually for new data sources or after schema changes

Next Steps

  • Data Enrichment: Enrich metadata after profiling with LLM-powered descriptions.
  • Workflows: Learn about the Prefect workflow orchestration system.
  • Data Source Setup: Connect a database to start profiling.
  • Data Classification: Understand how profiling data is classified and protected.