1. Data Governance Fundamentals

What Data Governance Covers

Data governance is the system of policies, processes, and roles that defines who is accountable for data assets, how data is used, and how it is protected throughout its lifecycle.

  • Inventory: Discovering and cataloging all data assets — databases, file shares, cloud storage, SaaS applications, data warehouses, analytics platforms
  • Classification: Labeling data by sensitivity and regulatory category — determines how it must be protected, retained, and handled
  • Ownership: Assigning business owners accountable for data quality, access decisions, and retention for each data asset
  • Quality: Defining standards for accuracy, completeness, consistency — critical for AI/ML training data governance
  • Lineage: Tracking where data originated and how it transformed through pipelines — required for GDPR right to erasure and change impact analysis
  • Access Control: Governing who can access which data under what conditions — implemented through IAM policies, database permissions, and data masking
  • Retention: Defining how long data must be kept (regulatory minimums) and when it must be deleted (privacy regulations)
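The pillars above can be sketched as a single governed-asset record. This is a minimal illustration, not any vendor's schema — the field names (`owner`, `steward`, `retention_years`) and the gap checks are assumptions chosen to mirror the list above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DataAsset:
    """One inventoried data asset with its core governance attributes."""
    name: str                                # e.g. "crm.customers"
    location: str                            # system or store where it lives
    classification: str                      # Public | Internal | Confidential | Restricted
    owner: str                               # accountable business owner
    steward: str                             # day-to-day governance contact
    retention_years: Optional[int] = None    # regulatory or business retention period
    contains_pii: bool = False

def governance_gaps(asset: DataAsset) -> list[str]:
    """Flag missing governance attributes for an inventoried asset."""
    gaps = []
    if not asset.owner:
        gaps.append("no accountable owner")
    if asset.classification not in {"Public", "Internal", "Confidential", "Restricted"}:
        gaps.append("unclassified or invalid tier")
    if asset.contains_pii and asset.retention_years is None:
        gaps.append("PII without a retention schedule")
    return gaps
```

Running `governance_gaps` across the full inventory gives a simple coverage metric for the governance council: the fraction of assets with no owner, no classification, or PII lacking a retention schedule.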

Governance Council Structure

Effective data governance requires clear accountability at multiple organizational levels. The governance council coordinates policy; execution happens in the business and IT teams.

  • Data Owners (Business): Senior business leaders accountable for data assets within their domain — customer data is typically owned by the VP of Sales/Marketing, financial data by the CFO. Owners make access decisions, define retention, and approve data sharing.
  • Data Stewards (Operational): Day-to-day data quality and governance practitioners within business units. Apply classifications, enforce access policies, manage data quality issues, run access reviews for their domain.
  • Data Custodians (IT/Security): Implement technical controls mandated by owners and stewards. Database admins, cloud architects, security team. Set up access controls, encryption, backup policies, audit logging.
  • Chief Data Officer (CDO): Executive accountable for enterprise data strategy, governance program, and regulatory data compliance. Chairs the data governance council.
  • Governance Council: Meets monthly to review policy exceptions, approve new data-sharing arrangements, track compliance metrics, and resolve ownership disputes

Governance Failures & EU AI Act

Poor data governance creates compounding security and compliance risk. The EU AI Act has introduced new data governance requirements that make these failures even more costly.

  • Shadow IT Data Warehouses: Business teams create their own copies of data in unauthorized tools (Airtable, Google Sheets, personal S3 buckets) — invisible to governance controls
  • Ungoverned Data Lakes: S3 buckets or ADLS Gen2 containers filled with "raw" data with no ownership, no classification, no access review — the data swamp problem
  • GDPR Fines from Data Ignorance: Organizations cannot fulfill right-to-erasure requests or data access requests if they don't know where personal data lives. Meta has received fines exceeding €1.2 billion for data governance failures.
  • EU AI Act Requirements: High-risk AI systems must have documented training data governance including data sources, data collection methodology, bias testing, and data quality measures. Ungoverned training data is a regulatory violation for AI systems.
  • Orphaned Datasets: Data assets where the original owner has left the organization. No one understands the data, no one approves access — becomes either over-accessible or inaccessibly locked.

2. Data Classification

Classification Tiers

A four-tier classification model is the most widely adopted approach. Each tier carries different handling requirements that are enforced through technical controls.

  • Public: Information approved for public release — press releases, public website content, published research. Minimal handling requirements; cannot cause harm if disclosed.
  • Internal: General business information not intended for external parties — internal procedures, org charts, non-sensitive emails. Standard access controls; no special handling beyond organizational boundary.
  • Confidential: Business-sensitive information that would cause material harm if disclosed — financial projections, HR records, customer contracts, merger discussions. Strong access controls, audit logging, encryption at rest required.
  • Restricted / Secret: Highest sensitivity — PII, PHI, payment card data, trade secrets, legal privileged communications, security vulnerability details. Most restrictive access, mandatory encryption, DLP controls, detailed audit trail.
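The tier-to-controls mapping can be expressed directly in code so that downstream systems enforce it consistently. A minimal sketch — the control names and values here are illustrative examples, not a standard:

```python
# Minimum handling controls per classification tier (illustrative values).
HANDLING = {
    "Public":       {"encryption": "in transit only",      "audit": False, "dlp": False},
    "Internal":     {"encryption": "in transit",           "audit": False, "dlp": True},
    "Confidential": {"encryption": "at rest + in transit", "audit": True,  "dlp": True},
    "Restricted":   {"encryption": "at rest + in transit", "audit": True,  "dlp": True,
                     "mfa": True},
}

def required_controls(tier: str) -> dict:
    """Look up minimum controls for a tier; unknown labels fail closed."""
    try:
        return HANDLING[tier]
    except KeyError:
        # Fail closed: treat unknown or missing labels as most sensitive.
        return HANDLING["Restricted"]
```

Failing closed on an unknown label is a deliberate design choice: mislabeled data gets over-protected rather than exposed.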

Automated Classification Tools

Manual classification at enterprise scale is impossible. Automated tools scan data repositories, identify sensitive content patterns, and apply or recommend classification labels.

  • Microsoft Purview Information Protection (MIP): Native Microsoft solution with sensitivity labels that integrate across Microsoft 365 (Exchange, SharePoint, Teams, OneDrive). Auto-labeling policies scan content and apply labels based on trainable classifiers and sensitive info types.
  • Nightfall AI: API-based sensitive data discovery and classification using ML models for context-aware detection (not just regex). Integrates with Slack, GitHub, Google Drive, Jira, Confluence.
  • BigID: Enterprise data intelligence platform — discovers, classifies, and catalogs personal data across all enterprise data sources for privacy compliance. Strong for GDPR/CCPA data mapping.
  • Varonis: Data access governance with classification built in — discovers sensitive data and flags who has access to it versus who should.
  • Classification decisions drive DLP policy enforcement, encryption requirements, access control decisions, and data retention schedules — classification is the foundation, not an end in itself
Classification | Examples | Access Control | Encryption | DLP Policy | Retention
Public | Press releases, public docs | All — no authentication | In transit only | None | Indefinite or as needed
Internal | Internal procedures, org charts | All employees (authenticated) | In transit; at rest optional | Block external email | 3-7 years or business need
Confidential | Financial data, HR records, contracts | Role-based, need-to-know | At rest + in transit mandatory | Block uploads, watermark | 7 years (finance), varies
Restricted | PII, PHI, PAN, trade secrets | Minimum necessary, MFA required | AES-256 mandatory + key management | Block all external movement | Regulatory minimum then delete
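The core idea behind automated classification can be shown in a few lines. This is a toy pattern-based sketch: the real tools above use trained classifiers and contextual signals, not bare regexes, and the tier-assignment rules here are assumptions.

```python
import re

# Toy sensitive-content patterns (real tools use ML, not just regex).
PATTERNS = {
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "pan":   re.compile(r"\b(?:\d[ -]?){13,16}\b"),  # rough card-number shape
}

def suggest_classification(text: str) -> str:
    """Recommend a tier from detected content; a human steward confirms."""
    hits = {name for name, pat in PATTERNS.items() if pat.search(text)}
    if hits & {"ssn", "pan"}:
        return "Restricted"
    if "email" in hits:
        return "Confidential"
    return "Internal"
```

Note the output is a recommendation, not a final label — consistent with how Purview-style auto-labeling pairs scanning with steward review.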

3. Data Catalog & Lineage

Data Catalog

A data catalog is a searchable inventory of all data assets in the organization — databases, tables, columns, files, APIs, and their metadata. It answers the question "what data do we have and what does it mean?"

  • Alation: Enterprise data catalog with collaboration features. Business users can find and understand data assets, add annotations, and request access. Integrates with most data platforms and BI tools.
  • Collibra: Data intelligence platform with strong policy management, stewardship workflows, and compliance features. Preferred for highly regulated industries (finance, healthcare).
  • Apache Atlas: Open-source metadata management and data governance framework from the Hadoop ecosystem. Integrates natively with Hive, HBase, Kafka, Spark. Self-hosted; requires significant operational investment.
  • AWS Glue Data Catalog: Managed metadata repository for AWS data lake. Automatically populated by Glue crawlers that scan S3 buckets. Used by Athena, EMR, and Redshift Spectrum for query planning.
  • DataHub (LinkedIn OSS): Modern open-source metadata platform with strong lineage capabilities. REST API for programmatic metadata ingestion. Active community with broad connector support.
  • Catalog ROI: Data scientists are commonly cited as spending the majority of their time finding, understanding, and preparing data rather than analyzing it. A catalog cuts discovery from days to hours.
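Stripped of connectors and workflows, the core catalog operation is a metadata search. A minimal in-memory sketch — the entries and field names are made up for illustration; real catalogs like Alation or DataHub add lineage, ownership workflows, and source connectors:

```python
# Toy catalog: a list of asset metadata records.
catalog = [
    {"name": "crm.customers", "description": "Customer master records",
     "tags": ["pii", "restricted"], "owner": "vp.sales"},
    {"name": "finance.gl_entries", "description": "General ledger entries",
     "tags": ["sox", "confidential"], "owner": "cfo"},
]

def search(query: str) -> list[dict]:
    """Case-insensitive match over name, description, and tags."""
    q = query.lower()
    return [a for a in catalog
            if q in a["name"].lower()
            or q in a["description"].lower()
            or any(q in t for t in a["tags"])]
```

Searching by tag ("pii") or by business term ("ledger") is exactly the "what data do we have and what does it mean" question the catalog exists to answer.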

Data Lineage

Data lineage tracks the origin, movement, and transformation of data through pipelines and systems. It answers "where did this data come from?" and "what systems are downstream of this data source?"

  • GDPR Right to Erasure: To comply with a deletion request, you must know every system where personal data from that subject exists — including copies in analytics pipelines, backups, and derived datasets. Lineage makes this tractable.
  • Column-Level Lineage: Tracks individual column transformations through SQL transformations and data pipelines. Essential for compliance: if SSN from CRM flows into 5 analytics tables, you need to erase all 5.
  • Impact Analysis: Before changing a source table schema, lineage shows which downstream dashboards, ML models, and reports will break. Prevents unplanned outages.
  • dbt (data build tool): Modern SQL transformation framework that automatically generates column-level lineage as part of the build process. Integrates with DataHub, Amundsen, OpenMetadata.
  • OpenLineage: Open standard (a Linux Foundation LF AI & Data project) for collecting and representing lineage metadata across data pipeline tools. Implemented by Spark, Flink, dbt, Airflow, and 40+ tools.
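Once lineage edges exist, the erasure question becomes a graph traversal: find every column reachable from the source. A sketch with a hypothetical hand-written lineage map — in practice the edges would come from dbt artifacts or OpenLineage events:

```python
from collections import deque

# Hypothetical column-level lineage: source column -> downstream columns.
LINEAGE = {
    "crm.customers.ssn":    ["dwh.dim_customer.ssn", "risk.scores.ssn_hash"],
    "dwh.dim_customer.ssn": ["reports.kyc_export.ssn"],
    "risk.scores.ssn_hash": [],
    "reports.kyc_export.ssn": [],
}

def downstream(column: str) -> set[str]:
    """All columns reachable from `column` — every copy an erasure must hit."""
    seen, queue = set(), deque([column])
    while queue:
        cur = queue.popleft()
        for nxt in LINEAGE.get(cur, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

For the SSN example above, the traversal returns all three downstream copies — the complete erasure target list for that column.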

4. Data Access Governance

Database Activity Monitoring (DAM)

DAM platforms monitor and audit all database activity — who accessed what data, when, from which application, using which query — in real time. Essential for detecting insider threats and unauthorized data access.

  • Imperva Data Security: Network-based and agent-based monitoring across Oracle, SQL Server, MySQL, PostgreSQL, MongoDB. Real-time alerting on sensitive data access, mass SELECT, schema changes, and privileged user activity.
  • IBM Guardium: Enterprise DAM with strong compliance reporting for SOX, PCI DSS, HIPAA. Behavioral analytics to detect unusual access patterns. Data discovery and classification built in.
  • Native database audit logging: Most databases ship native audit capabilities (SQL Server Audit, Oracle Unified Auditing, MySQL Audit Plugin, PostgreSQL pgaudit) — enable them and forward the logs to your SIEM.
  • Key alerts: after-hours bulk data export, access to tables outside normal role, access to data by terminated employee accounts, DDL changes (DROP TABLE, ALTER TABLE) in production
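One of the key alerts above — after-hours bulk export of sensitive data — reduces to a simple rule over audit-log records. A sketch with assumed field names (`timestamp`, `rows_returned`, `classification`) and thresholds; a real DAM platform adds behavioral baselining on top of static rules like this:

```python
from datetime import datetime

BUSINESS_HOURS = range(8, 19)      # 08:00-18:59 local (example window)
BULK_ROW_THRESHOLD = 100_000       # example threshold

def is_suspicious(event: dict) -> bool:
    """Flag after-hours bulk reads of sensitive tables."""
    ts = datetime.fromisoformat(event["timestamp"])
    after_hours = ts.hour not in BUSINESS_HOURS
    bulk = event.get("rows_returned", 0) >= BULK_ROW_THRESHOLD
    sensitive = event.get("classification") in {"Confidential", "Restricted"}
    return after_hours and bulk and sensitive
```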

Fine-Grained Data Access Controls

Modern data platforms support fine-grained access controls below the table level — column-level security restricts access to sensitive fields, row-level security restricts access to relevant records.

  • BigQuery Column-Level Security: Tag sensitive columns with policy tags from a taxonomy. IAM bindings on the tags restrict which principals can read column values — queries that select a restricted column fail with an access-denied error unless the principal has fine-grained read access (data masking rules can return masked values instead).
  • BigQuery Row-Level Security: Row Access Policies filter query results so users only see rows matching their access conditions. A regional sales manager sees only their region's data from the same table.
  • Snowflake Dynamic Data Masking: Masking policies on columns return different values based on the querying role. Role "analyst" sees last 4 digits of SSN; role "hr_admin" sees full SSN. Applied at query time, not storage time.
  • PostgreSQL Row Security: Row-level security policies using CREATE POLICY. Enable with ALTER TABLE ... ENABLE ROW LEVEL SECURITY. Policies filter rows based on session variable or role.
  • Column-level encryption at the application layer (the application encrypts PII before storing it, so only ciphertext reaches the database) provides even stronger protection — DAM and RLS cannot protect against a DBA dumping the raw tables
Platform | Column-Level Security | Row-Level Security | Dynamic Masking | Audit Logging
BigQuery | Yes (taxonomy tags + IAM) | Yes (row access policies) | Yes (data masking rules) | Yes (Cloud Audit Logs)
Snowflake | Yes (column masking policies) | Yes (row access policies) | Yes (dynamic masking) | Yes (QUERY_HISTORY, ACCESS_HISTORY)
Databricks | Yes (Unity Catalog) | Yes (row filters) | Yes (column masks) | Yes (Unity Catalog audit)
PostgreSQL | No native (use views/grants) | Yes (CREATE POLICY) | No native | Yes (pgaudit extension)
SQL Server | Yes (column permissions) | Yes (RLS) | Yes (Dynamic Data Masking) | Yes (SQL Server Audit)
Amazon Redshift | Yes (column-level grants) | Yes (RLS) | Yes (dynamic data masking) | Yes (Redshift Audit Logging)
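The role-based masking these platforms apply at query time can be illustrated in plain code. This mirrors the SSN example above; the role names and mask formats are illustrative, not any platform's syntax:

```python
def mask_ssn(value: str, role: str) -> str:
    """Return a role-appropriate view of an SSN, applied at read time."""
    if role == "hr_admin":
        return value                       # privileged role sees full value
    if role == "analyst":
        return "***-**-" + value[-4:]      # last four digits only
    return "***-**-****"                   # everyone else: fully masked

def apply_masking(row: dict, role: str) -> dict:
    """Mask sensitive fields in a result row; storage is never modified."""
    masked = dict(row)
    if "ssn" in masked:
        masked["ssn"] = mask_ssn(masked["ssn"], role)
    return masked
```

The key property — same stored value, different views per role, nothing rewritten on disk — is what distinguishes dynamic masking from column-level encryption.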

5. Data Retention & Deletion

Retention Schedules by Data Type

Retention schedules balance regulatory minimum retention requirements against privacy regulations that mandate deletion. Where these conflict, documented legal analysis determines the applicable standard.

  • Financial Records: IRS record-retention requirements vary by situation; 7 years is common practice. SOX: 7 years for audit workpapers. FCPA: 5 years for books and records.
  • HR Records: EEOC regulations: 1 year post-termination for most records. ADA/FMLA: 3 years. DOL records: 3 years. General HR files: 7 years post-employment is common practice.
  • Medical / PHI: HIPAA: 6 years from creation or last effective date. State laws may extend to 10+ years for medical records. Minor patients: until age of majority plus the state statutory period.
  • Security Logs: PCI DSS: 1 year minimum (3 months immediately available). SOC 2: varies by control objective. NIST 800-92: based on risk assessment. Best practice: 1 year online, 2 years archived.
  • Email: Varies by content classification. Legal hold overrides all schedules. Otherwise: 3-7 years for business correspondence, delete based on policy for personal use.
  • GDPR Conflict: Right to erasure requests may conflict with mandatory retention. Resolution: retain for legal minimum, then delete; document the legal basis for retention in privacy notices.
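The "retain for the legal minimum, then delete" resolution, with a legal-hold override, can be computed mechanically. A sketch — the function name and the hold-as-boolean model are simplifying assumptions:

```python
from datetime import date
from typing import Optional

def deletion_due(created: date, retention_years: int, legal_hold: bool) -> Optional[date]:
    """Earliest permissible deletion date: retain for the legal minimum,
    then delete — unless a legal hold pauses the schedule entirely."""
    if legal_hold:
        return None  # litigation hold overrides every retention schedule
    try:
        return created.replace(year=created.year + retention_years)
    except ValueError:
        # Feb 29 created date landing in a non-leap year: roll to Mar 1.
        return date(created.year + retention_years, 3, 1)
```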

Automated Retention Policies

Manual deletion processes are unreliable. Automated lifecycle policies in cloud platforms ensure data is deleted on schedule without human intervention.

  • S3 Lifecycle Policies: Transition objects between storage classes (S3 Standard → Glacier → Deep Archive) and expire objects after a defined number of days. Apply to buckets or by object prefix/tag.
  • Azure Blob Lifecycle Management: Policy-based tiering and deletion rules. Can be scoped to containers or blob prefixes. Immutability policies can prevent premature deletion for regulatory retention.
  • Google Cloud Storage Object Lifecycle: Rules that delete or change storage class based on age, version count, or creation date.
  • Database Partition Pruning: Time-partitioned tables in BigQuery, Snowflake, PostgreSQL TimescaleDB allow dropping old partitions as a single metadata operation rather than row-by-row deletion.
  • Legal Hold: Automated retention policies must be overridable by legal hold — a flag that pauses deletion for specific records under litigation hold. Systems must support this without manual intervention for every record.
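An S3 lifecycle policy expressing the tiering-then-expiry pattern above looks like this. The bucket name, prefix, and day counts are examples; the commented boto3 call would apply it for real:

```python
# import boto3  # uncomment to actually apply the policy

# Transition logs to colder tiers, then delete after one year (example values).
lifecycle = {
    "Rules": [{
        "ID": "logs-tier-and-expire",
        "Status": "Enabled",
        "Filter": {"Prefix": "logs/"},
        "Transitions": [
            {"Days": 30,  "StorageClass": "GLACIER"},
            {"Days": 180, "StorageClass": "DEEP_ARCHIVE"},
        ],
        "Expiration": {"Days": 365},   # automatic deletion on schedule
    }]
}

# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-audit-logs", LifecycleConfiguration=lifecycle)
```

Because the policy lives on the bucket, deletion happens on schedule with no human in the loop — the property the section above calls out as the whole point.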

Secure Deletion Standards

Deleting a file or record does not guarantee it cannot be recovered. Secure deletion requires appropriate technique matched to the media type and threat model.

  • NIST SP 800-88 Clear: Overwrite addressable storage locations with a standard pattern. Appropriate for magnetic media that will remain in organizational control. SDelete (Sysinternals) on Windows, shred on Linux.
  • NIST SP 800-88 Purge: Cryptographic erase (for SSDs and flash) or degaussing (for magnetic). Protects against laboratory attacks. Required before media reuse outside the organization.
  • NIST SP 800-88 Destroy: Physical destruction — shredding, incinerating, disintegrating. Required for highest-sensitivity media or when Purge cannot be applied. Certificate of destruction provided by vendor.
  • Cryptographic Erasure: For cloud environments — encrypt the data with a dedicated key, then delete the key. The encrypted data becomes unrecoverable without the key. AWS KMS key deletion, Azure Key Vault key deletion. Most efficient for large-scale cloud data deletion.
  • Cloud data deletion verification: after deletion, request a certificate of destruction from the cloud provider for compliance evidence. Providers have processes for this in regulated industries.
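The cryptographic-erasure idea — delete the key and every copy of the ciphertext becomes unrecoverable — can be demonstrated in miniature. This is a toy: the XOR keystream cipher below is for illustration only; real systems use AES via a KMS/HSM, and the in-memory key store stands in for a service like AWS KMS:

```python
import hashlib
import secrets

key_store: dict[str, bytes] = {}   # stand-in for a KMS

def _keystream(key: bytes, n: int) -> bytes:
    """Derive n pseudo-random bytes from the key (toy construction)."""
    out, counter = b"", 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]

def encrypt(dataset_id: str, plaintext: bytes) -> bytes:
    key = key_store.setdefault(dataset_id, secrets.token_bytes(32))
    return bytes(a ^ b for a, b in zip(plaintext, _keystream(key, len(plaintext))))

def decrypt(dataset_id: str, ciphertext: bytes) -> bytes:
    key = key_store[dataset_id]    # raises KeyError once the key is erased
    return bytes(a ^ b for a, b in zip(ciphertext, _keystream(key, len(ciphertext))))

def crypto_erase(dataset_id: str) -> None:
    del key_store[dataset_id]      # all ciphertext is now unrecoverable
```

The efficiency argument is visible here: erasure deletes one small key, not terabytes of data scattered across replicas and backups.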

Data You Don't Need is Liability, Not Asset

Every dataset your organization retains beyond its useful life is a liability: it may contain PII subject to privacy regulations, it is discoverable in litigation, it is a potential target in a data breach, and it costs money to store and govern. A documented and automated data deletion program — classifying data, assigning retention schedules, and deleting on schedule — reduces your breach impact, simplifies regulatory compliance responses, and lowers storage costs simultaneously. The GDPR principle of storage limitation makes this an explicit legal requirement, not just good practice: personal data may only be kept for as long as necessary for the stated purpose.