Pig Butchering 2.0: LLM Code Provenance Graphs

Use “Code Provenance Graphs” that map AI-generated snippets back to their originating sources, including public repos and dataset versions. These graphs allow security teams to see whether a piece of code originated from a trustworthy project or a malicious supply-chain insertion.

Current tools such as dependency graphs and Software Composition Analysis (SCA) focus on binary and package dependencies. They do not yet capture the semantic and generative origins of AI-authored code.

Code Provenance Graphs combine semantic AST mapping with model-source linking, reconstructing a directed acyclic graph (DAG) of generative lineage for every AI-assisted commit.

Core Components:

3.1 AST Level Semantic Mapping

Parse AI-generated snippets into Abstract Syntax Trees (ASTs), canonicalize them, and fingerprint each subtree with a Merkle-style hash, so that a node's hash covers the hashes of its children.
Compare the fingerprints against known public repositories using fuzzy hashing (e.g., ssdeep, SimHash) and semantic embedding similarity.
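
The fingerprinting step can be sketched with Python's standard ast and hashlib modules. This is a minimal illustration, not a reference implementation: the function names merkle_fingerprints and subtree_overlap are hypothetical, and the canonicalization is an assumed simplification that keeps only node types and tree shape, dropping identifier names and literal values.

```python
import ast
import hashlib


def merkle_fingerprints(source: str) -> tuple[str, set[str]]:
    """Return (root_hash, all_subtree_hashes) for a Python snippet.

    Assumed canonicalization: only node types and tree shape are hashed,
    so identifier names and literal values do not affect the result.
    """
    subtree_hashes: set[str] = set()

    def visit(node: ast.AST) -> str:
        # Merkle step: a node's hash covers its type and its children's hashes.
        child_hashes = [visit(child) for child in ast.iter_child_nodes(node)]
        payload = f"{type(node).__name__}({','.join(child_hashes)})"
        digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        subtree_hashes.add(digest)
        return digest

    root = visit(ast.parse(source))
    return root, subtree_hashes


def subtree_overlap(left: str, right: str) -> float:
    """Jaccard overlap of two snippets' subtree-hash sets (0.0 to 1.0)."""
    _, a = merkle_fingerprints(left)
    _, b = merkle_fingerprints(right)
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    original = "def add(a, b):\n    return a + b\n"
    renamed = "def total(x, y):\n    return x + y\n"
    print(subtree_overlap(original, renamed))
```

Because names are discarded, a renamed copy of a function produces the same root hash, which is the property that makes the fingerprints useful for matching. The same simplification also means that a + b and a + a collide, so a real system would likely need a richer canonical form.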

3.2 Generative Source Attribution

Cross-reference snippet fingerprints with a database of LLM-training data indices (public GitHub corpora, open datasets).
Identify probable origin projects using vector similarity search (e.g., a FAISS index, treating matches above 0.9 cosine similarity as possible sources).
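
The thresholding logic can be illustrated with a brute-force search in plain Python. This is only a sketch: at corpus scale an approximate-nearest-neighbor index such as FAISS would replace the loop, and the function name probable_sources, the repository names, and the toy three-dimensional embeddings are all assumptions made for the example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def probable_sources(
    query: list[float],
    index: dict[str, list[float]],
    threshold: float = 0.9,
) -> list[tuple[str, float]]:
    """Return (repo, similarity) pairs above the threshold, best match first."""
    scored = ((repo, cosine_similarity(query, vec)) for repo, vec in index.items())
    matches = [(repo, score) for repo, score in scored if score > threshold]
    return sorted(matches, key=lambda match: match[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical embeddings keyed by hypothetical repository names.
    index = {
        "example/repo-a": [1.0, 0.0, 0.0],
        "example/repo-b": [0.0, 1.0, 0.0],
        "example/repo-c": [0.9, 0.1, 0.0],
    }
    for repo, score in probable_sources([1.0, 0.0, 0.0], index):
        print(f"{repo}: {score:.3f}")
```

A match above the threshold is only a candidate origin. It says the snippet resembles code in that project, not that the model actually drew on it.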

3.3 Trust Edge Weights

Each edge in the graph (source → snippet) carries a trust score calculated from:
Trust = Repo_Reputation × Contributor_Score × Temporal_Recency × Integrity_Verification
The Trust Score decays dynamically over time or with unverified contributors.
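
A worked version of the trust formula might look like the following. The source does not specify how each factor is scaled or how the decay is shaped, so this sketch assumes every factor lies in the range 0 to 1 and models Temporal_Recency as exponential decay with a one-year half-life. The names TrustInputs, temporal_recency, and trust_score are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrustInputs:
    repo_reputation: float         # assumed range 0..1
    contributor_score: float       # assumed range 0..1; low for unverified contributors
    age_days: float                # days since the source was last verified
    integrity_verification: float  # assumed range 0..1, e.g. 1.0 if signatures check out


def temporal_recency(age_days: float, half_life_days: float = 365.0) -> float:
    """Assumed decay model: the factor halves every `half_life_days`."""
    return 0.5 ** (max(age_days, 0.0) / half_life_days)


def trust_score(inputs: TrustInputs, half_life_days: float = 365.0) -> float:
    """Multiply the four factors from the formula into one edge weight."""
    factors = (
        inputs.repo_reputation,
        inputs.contributor_score,
        temporal_recency(inputs.age_days, half_life_days),
        inputs.integrity_verification,
    )
    for factor in factors:
        if not 0.0 <= factor <= 1.0:
            raise ValueError(f"trust factor out of range: {factor}")
    score = 1.0
    for factor in factors:
        score *= factor
    return score


if __name__ == "__main__":
    edge = TrustInputs(
        repo_reputation=0.8,
        contributor_score=0.5,
        age_days=365.0,
        integrity_verification=1.0,
    )
    print(round(trust_score(edge), 3))
```

Because the factors multiply, a single weak signal (for example, an unverified contributor) pulls the whole edge weight down, and a factor of zero vetoes the edge entirely.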

3.4 Visualization & Integration

Render the graph in a graph database such as Neo4j or Amazon Neptune, enabling visual inspection of lineage.
Integrate with CI/CD via signed commit metadata to automatically halt builds if code paths trace back to low-trust sources.
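
The build gate can be sketched as an upstream walk over the lineage graph. This is an illustration under stated assumptions: the graph is held in memory as a dictionary rather than queried from a graph database, the 0.5 threshold and node identifiers are invented for the example, and verification of the signed commit metadata is left out.

```python
# Lineage graph: node -> list of (source, trust) edges pointing upstream.
Graph = dict[str, list[tuple[str, float]]]


def low_trust_ancestry(
    graph: Graph, start: str, threshold: float = 0.5
) -> list[tuple[str, str, float]]:
    """Return (source, node, trust) for every upstream edge below the threshold."""
    flagged: list[tuple[str, str, float]] = []
    seen: set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        for source, trust in graph.get(node, []):
            if trust < threshold:
                flagged.append((source, node, trust))
            stack.append(source)  # keep walking toward the original sources
    return flagged


def gate_build(graph: Graph, commit: str, threshold: float = 0.5) -> int:
    """Return a process exit code: 0 lets the build continue, 1 halts it."""
    flagged = low_trust_ancestry(graph, commit, threshold)
    for source, node, trust in flagged:
        print(f"low-trust lineage: {source} -> {node} (trust={trust:.2f})")
    return 1 if flagged else 0


if __name__ == "__main__":
    # Hypothetical lineage for one AI-assisted commit.
    lineage: Graph = {
        "commit:abc": [("snippet:1", 1.0)],
        "snippet:1": [("repo:good", 0.9), ("repo:shady", 0.2)],
    }
    raise SystemExit(gate_build(lineage, "commit:abc"))
```

Returning an exit code keeps the check easy to drop into most CI systems, where a non-zero status from any step stops the pipeline.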

3.5 Standardization Route

Extend CycloneDX or SLSA provenance frameworks to support ai-origin and trust-score nodes.
Optionally anchor graph digests on a blockchain for immutability.
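
One possible shape for such an extension, sketched without any SBOM library: CycloneDX components can carry a list of name/value properties, so the example below records the origin and trust score there and then computes a SHA-256 digest over a canonical JSON form that could be anchored externally. The property names cpg:ai-origin and cpg:trust-score, the helper names, and the sample values are hypothetical and not part of any published schema.

```python
import hashlib
import json


def ai_component(name: str, origin_repo: str, trust: float) -> dict:
    """Build a CycloneDX-style component dict with assumed provenance properties."""
    return {
        "type": "file",
        "name": name,
        "properties": [
            # Hypothetical property names; CycloneDX property values are strings.
            {"name": "cpg:ai-origin", "value": origin_repo},
            {"name": "cpg:trust-score", "value": f"{trust:.2f}"},
        ],
    }


def graph_digest(document: dict) -> str:
    """SHA-256 over canonical JSON, so key order does not change the digest."""
    canonical = json.dumps(document, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    bom = {
        "bomFormat": "CycloneDX",
        "specVersion": "1.5",
        "components": [ai_component("utils/parse.py", "example/repo-a", 0.72)],
    }
    print(graph_digest(bom))
```

Only the digest would need to be anchored. The graph itself can stay in the graph database, and anyone holding a copy can recompute the digest to confirm it has not been altered.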


Outcome:

Code Provenance Graphs (CPGs) turn source attribution into a forensic, queryable graph, giving SOC and DevSecOps teams visibility into where AI-generated code actually came from and how much it can be trusted.
Code Provenance Graphs act as the correlation engine between visibility and behavior.
They ingest lineage data from AI SBOMs, integrate risk scores from Honeypot telemetry, and respond dynamically to IDE alerts from Prompt Anomaly Detection.
When anomalies are confirmed, the graph propagates trust-score updates back to the SBOM registry, enforcing end-to-end provenance integrity across the software lifecycle.


Upcoming:

The authors have spread their research, insights, and conclusions across the four parts of the Pig Butchering series. The final part is:

Part 4/4 - Prompt Anomaly Detection in IDEs

Detailed material for access – GitHub

Co-authored by

Venkatakrishnan Jayakumar is a seasoned cloud and DevOps leader with over two decades of experience transforming enterprise IT—from physical infrastructure deployments to cloud-native, scalable architectures. His expertise spans infrastructure migration, cloud architecture, Kubernetes, and automation, helping organizations accelerate time to market without compromising security or reliability.

Before joining Infiligence, Venkat led the DevOps and Cloud Center of Excellence at Concentrix Catalyst, delivering scalable solutions for global enterprises like Honeywell and Charter. Earlier, he drove large-scale data center migrations at Zurich and engineered modern infrastructure solutions involving blade servers and enterprise storage systems.

At his core, Venkat is passionate about building secure, resilient, and high-performing platforms that empower businesses to innovate with confidence.


Ajitha Ravichandran is an experienced QA engineer with a strong background in automation testing, CI/CD integration, and quality engineering for cloud-native applications. She brings hands-on expertise in designing and implementing robust testing frameworks that ensure secure, scalable, and high-performing enterprise solutions.

At Infiligence, Ajitha focuses on building next-generation platform engineering solutions that unify security, observability, and automation—helping clients achieve faster delivery cycles and stronger operational governance.

Her earlier experience spans cloud migration projects, Kubernetes deployments, and automation frameworks that streamline application lifecycle management across hybrid and multi-cloud ecosystems. Ajitha is passionate about driving engineering excellence and enabling teams to build with confidence in the cloud.