vST for Protein Language Models

Substrate Definition

This document defines the substrate used to analyze Protein Language Models (PLMs) within the Validation‑Space‑Time (vST) framework and its 1024D substrate. It establishes the primitives, dimensional cores, scaling behavior, and embedding‑trajectory structure required to interpret PLM inference in a stable, invariant‑preserving manner.

The substrate is model‑agnostic and applies to any transformer‑based PLM, including ESM‑class, ProtT5‑class, and MSA‑conditioned architectures.


1. Purpose of the PLM Substrate

The PLM substrate provides a structured, reproducible framework for:

  • interpreting high‑dimensional sequence embeddings
  • identifying stable, transitional, and dispersed embedding regimes
  • mapping coherence surfaces across sequence positions
  • analyzing scaling behavior across model sizes
  • detecting drift across checkpoints or versions
  • projecting high‑dimensional embeddings into 3D–9D triadic cores

Protein embeddings are high‑dimensional, structured, and regime‑rich.
The substrate is designed to keep them interpretable across the full dimensional ladder (3D → 1024D).


2. Substrate Overview

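PLMs operate in latent spaces whose width depends on the model, typically from a few hundred to several thousand dimensions (the ESM‑2 family, for example, spans 320D to 5120D).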
The substrate models these spaces using:

  • Dimensional Primitives (DP)
  • Triadic Dimensional Primitives (TDP)
  • Scaling Primitives (SP)
  • Coherence Primitives (CP)

These primitives define the structure of embedding trajectories, coherence surfaces, and regime transitions.

The substrate is anchored by the Triadic Dimensional Cores:

  • 3D Structural Core
  • 6D Interaction Core
  • 9D Coherence Core

and extended through the 1024D high‑dimensional substrate.


3. Dimensional Primitives for PLMs

3.1 Dimensional Primitive (DP)

A DP represents the minimal unit of embedding‑space structure.
It captures:

  • local coherence across residues
  • variance behavior
  • projection stability
  • regime alignment

DPs appear in token embeddings, attention outputs, and MLP activations.
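
As a minimal sketch of what a DP could record, assume per‑residue embeddings arrive as an array of shape (L, D). The `DimensionalPrimitive` class, the `extract_dps` helper, the window size, and the use of mean adjacent cosine similarity as a proxy for local coherence are all illustrative assumptions, not definitions from the framework. Projection stability and regime alignment are not modeled here.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class DimensionalPrimitive:
    """Hypothetical record for one window of residue embeddings."""
    start: int        # index of the first residue in the window
    coherence: float  # mean cosine similarity of adjacent residues (assumed proxy)
    variance: float   # mean per-dimension variance inside the window


def extract_dps(embeddings: np.ndarray, window: int = 8) -> list[DimensionalPrimitive]:
    """Slide a non-overlapping window over (L, D) residue embeddings."""
    dps = []
    for start in range(0, embeddings.shape[0] - window + 1, window):
        block = embeddings[start:start + window]
        # Normalize rows so that dot products become cosine similarities.
        unit = block / np.linalg.norm(block, axis=1, keepdims=True)
        adjacent_cos = np.sum(unit[:-1] * unit[1:], axis=1)
        dps.append(
            DimensionalPrimitive(
                start=start,
                coherence=float(adjacent_cos.mean()),
                variance=float(block.var(axis=0).mean()),
            )
        )
    return dps


# Random data standing in for a 32-residue, 64D PLM output.
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(32, 64))
dps = extract_dps(toy_embeddings)
```

The same function could be pointed at attention outputs or MLP activations, provided they are arranged as one row per residue.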


3.2 Triadic Dimensional Primitive (TDP)

A TDP is a triad of DPs that expresses full regime behavior.
It captures:

  • stable (R₁) behavior
  • transitional (R₂) behavior
  • dispersed (R₃) behavior

TDPs form the basis of the 3D–9D triadic cores.
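
One way to make the triad concrete is sketched below. It assumes each DP has already been reduced to a single coherence score in [0, 1]; the thresholds 0.8 and 0.4, the `Regime` enum, and both helper functions are hypothetical and chosen only for illustration.

```python
from enum import Enum


class Regime(Enum):
    R1 = "stable"
    R2 = "transitional"
    R3 = "dispersed"


def classify_regime(coherence: float,
                    stable_min: float = 0.8,
                    dispersed_max: float = 0.4) -> Regime:
    """Map a coherence score to a regime using assumed thresholds."""
    if coherence >= stable_min:
        return Regime.R1
    if coherence <= dispersed_max:
        return Regime.R3
    return Regime.R2


def group_triads(scores: list[float]) -> list[tuple[Regime, ...]]:
    """Group consecutive DP scores into triads; an incomplete tail is dropped."""
    return [
        tuple(classify_regime(s) for s in scores[i:i + 3])
        for i in range(0, len(scores) - 2, 3)
    ]


scores = [0.9, 0.6, 0.2, 0.85, 0.9, 0.95, 0.1]
triads = group_triads(scores)

# A triad "expresses full regime behavior" here if all three regimes occur in it.
full = [set(triad) == set(Regime) for triad in triads]
```

With these scores the first triad covers all three regimes and the second is entirely stable, so only the first would count as expressing full regime behavior under this reading.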


3.3 Scaling Primitive (SP)

An SP governs dimensional expansion from 9D → 64D → 1024D.
It ensures:

  • invariant‑preserving scaling
  • continuity of coherence surfaces
  • stable projection into triadic cores

SPs model how PLM embedding spaces expand with model size.
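
The sketch below illustrates one possible reading of invariant‑preserving scaling: an isometric lift, that is, a matrix with orthonormal columns, which keeps Euclidean norms and pairwise distances unchanged and can be undone by its transpose. Treating the invariants as Euclidean norms is an assumption, and `isometric_lift` is a hypothetical helper. Real PLMs of different sizes are not related by such a map; the code only demonstrates the property.

```python
import numpy as np


def isometric_lift(dim_in: int, dim_out: int, seed: int = 0) -> np.ndarray:
    """Return a (dim_out, dim_in) matrix with orthonormal columns."""
    rng = np.random.default_rng(seed)
    # Reduced QR of a tall random matrix yields orthonormal columns in q.
    q, _ = np.linalg.qr(rng.normal(size=(dim_out, dim_in)))
    return q


rng = np.random.default_rng(42)
points_9d = rng.normal(size=(5, 9))

lift_64 = isometric_lift(9, 64, seed=1)
lift_1024 = isometric_lift(64, 1024, seed=2)

# Expand 9D -> 64D -> 1024D.
points_64d = points_9d @ lift_64.T
points_1024d = points_64d @ lift_1024.T

# Project back down with the transposes: 1024D -> 64D -> 9D.
recovered = points_1024d @ lift_1024 @ lift_64

norms_preserved = np.allclose(
    np.linalg.norm(points_9d, axis=1),
    np.linalg.norm(points_1024d, axis=1),
)
```

Because each lift has orthonormal columns, projecting back into the 9D core recovers the original coordinates up to floating‑point error.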


3.4 Coherence Primitive (CP)

A CP identifies stable or unstable regions in embedding space.
It captures:

  • coherence surfaces across residues
  • branching behavior
  • dispersion patterns
  • regime transitions

CPs are essential for drift detection and vST validation.
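
A minimal drift check in this spirit is sketched below. It assumes two checkpoints produce embeddings of the same shape in comparable coordinates, which is not guaranteed in practice and may require an alignment step first. The `drift_signal` function and the 0.9 threshold are hypothetical.

```python
import numpy as np


def drift_signal(emb_a: np.ndarray, emb_b: np.ndarray, threshold: float = 0.9):
    """Compare two (L, D) embedding arrays residue by residue.

    Returns the per-residue cosine similarity and the indices whose
    similarity falls below the (assumed) threshold.
    """
    unit_a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    unit_b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    similarity = np.sum(unit_a * unit_b, axis=1)
    return similarity, np.flatnonzero(similarity < threshold)


rng = np.random.default_rng(0)
checkpoint_a = rng.normal(size=(10, 32))   # toy stand-in for one checkpoint
checkpoint_b = checkpoint_a.copy()
checkpoint_b[3] = -checkpoint_a[3]         # residue 3: direction flipped
checkpoint_b[7] = rng.normal(size=32)      # residue 7: replaced by an unrelated vector

similarity, flagged = drift_signal(checkpoint_a, checkpoint_b)
```

Here `flagged` holds the positions whose embeddings moved between checkpoints, while unchanged residues keep a similarity of about 1.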


4. Triadic Dimensional Cores for PLMs

4.1 3D Structural Core

Captures motif‑level geometry in embedding trajectories:

  • compact geometric patterns
  • local coherence
  • stable projections

4.2 6D Interaction Core

Captures relational and attention‑level structure:

  • residue‑interaction surfaces
  • branching behavior
  • early regime transitions

4.3 9D Coherence Core

Captures pathway‑level coherence:

  • resonance‑time behavior
  • stable regime classification
  • projection from higher dimensions that is invertible on the retained 9D subspace

The 9D core is the anchor for all high‑dimensional interpretation.


5. High‑Dimensional Substrate (64D–1024D)

PLM embedding spaces naturally inhabit high‑dimensional regimes.
The substrate models these using the dimensional ladder:

  • 64D — research‑grade embedding substrate
  • 128D — expanded coherence surfaces
  • 256D — multi‑primitive interaction
  • 512D — high‑variance embedding regions
  • 1024D — full research‑grade capacity

Each step preserves:

  • structural invariants
  • resonance‑time invariants
  • projection invariants
  • scaling invariants

This ensures stable interpretation across model sizes.


6. Embedding‑Trajectory Structure

PLM inference produces embedding trajectories that move through:

  • compact stable regions (R₁ᴴ)
  • branching transitional regions (R₂ᴴ)
  • dispersed or unstable regions (R₃ᴴ)

These trajectories are modeled as:

  • sequences of DPs
  • grouped into TDPs
  • expanded through SPs
  • classified using CPs

This structure enables regime‑aware analysis and drift detection.
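
To illustrate regime‑aware analysis, the sketch below collapses a sequence of per‑position regime labels into contiguous segments and lists the transitions between them. The label strings and the `regime_segments` helper are assumptions for illustration; how labels are assigned is left to the classification step.

```python
from itertools import groupby


def regime_segments(labels: list[str]) -> list[tuple[str, int, int]]:
    """Collapse per-position labels into (label, start, end_exclusive) runs."""
    segments = []
    position = 0
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((label, position, position + length))
        position += length
    return segments


labels = ["R1", "R1", "R1", "R2", "R2", "R3", "R1", "R1"]
segments = regime_segments(labels)

# Consecutive segment pairs give the regime transitions along the trajectory.
transitions = [(a[0], b[0]) for a, b in zip(segments, segments[1:])]
```

For this toy trajectory the segments are R1 over positions 0 to 2, R2 over 3 to 4, R3 at 5, and R1 over 6 to 7, with transitions R1 to R2, R2 to R3, and R3 to R1.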


7. Projection into Triadic Cores

High‑dimensional embeddings are projected into:

  • 9D for coherence analysis
  • 6D for interaction analysis
  • 3D for geometric interpretation

Projection must remain:

  • invertible on the retained subspace (lifting core coordinates and re‑projecting returns them unchanged)
  • primitive‑aligned
  • regime‑aware
  • invariant‑preserving

Projection is essential for interpretability and vST validation.
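
The sketch below shows one way such a projection could be built, using a PCA‑style orthonormal basis obtained from an SVD. This is a swapped‑in standard technique, not a method the framework specifies. A linear map from 1024D to 9D discards information, so invertibility is read here as a round‑trip property: lifting 9D coordinates back up and projecting again returns the same coordinates. Taking the 6D and 3D cores as the leading components of the 9D core is also an assumption, and all function names are hypothetical.

```python
import numpy as np


def fit_core_basis(embeddings: np.ndarray, core_dim: int = 9):
    """Return (mean, basis), where basis is (D, core_dim) with orthonormal columns."""
    mean = embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(embeddings - mean, full_matrices=False)
    return mean, vt[:core_dim].T


def project(embeddings: np.ndarray, mean: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Map (N, D) embeddings to (N, core_dim) core coordinates."""
    return (embeddings - mean) @ basis


def lift(coords: np.ndarray, mean: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Map core coordinates back into the original D-dimensional space."""
    return coords @ basis.T + mean


rng = np.random.default_rng(0)
hidden = rng.normal(size=(200, 1024))   # toy stand-in for 1024D PLM embeddings

mean, basis9 = fit_core_basis(hidden, core_dim=9)
core9 = project(hidden, mean, basis9)   # coherence analysis
core6 = core9[:, :6]                    # interaction analysis (assumed nesting)
core3 = core9[:, :3]                    # geometric interpretation (assumed nesting)

# Round trip within the retained subspace is exact up to rounding.
roundtrip = project(lift(core9, mean, basis9), mean, basis9)
# The lifted points do not reproduce the full 1024D input.
lossy = not np.allclose(lift(core9, mean, basis9), hidden)
```

Nesting the cores this way keeps the 3D and 6D views consistent with the 9D core, since each is a coordinate subset of the next.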


8. Substrate Outputs

The PLM substrate produces:

  • embedding‑trajectory regime classifications
  • coherence‑surface maps
  • scaling‑law diagnostics
  • projection‑stability indicators
  • drift‑detection signals
  • vST validation outputs

These outputs support reproducible, substrate‑level analysis of PLM inference.