Apple Finds VLMs Leak Training Data Through Hidden Layers

Key Points

• Apple ML Research: probing residual streams reveals training data not in outputs
• First systematic comparison of information retention across VLM representational levels
• Vulnerability is architectural—output filtering cannot prevent internal representation leakage
• Research provides diagnostic framework for auditing model privacy-preserving properties
• Findings challenge assumption that controlling model outputs equals controlling learned information

References (1)

[1] Apple ML research reveals information leakage risks in vision-language models. Apple Machine Learning Research.

Picture a researcher at Apple running a standard "probe" on a vision-language model's internal activations. They've applied a simple linear classifier to the model's residual stream—the data highway where visual and textual information merge. What they find is alarming: private details from training images surface in ways the model owner never intended to expose. The model didn't generate this information. It leaked it.

This is the core finding of Apple's new research paper, "What Do Your Logits Know?" The work offers something the AI safety community rarely sees: concrete evidence that the problem with vision-language models goes beyond hallucination, because their internal representations retain information that owners assumed was inaccessible. The paper systematically demonstrates that probing those representations can extract training data that never appeared in any generated output.

The researchers focused on what they call "representational levels" within VLMs. As information flows through a model's residual stream, it passes through natural bottlenecks—low-dimensional projections where rich visual and textual data compress into denser forms. Apple's team found that these compression points don't discard information uniformly. Certain details survive in quantities sufficient for a simple linear classifier to detect them, even when the same information never surfaces in the model's explicit outputs.
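The kind of linear probe described above can be illustrated with a minimal sketch. It does not use Apple's models, data, or code. `make_activations` is a hypothetical helper that generates synthetic vectors standing in for residual-stream activations, with one binary attribute written along a random direction, and the probe is an ordinary least-squares classifier chosen for brevity, not taken from the paper.

```python
import numpy as np

def make_activations(n, dim, signal, rng):
    """Synthetic stand-in for hidden activations (an assumption, not real
    model data): Gaussian noise plus a binary attribute along one direction."""
    labels = rng.integers(0, 2, size=n)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    acts = rng.normal(size=(n, dim)) + signal * np.outer(2 * labels - 1, direction)
    return acts, labels

def fit_linear_probe(x, y):
    """Least-squares linear probe: regress +1/-1 targets on activations plus a bias."""
    xb = np.hstack([x, np.ones((len(x), 1))])
    w, *_ = np.linalg.lstsq(xb, 2.0 * y - 1.0, rcond=None)
    return w

def probe_accuracy(w, x, y):
    """Fraction of examples where the sign of the probe output matches the label."""
    xb = np.hstack([x, np.ones((len(x), 1))])
    return float(np.mean((xb @ w > 0) == (y == 1)))

rng = np.random.default_rng(0)
acts, labels = make_activations(2000, 64, signal=2.0, rng=rng)
train, test = slice(0, 1500), slice(1500, None)

# Probe for the attribute that really is encoded in the activations.
w = fit_linear_probe(acts[train], labels[train])
accuracy = probe_accuracy(w, acts[test], labels[test])

# Control: probe for an unrelated random attribute that is not encoded anywhere.
random_labels = rng.integers(0, 2, size=len(labels))
w_control = fit_linear_probe(acts[train], random_labels[train])
control = probe_accuracy(w_control, acts[test], random_labels[test])

print(f"encoded attribute: {accuracy:.3f}, unrelated attribute: {control:.3f}")
```

The control matters as much as the probe itself. A probe that also scores well on an attribute the activations cannot contain is measuring the classifier's capacity, not leakage.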

This matters because the AI safety field has largely treated hallucination and leakage as problems of generation. Defenses focus on output filtering, content moderation, and instruction tuning—all approaches that address what comes out of the model, not what's embedded within it. Apple's research suggests this framework may be fundamentally incomplete. When a linear probe applied to hidden layers can extract information that sophisticated output-level safeguards successfully suppressed, the vulnerability isn't in the model's mouth—it's in the architecture's memory.

The distinction matters for how the industry should respond. Guardrails address symptoms. Architectural changes address causes. If information persists in residual stream representations despite compression, then effective solutions require rethinking how VLMs process and discard data—not just adding more filters at generation time.

Apple's contribution extends beyond identifying the problem. The paper provides the first systematic comparison of information retention across different representational levels in vision-language models. By mapping where and how information survives compression, the research establishes a methodology for auditing model architectures for privacy-preserving properties. Future VLM designs could, in principle, be evaluated for their tendency to retain sensitive information before deployment.
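A rough sketch of what such a level-by-level audit might look like follows. Everything in it is an assumption made for illustration: the activations are synthetic, `projection` and `unembed` are random matrices standing in for a low-dimensional bottleneck and an output layer, and the three levels are hypothetical names, not the paper's actual setup. The printed numbers describe this toy only and say nothing about Apple's results.

```python
import numpy as np

def fit_probe(x, y):
    """Least-squares linear probe with a bias term."""
    xb = np.hstack([x, np.ones((len(x), 1))])
    w, *_ = np.linalg.lstsq(xb, 2.0 * y - 1.0, rcond=None)
    return w

def held_out_accuracy(x, y, n_train):
    """Fit a probe on the first n_train rows and score it on the remainder."""
    w = fit_probe(x[:n_train], y[:n_train])
    xb = np.hstack([x[n_train:], np.ones((len(x) - n_train, 1))])
    return float(np.mean((xb @ w > 0) == (y[n_train:] == 1)))

def top_k_only(logits, k):
    """Keep each row's k largest logits in place and zero out the rest."""
    kept = np.zeros_like(logits)
    idx = np.argpartition(logits, -k, axis=1)[:, -k:]
    np.put_along_axis(kept, idx, np.take_along_axis(logits, idx, axis=1), axis=1)
    return kept

rng = np.random.default_rng(0)
n, dim, vocab = 3000, 64, 200

# Synthetic "residual stream": noise plus one binary attribute along a direction.
labels = rng.integers(0, 2, size=n)
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)
residual = rng.normal(size=(n, dim)) + 2.0 * np.outer(2 * labels - 1, direction)

projection = rng.normal(size=(dim, 8)) / np.sqrt(dim)   # hypothetical 8-d bottleneck
unembed = rng.normal(size=(dim, vocab)) / np.sqrt(dim)  # hypothetical output layer

levels = {
    "residual": residual,
    "projection_8d": residual @ projection,
    "top_10_logits": top_k_only(residual @ unembed, 10),
}

# Run the same probe at every level so the accuracies are directly comparable.
report = {name: held_out_accuracy(x, labels, n_train=2000) for name, x in levels.items()}
for name, acc in report.items():
    print(f"{name}: {acc:.3f}")
```

Holding the probe and the train/test split fixed across levels is the design choice that makes the comparison meaningful: any difference in accuracy then reflects what each representation retains, not how hard the probe was tuned.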

The implications ripple outward. Privacy regulations increasingly require organizations to control what data models have "learned." Apple's work suggests that controlling outputs isn't equivalent to controlling representations: a model might comply with every generation policy while its hidden activations still encode sensitive information that careful probing could recover. For enterprise deployments handling medical records, financial data, or personal images, this distinction carries real legal and ethical weight.

The research also connects to broader interpretability questions. Understanding which information survives compression helps researchers map what models actually represent versus what they merely generate. That mapping, in turn, feeds into efforts to build AI systems whose internal states align with their external behavior—a goal that remains elusive but increasingly urgent as these models embed deeper into critical infrastructure.

Apple's team stops short of prescribing specific architectural fixes. But their diagnostic framework points clearly toward one conclusion: the next generation of VLM safety research must focus on what happens inside the model, not just what emerges from it.