A trillion-parameter language model just ran on a MacBook. Kimi K2.5, the colossal AI system from Moonshot AI, executed inference using only 32 billion active weights at any moment, loaded into 96GB of unified memory on an M2 Max MacBook Pro. The technique enabling this feat, called streaming experts, represents a fundamental shift in how we think about deploying massive neural networks on consumer hardware.
Traditional large language models require fitting every parameter into RAM simultaneously. A model with one trillion parameters demands roughly 2 terabytes of memory at fp16 precision, far beyond any consumer device. Streaming experts sidesteps this constraint by exploiting the sparse activation pattern inherent to Mixture-of-Experts (MoE) architectures: rather than loading the entire model, the system keeps expert weights on the SSD and streams in only the experts the router selects for each token, fetching approximately 12GB per token from disk.
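The core idea can be sketched in a few lines. The toy MoE layer below keeps every expert's weights in files on disk and, per token, reads back only the top-k experts the router picks. All names, shapes, and the memory-mapped `.npy` layout are illustrative assumptions for this sketch, not Kimi K2.5's actual on-disk format.

```python
# Minimal sketch of expert streaming for one MoE layer. Expert weights live
# on "disk" as .npy files; only the router-selected experts are read per token.
import os
import tempfile
import numpy as np

D_MODEL, D_FF = 64, 256        # toy dimensions (illustrative)
N_EXPERTS, TOP_K = 8, 2        # route each token to 2 of 8 experts

# Write all expert weights to disk once, standing in for the model file on SSD.
rng = np.random.default_rng(0)
tmpdir = tempfile.mkdtemp()
paths = []
for e in range(N_EXPERTS):
    w = rng.standard_normal((D_MODEL, D_FF)).astype(np.float16)
    path = os.path.join(tmpdir, f"expert_{e}.npy")
    np.save(path, w)
    paths.append(path)

def stream_moe_layer(x: np.ndarray, router_logits: np.ndarray) -> np.ndarray:
    """Apply the MoE layer, loading only the top-k experts' weights from disk."""
    top = np.argsort(router_logits)[-TOP_K:]   # indices of selected experts
    gates = np.exp(router_logits[top])
    gates /= gates.sum()                       # softmax over the selected experts
    out = np.zeros(D_FF, dtype=np.float32)
    for gate, e in zip(gates, top):
        # mmap_mode="r": only the pages actually touched are pulled off disk,
        # so unselected experts cost no memory at all.
        w = np.load(paths[e], mmap_mode="r")
        out += gate * (x.astype(np.float32) @ np.asarray(w, dtype=np.float32))
    return out

x = rng.standard_normal(D_MODEL).astype(np.float16)
logits = rng.standard_normal(N_EXPERTS)
y = stream_moe_layer(x, logits)
print(y.shape)  # (256,)
```

A real implementation adds prefetching, caching of recently used experts, and batched reads, but the memory win comes entirely from this per-token selectivity: resident weights scale with top-k, not with the expert count.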
The progression from concept to consumer hardware has been remarkably rapid. Just five days earlier, researcher Dan Woods demonstrated Qwen3.5-397B-A17B running in 48GB of RAM. The jump to trillion-parameter scale on readily available hardware occurred within a single workweek of optimization by the tinkerer community. @seikixtc achieved the Kimi K2.5 milestone, while @anemll separately demonstrated Qwen3.5-397B running on an iPhone at 0.6 tokens per second: glacial by cloud standards, but functioning nonetheless on a device with no active cooling and severe thermal constraints.
The implications extend beyond mere novelty. Streaming experts challenges the prevailing assumption that frontier AI requires data-center infrastructure. Kimi K2.5's 32 billion active weights give it the per-token footprint of a mid-size dense model (roughly half the size of Llama 3.1 70B), yet the streaming approach reduces memory requirements by roughly 96% relative to loading the full trillion parameters. For researchers and developers without access to expensive cloud GPU clusters, this technique opens new experimental possibilities that were previously economically prohibitive.
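The back-of-envelope arithmetic behind these figures is worth making explicit, using only the numbers stated in the article (one trillion total parameters, 32 billion active, fp16 at 2 bytes per weight):

```python
# Memory arithmetic from the article's stated figures.
total_params = 1e12      # trillion-parameter model
active_params = 32e9     # active weights per token
bytes_fp16 = 2           # bytes per weight at fp16 precision

dense_gb = total_params * bytes_fp16 / 1e9    # load everything into RAM
active_gb = active_params * bytes_fp16 / 1e9  # hold only the active experts

print(dense_gb)                   # 2000.0 -> ~2 TB, far beyond consumer RAM
print(active_gb)                  # 64.0   -> fits in 96 GB unified memory
print(1 - active_gb / dense_gb)   # ~0.97, i.e. roughly a 96% reduction
```

The 64GB of active weights leaves headroom within the MacBook's 96GB unified memory for activations, the KV cache, and a cache of recently streamed experts.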
The approach does introduce tradeoffs. Fetching weights from SSD adds latency that pure RAM-based inference avoids. Token generation speeds remain modest compared to optimized cloud deployments. However, the technique continues advancing—autoresearch loops are actively optimizing the streaming pipeline, and each iteration squeezes additional performance from the same hardware constraints.
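The storage-bandwidth bottleneck can also be quantified from the article's ~12GB-per-token figure. The drive bandwidths below are assumed, illustrative values, not measurements of any specific machine:

```python
# Rough latency floor implied by streaming ~12 GB of expert weights per token:
# even with perfect overlap, SSD read bandwidth caps token throughput.
bytes_per_token = 12e9  # figure stated in the article

# Assumed, illustrative sequential-read bandwidths (bytes/second).
drives = [
    ("SATA SSD (~0.5 GB/s)", 0.5e9),
    ("Mid-range NVMe (~5 GB/s)", 5e9),
    ("Fast NVMe (~7 GB/s)", 7e9),
]

for label, bandwidth in drives:
    seconds = bytes_per_token / bandwidth
    print(f"{label}: {seconds:.1f} s/token -> {1 / seconds:.2f} tok/s")
```

On these assumptions even a fast NVMe drive caps throughput below one token per second, which is why caching hot experts and prefetching the next token's experts are the natural targets for the pipeline optimization the text describes.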
Whether this democratizes access to frontier AI or merely shifts the bottleneck to storage bandwidth remains an open question. What is certain: the assumption that trillion-parameter models require trillion-dollar infrastructure is no longer unassailable.