Articles Tagged: energy per token

2 articles found

Brain-Inspired Chips Could Slash AI Energy Use and Put LLMs on the Edge

What if a helpful language model could run all day on a smartwatch battery, or a data center could slash its AI power bill without sacrificing responsiveness? That promise is animating a surge of activity around brain-inspired chips. Neuromorphic processors fire only when events happen; memristor accelerators move math into memory arrays to avoid data shuttling. Recent papers show transformer-like workloads reimagined for event-driven execution, spiking attention mechanisms, and co-design strategies that train models to tolerate analog noise and drift. The remaining question is the one buyers care about: How do these systems stack up, end to end, against today’s GPUs and NPUs on energy per token, latency, and accuracy?
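To ground the comparison the article calls for, here is a minimal sketch of how energy per token is typically derived from average power draw and decode throughput. The device names and numbers below are illustrative assumptions chosen to show the shape of the calculation, not measurements from the papers discussed.

```python
# Minimal sketch: energy per token from average power and decode throughput.
# All device names and figures here are illustrative assumptions, not
# benchmark results from the surveyed work.

def energy_per_token_joules(avg_power_watts: float, tokens_per_second: float) -> float:
    """Energy per token (J/token) = average power draw / decode throughput."""
    return avg_power_watts / tokens_per_second

# Hypothetical comparison points, chosen only to illustrate the arithmetic.
candidates = {
    "gpu_baseline":           {"power_w": 300.0, "tok_per_s": 1200.0},
    "neuromorphic_prototype": {"power_w": 5.0,   "tok_per_s": 40.0},
}

for name, spec in candidates.items():
    ept = energy_per_token_joules(spec["power_w"], spec["tok_per_s"])
    print(f"{name}: {ept * 1000:.1f} mJ/token")
```

A lower joules-per-token figure only matters if responsiveness and quality hold up, which is why the end-to-end comparisons the article asks for pair this metric with latency and accuracy.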

neuromorphic computing, memristor, in-memory computing, +16 more

New AI Efficiency Breakthroughs Could Shrink LLM Costs and Put Powerful Models On‑Device

Two complementary ideas are redefining how far efficiency can go without retraining: principled post‑training quantization that pushes small models toward 2–3‑bit weights, and runtime precision control that adapts layer precision token by token. According to Exploring Layer‑wise Information Effectiveness for Post‑Training Quantization in Small Language Models (LieQ) and DP‑LLM: Runtime Model Adaptation with Dynamic Layer‑wise Precision Assignment, these approaches can reduce memory and latency while preserving quality for compact models. A broad survey, A Comprehensive Evaluation on Quantization Techniques for Large Language Models, puts GPTQ, AWQ, SmoothQuant, SpinQuant, OmniQuant, and FP4 variants in context across tasks, while HCAttention: Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs targets KV‑cache memory for long‑context runs. Researchers have also demonstrated low‑bit execution on general‑purpose CPUs: Highly Optimized Kernels and Fine‑Grained Codebooks for LLM Inference on Arm CPUs presents 4‑bit and 2‑bit kernels with SIMD‑aware packing and measured throughput gains, and Selective Quantization Tuning for ONNX Models (TuneQn) offers a practical model‑selection workflow spanning ONNX Runtime and TVM. Together, this work points to tangible cloud cost savings and credible on‑device deployments, provided teams respect deployment realities such as runtime support, testing overhead, and quality guardrails.
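To make the first of these ideas concrete, here is a minimal sketch of symmetric round-to-nearest post‑training weight quantization, the simple baseline that methods such as GPTQ, AWQ, and LieQ refine with calibration and error compensation. The 3‑bit setting, per‑channel grouping, and layer shape are generic assumptions rather than any paper's exact recipe.

```python
# Minimal sketch of symmetric, per-output-channel post-training weight quantization.
# The 3-bit setting and random weights are illustrative assumptions; the cited
# methods add calibration and error-compensation steps on top of this baseline.
import numpy as np

def quantize_weights(w: np.ndarray, bits: int = 3):
    """Quantize a weight matrix w of shape [out_features, in_features]."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 3 for signed 3-bit
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)        # guard against all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

# Rough reconstruction-error check on a random layer (illustrative only).
w = np.random.randn(64, 256).astype(np.float32)
q, s = quantize_weights(w, bits=3)
print(f"mean abs reconstruction error: {np.abs(w - dequantize(q, s)).mean():.4f}")
```

Runtime schemes in the spirit of DP‑LLM then assign a precision per layer on the fly rather than once offline, so a primitive like this would be applied at more than one bit width, with the selection policy doing the work.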

post‑training quantization, dynamic precision, LieQ, +12 more