Articles Tagged: tvm

1 article found

New AI Efficiency Breakthroughs Could Shrink LLM Costs and Put Powerful Models On‑Device

Two complementary ideas are redefining how far efficiency can go without retraining: principled post‑training quantization that pushes small models toward 2–3‑bit weights, and runtime precision control that adapts layer precision token by token. According to Exploring Layer‑wise Information Effectiveness for Post‑Training Quantization in Small Language Models (LieQ) and DP‑LLM: Runtime Model Adaptation with Dynamic Layer‑wise Precision Assignment, these approaches can reduce memory and latency while preserving quality for compact models. A broad survey, A Comprehensive Evaluation on Quantization Techniques for Large Language Models, puts GPTQ, AWQ, SmoothQuant, SpinQuant, OmniQuant and FP4 variants in context across tasks, while HCAttention: Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs targets KV‑cache memory for long‑context runs. Researchers have also demonstrated low‑bit execution on general‑purpose CPUs: Highly Optimized Kernels and Fine‑Grained Codebooks for LLM Inference on Arm CPUs presents 4‑bit and 2‑bit kernels with SIMD‑aware packing and measured throughput gains, and Selective Quantization Tuning for ONNX Models (TuneQn) offers a practical model‑selection workflow spanning ONNX Runtime and TVM. Together, the work points to tangible cloud cost relief and credible on‑device deployments—if teams respect deployment realities like runtime support, testing overhead, and quality guardrails.
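To make the "2–3‑bit weights" idea concrete, here is a minimal, generic sketch of per‑group round‑to‑nearest post‑training quantization. It is an illustration only, not the LieQ or DP‑LLM algorithms; the bit width and group size are arbitrary assumptions chosen for the example.

```python
# Minimal sketch of symmetric per-group round-to-nearest weight quantization.
# Generic illustration of post-training quantization, NOT the LieQ or DP-LLM
# methods; bit width and group size below are arbitrary example choices.
import numpy as np

def quantize_per_group(weights: np.ndarray, bits: int = 3, group_size: int = 64):
    """Quantize a float weight tensor to low-bit integers with one scale per group."""
    flat = weights.reshape(-1, group_size)            # assumes size divisible by group_size
    qmax = 2 ** (bits - 1) - 1                        # e.g. 3 for 3-bit symmetric codes
    scales = np.abs(flat).max(axis=1, keepdims=True) / qmax
    scales = np.where(scales == 0, 1.0, scales)       # guard against all-zero groups
    q = np.clip(np.round(flat / scales), -qmax - 1, qmax).astype(np.int8)
    return q.reshape(weights.shape), scales

def dequantize_per_group(q: np.ndarray, scales: np.ndarray, group_size: int = 64):
    """Reconstruct approximate float weights from integer codes and per-group scales."""
    flat = q.reshape(-1, group_size).astype(np.float32) * scales
    return flat.reshape(q.shape)

if __name__ == "__main__":
    w = np.random.randn(256, 256).astype(np.float32)
    q, s = quantize_per_group(w, bits=3, group_size=64)
    w_hat = dequantize_per_group(q, s, group_size=64)
    print("mean abs reconstruction error:", np.abs(w - w_hat).mean())
```

The papers above go well beyond this baseline (layer‑wise sensitivity analysis, runtime precision switching, codebooks, and tuned kernels), but the memory arithmetic is the same: storing 3‑bit codes plus small per‑group scales in place of 16‑bit floats is what drives the footprint reduction.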

Tags: post‑training quantization, dynamic precision, LieQ, +12 more