    Huawei's new open source technique shrinks LLMs to make them run on less powerful, less expensive hardware

    October 4, 2025



    Huawei’s Computing Systems Lab in Zurich has introduced a new open-source quantization method for large language models (LLMs) aimed at reducing memory demands without sacrificing output quality.

    The technique, called SINQ (Sinkhorn-Normalized Quantization), is designed to be fast, calibration-free, and easy to integrate into existing model workflows. The Huawei research team has released the code on GitHub and Hugging Face under a permissive, enterprise-friendly Apache 2.0 license, allowing organizations to use it, modify it, and deploy it commercially at no cost.

    Across models of different sizes, SINQ cuts memory usage by 60–70%, depending on architecture and bit-width.

    This enables models that would previously require >60 GB of memory to run on ~20 GB setups—a critical enabler for running large models on a single high-end GPU or even multi-GPU consumer-grade setups.
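
    For a rough sense of the arithmetic, weight memory scales linearly with bit-width, so cutting the bits per weight cuts the footprint proportionally. The sketch below uses an illustrative, hypothetical 32-billion-parameter model (not a figure from the SINQ paper) to show how a >60 GB FP16 checkpoint drops toward the ~20 GB range at 4 bits, before the small overhead added by scale factors:

```python
# Back-of-the-envelope memory for model weights only (ignores activations and KV cache).
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9  # decimal gigabytes

# Hypothetical ~32B-parameter model, chosen only for illustration:
print(weight_memory_gb(32, 16))  # FP16 baseline: ~64 GB
print(weight_memory_gb(32, 4))   # 4-bit weights: ~16 GB, plus overhead for scale factors
```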

    This makes it possible to run models that previously needed high-end enterprise GPUs, such as an NVIDIA A100 80GB (around $19,000) or an H100 (often more than $30,000), on significantly more affordable hardware, such as a single NVIDIA GeForce RTX 4090 (around $1,600).

    For teams using cloud infrastructure, the savings are similarly tangible. A100-based instances often cost $3–4.50 per hour, while 24 GB GPUs like the RTX 4090 are available on many platforms for $1–1.50 per hour.

    Over time, especially for extended inference workloads, this difference can add up to thousands of dollars in cost reductions, while also unlocking LLM deployment on smaller clusters, local workstations, or consumer-grade setups previously constrained by memory.
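
    As a back-of-the-envelope illustration using the hourly rates quoted above (actual prices vary by provider and region):

```python
# Rough monthly cost comparison at the hourly rates cited in the article.
hours_per_month = 24 * 30
a100_hourly = 3.50       # mid-range of the $3-4.50/hr A100 figure
rtx4090_hourly = 1.25    # mid-range of the $1-1.50/hr RTX 4090 figure

savings = (a100_hourly - rtx4090_hourly) * hours_per_month
print(f"~${savings:,.0f} saved per GPU per month")  # roughly $1,600
```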

    Tackling the Memory Challenge of LLMs

    Running large models often requires compromises between performance and size.

    In practice, neural networks use floating-point numbers to represent both weights and activations. A floating-point number can express a wide range of values (very small, very large, with fractional parts).

    This flexibility is helpful because during training and inference, weights and activations can vary in scale dramatically. Using floating-point lets the model adjust precisely. (For example, a weight could be 0.0023 or 123.45, and floating-point can capture both with decent precision.)
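
    A small NumPy illustration of the point, using the same example values as above:

```python
import numpy as np

# float16 covers a wide dynamic range, so tiny and large weights coexist in one tensor.
w = np.array([0.0023, 123.45], dtype=np.float16)
print(w)  # both values survive with only small rounding error

# int8, by contrast, holds only whole numbers in [-128, 127], so mapping float weights
# onto it requires a scale factor -- which is exactly what quantization introduces.
print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)  # -128 127
```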

    Quantization — a method that reduces the precision of model weights — offers a practical path to lower memory usage, but typically comes with trade-offs in model quality, especially at 4-bit precision and below.

    When you convert those floating-point values into lower-precision formats (like 8-bit integers), you’re approximating them.

    That means you store and compute with fewer bits, which is faster and more memory-efficient — but you risk losing fidelity (i.e. introducing small errors).

    The trick is to do the conversion carefully so the model’s behavior stays nearly the same, even though internally it’s working with rougher approximations of those weights and activations.
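
    As a concrete illustration, the snippet below performs naive round-to-nearest (RTN) int8 quantization of a random weight matrix with a single per-tensor scale. This is the simple baseline that more sophisticated methods such as SINQ improve on, shown only to make the approximation step tangible, not to represent any particular library's implementation:

```python
import numpy as np

# Naive round-to-nearest (RTN) int8 quantization with a single per-tensor scale.
rng = np.random.default_rng(0)
weights = rng.normal(scale=0.02, size=(4, 8)).astype(np.float32)

scale = np.abs(weights).max() / 127.0                      # map the largest weight to +/-127
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequantized = q.astype(np.float32) * scale                 # the "rougher approximation"

print("max abs error:", np.abs(weights - dequantized).max())
```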

    SINQ addresses these pain points by introducing a plug-and-play solution that delivers strong performance even in low-precision settings—without requiring calibration data or inter-layer dependencies.

    How SINQ Works

    The SINQ approach introduces two main innovations:

    Dual-Axis Scaling: Instead of using a single scale factor for quantizing a matrix, SINQ uses separate scaling vectors for rows and columns. This helps mitigate the effects of outliers and allows the quantization error to be distributed more flexibly across the matrix.

    Sinkhorn-Knopp-Style Normalization: A fast algorithm inspired by Sinkhorn iterations is used to normalize the standard deviations of rows and columns in a matrix. This helps minimize what the authors call “matrix imbalance,” a new proxy metric shown to be more effective than alternatives like kurtosis for improving quantization performance.
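
    To make the two ideas concrete, here is a toy NumPy sketch of the general recipe described above, not the authors' implementation: balance row and column spreads with a few Sinkhorn-style iterations, quantize the balanced matrix at low precision, then fold the row and column scale vectors back in when dequantizing. The iteration count, the symmetric 4-bit grid, and the injected outlier are all illustrative choices:

```python
import numpy as np

def sinkhorn_like_balance(W, iters=10):
    """Alternately rescale rows and columns so their standard deviations even out
    (a toy stand-in for SINQ's Sinkhorn-Knopp-style normalization)."""
    row_scale = np.ones(W.shape[0])
    col_scale = np.ones(W.shape[1])
    B = W.copy()
    for _ in range(iters):
        r = B.std(axis=1) + 1e-8           # per-row spread
        B = B / r[:, None]
        row_scale *= r
        c = B.std(axis=0) + 1e-8           # per-column spread
        B = B / c[None, :]
        col_scale *= c
    return B, row_scale, col_scale

def quantize_4bit(B):
    """Round-to-nearest quantization of the balanced matrix to a symmetric 4-bit grid."""
    scale = np.abs(B).max() / 7.0
    q = np.clip(np.round(B / scale), -7, 7)
    return q * scale                        # dequantized values

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
W[3, 5] = 25.0                              # an outlier that would dominate a single-scale grid

B, r, c = sinkhorn_like_balance(W)
W_hat = r[:, None] * quantize_4bit(B) * c[None, :]   # undo the dual-axis scaling after quantization
print("mean abs reconstruction error:", np.abs(W - W_hat).mean())
```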

    The combination of these two features allows SINQ to outperform other calibration-free techniques such as Round-To-Nearest (RTN), HQQ, and Hadamard-based quantization across multiple benchmarks.

    Performance and Compatibility

    SINQ has been evaluated across a wide range of architectures and models, including the Qwen3 series, LLaMA, and DeepSeek.

    On benchmarks like WikiText2 and C4, SINQ consistently reduces perplexity and flip rates compared to baseline methods, often approaching or matching the performance of calibrated solutions.

    It also supports non-uniform quantization schemes such as NF4 and can be combined with calibration methods like AWQ, leading to the variant A-SINQ. In calibrated settings, A-SINQ further narrows the gap with full-precision models.

    In terms of runtime efficiency, SINQ quantizes models roughly twice as fast as HQQ and over 30 times faster than AWQ. This makes it well-suited for both research and production environments where quantization time is a practical constraint.

    Open Source and Easy to Use

    Huawei has released SINQ as an open-source project under the Apache 2.0 license, with implementation instructions and reproducibility tools available in the GitHub repository.

    The repository includes support for quantizing Hugging Face models with just a few lines of code, as well as tools for saving and reloading quantized weights. Default settings offer a balance between memory savings and accuracy, and users can customize parameters like bit-width, tiling strategy, and group size based on their needs.
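
    The exact entry points are defined in the repository itself; the snippet below is only a hypothetical sketch of what such a "few lines of code" workflow typically looks like for a Hugging Face model. The quantization call is commented out, and `quantize_model`, `bits`, and `group_size` are stand-ins for whatever names and parameters the SINQ code actually exposes:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a full-precision Hugging Face checkpoint as usual.
model_name = "Qwen/Qwen3-8B"  # example id from the Qwen3 family mentioned in the evaluation
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical quantization step: the real function name and parameters
# (bit-width, tiling strategy, group size) are defined by the SINQ repository.
# from sinq import quantize_model
# model = quantize_model(model, bits=4, group_size=64)

# After quantization, the model is used like any other transformers model.
inputs = tokenizer("Quantization reduces memory by", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```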

    The authors also provide evaluation integration via the lm-eval library and plan to release pre-quantized models on the Hugging Face Hub in the near future.

    Looking Ahead

    With growing demand for running large models on consumer-grade hardware, quantization is becoming an essential tool. SINQ aims to lower the entry barrier for LLM deployment, enabling developers and researchers to efficiently shrink models without major trade-offs in quality or compatibility.

    Further updates—including integration with Hugging Face Transformers and pre-quantized model releases—are planned, making this a project to watch in the quantization space.


