Recent advancements in large language models (LLMs) like GPT-3 have unlocked incredible capabilities in natural language processing. However, running these massive models can be extremely computationally intensive, often requiring powerful cloud infrastructure. This article explores the key concepts and techniques for running LLM inference locally on your own hardware.
Model Inference
Before we dive in, it’s important to distinguish between the two main phases of working with LLMs:
- Training: This is the process of teaching the model by feeding it large amounts of data. The model learns patterns and relationships in the data, encoding this knowledge into its internal weights and parameters. Training LLMs is extremely resource-intensive, requiring huge datasets, lots of computing power (often distributed across many GPUs), and a long time. This is typically done by large tech companies or AI research labs.
- Inference: This is the process of using a trained model to generate outputs (like text completions) based on new input prompts. Inference is much less computationally intensive than training, and can often be done on a single GPU or even CPU, depending on the model size and latency requirements. This is what allows individuals to run LLMs locally.
CPU vs GPU for Inference
While inference is less demanding than training, running an LLM locally, especially anything approaching GPT-3's scale of 175 billion parameters, still requires substantial compute resources, particularly memory to store the model weights and to perform the calculations.
CPUs, while versatile, are not optimized for the massively parallel matrix math that neural network inference requires. They also tend to have lower memory bandwidth and higher latency when accessing memory.
GPUs, on the other hand, are purpose-built for rapid parallel processing and have high-bandwidth memory access. They can perform trillions of math operations per second, which makes them ideal for neural network calculations, and their dedicated memory (VRAM) offers far higher bandwidth than system RAM.
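To get a feel for the memory side, here is a quick back-of-the-envelope sketch in Python (weights only; the KV cache and activations add more on top of this):

```python
# Rough memory needed just to hold the model weights, ignoring KV cache and activations.
def weight_memory_gib(n_params_billion: float, bytes_per_param: float) -> float:
    return n_params_billion * 1e9 * bytes_per_param / 1024**3

for label, bytes_per_param in [("FP32", 4.0), ("FP16", 2.0), ("4-bit", 0.5)]:
    print(f"7B model  @ {label}: {weight_memory_gib(7, bytes_per_param):6.1f} GiB")
    print(f"70B model @ {label}: {weight_memory_gib(70, bytes_per_param):6.1f} GiB")
```

At full FP32 precision even a 7B-parameter model needs roughly 26 GiB just for its weights, which is why the quantized formats discussed later matter so much.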
GPU Frameworks
To take advantage of GPUs, you need special programming frameworks; a short sketch of how a deep learning library chooses among these backends follows the list:
- CUDA: NVIDIA’s proprietary parallel computing platform and API. It allows developers to use a CUDA-enabled GPU for general purpose processing (GPGPU). Many deep learning frameworks, like PyTorch and TensorFlow, interface with CUDA for NVIDIA GPU acceleration. However, CUDA only works on NVIDIA GPUs.
- Metal: Apple’s low-level API for GPU programming on Apple devices. It’s optimized for the GPUs in MacBooks, iMacs, iPhones, and other Apple hardware, including Apple silicon.
- OpenCL: An open standard for parallel programming across heterogeneous platforms — CPUs, GPUs, and other accelerators. It allows you to write code that executes on different types of processors. OpenCL is commonly used to run compute on AMD GPUs.
- BLAS: BLAS (Basic Linear Algebra Subprograms) is a standard interface for core linear algebra routines such as vector addition, matrix multiplication, and dot products. Highly optimized implementations exist for both CPUs (OpenBLAS, Intel MKL) and GPUs: NVIDIA provides cuBLAS for CUDA, while AMD offers rocBLAS and the older clBLAS for OpenCL. Deep learning frameworks like PyTorch and TensorFlow rely on these libraries to accelerate their tensor operations.
- HIP: AMD’s Heterogeneous-computing Interface for Portability lets developers write code that runs on both NVIDIA and AMD GPUs. It closely mirrors CUDA’s syntax, allowing for straightforward code conversion between the two platforms, and HIP code can be compiled with NVIDIA’s NVCC or AMD’s ROCm toolchain, making it a versatile choice for cross-platform GPU development. HIP integrates with AMD’s wider ecosystem, including rocBLAS for linear algebra and MIOpen for deep learning.
- Vulkan: A modern, cross-platform API for graphics and compute. While primarily designed for rendering, its compute shaders enable general-purpose GPU computing, and frameworks like ncnn and MNN use Vulkan for faster neural network inference on mobile and embedded devices. Vulkan code is more verbose than CUDA or OpenCL, but its broad hardware support makes it valuable for compute workloads where vendor-specific APIs aren’t an option.
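In practice you rarely program these APIs directly; a deep learning library picks a backend for you. Here is a minimal sketch of how that looks in PyTorch, which exposes CUDA (and AMD's ROCm builds) as the "cuda" device and Apple's Metal as the "mps" backend:

```python
import torch

# Pick the best available accelerator; what is available depends on how PyTorch was built.
if torch.cuda.is_available():             # NVIDIA CUDA (ROCm builds of PyTorch also report "cuda")
    device = torch.device("cuda")
elif torch.backends.mps.is_available():   # Apple silicon GPUs via Metal Performance Shaders
    device = torch.device("mps")
else:
    device = torch.device("cpu")

x = torch.randn(1024, 1024, device=device)
y = x @ x                                  # the matrix multiply runs on the selected device
print(f"Ran a 1024x1024 matmul on: {device}")
```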
Efficient Model Formats
The open source llama.cpp project enables running large language models like LLaMA, GPT-J, GPT-NeoX, and Cerebras-GPT on consumer hardware, even on CPUs. It achieves this by using clever techniques and efficient model formats. Let’s take a closer look at these formats and how they differ.
Quantization and GPTQ
One key technique used by llama.cpp is quantization: reducing the numerical precision of the model’s weights. Neural networks are typically trained using 32-bit floating point (FP32) numbers (or, increasingly, 16-bit formats), which offer high precision but consume a lot of memory. Post-training quantization methods, such as GPTQ and llama.cpp’s own block-wise schemes (Q4_0, Q4_K, and so on), reduce the weights to roughly 4-bit integers (INT4).
This has two main benefits:
- Reduces the model size by roughly 8x, since INT4 weights take about 1/8th the memory of FP32 (plus a small overhead for per-block scaling factors).
- Speeds up inference, mainly because far less data has to be moved from memory; LLM inference is usually memory-bandwidth bound, so smaller weights mean faster token generation.
Methods like GPTQ compensate for rounding error as they quantize (adjusting the weights that have not yet been quantized to offset the error already introduced), which preserves most of the model’s quality despite the dramatic reduction in precision. This is a key innovation that allows huge models to run on limited hardware.
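To make the idea concrete, here is a toy sketch of symmetric, block-wise 4-bit quantization, loosely in the spirit of llama.cpp's Q4_0 blocks. Real schemes, and GPTQ in particular, are more sophisticated, but the memory arithmetic is the same:

```python
import numpy as np

def quantize_block_int4(w: np.ndarray):
    """Quantize one block of FP32 weights to 4-bit integers with a single scale (toy version)."""
    scale = max(float(np.abs(w).max()) / 7.0, 1e-8)       # map the largest weight into [-8, 7]
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, np.float32(scale)

def dequantize_block(q: np.ndarray, scale: np.float32) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(32).astype(np.float32)                # one 32-weight block, as in Q4_0
q, scale = quantize_block_int4(w)
w_hat = dequantize_block(q, scale)

print("max abs error:", float(np.abs(w - w_hat).max()))
print("bytes: FP32 =", w.nbytes, "| packed 4-bit ≈", len(q) // 2 + 4)  # 2 weights/byte + the scale
```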
File Formats
Building on top of quantization, several custom model file formats have emerged that are optimized for efficient inference in different scenarios. Let’s take a look at each one:
GGML is the foundational format and tensor library, named after its author Georgi Gerganov (“GG”) plus “ML” for machine learning. It stores weights in quantized blocks that can be loaded directly into memory and evaluated without a separate decoding step. That simplicity made it widely supported by early llama.cpp-era tools, although it has since been superseded.
GGJT is a later revision of the GGML file format (the “JT” credits contributor Justine Tunney). Its main change was aligning tensor data so the whole file can be memory-mapped (mmap), which lets the operating system page weights in on demand, cuts startup time, and allows multiple processes to share a single copy of the model in memory.
GGUF is the current successor to GGML and GGJT. It is a binary, extensible format that packs the weights and all metadata (tokenizer, architecture, hyperparameters) into a single self-describing file, is designed for fast memory-mapped loading, and is versioned so that new model architectures can be added without breaking older files. It is the format used by llama.cpp, Ollama, and related tools today.
GGMF was a short-lived intermediate revision between the original GGML format and GGJT; its main addition was a version number in the file header. You are unlikely to encounter it today outside of very old model files.
EXL2 is the quantization format used by the ExLlamaV2 library. Unlike the GGML-family formats, it is GPU-focused and supports mixed bit-widths within a single model, averaging anywhere from roughly 2 to 8 bits per weight, with more bits allocated to the layers that are most sensitive to quantization error. That makes it a popular choice for fitting large models into limited VRAM while keeping inference fast on the GPU.
So, how do you choose between these formats? It depends on your specific needs and constraints (a short loading example follows the list):
- For llama.cpp, Ollama, and most local inference tools today, GGUF is the standard and the safest choice: a single self-describing file with fast, memory-mapped loading.
- GGML, GGMF, and GGJT are legacy formats; you will mostly encounter them in older downloads, and conversion scripts exist to upgrade them to GGUF.
- If you are running on a GPU with ExLlamaV2 and need to squeeze a large model into limited VRAM, EXL2’s mixed-precision quantization is a strong option.
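As a concrete example of working with GGUF, here is a minimal sketch using the llama-cpp-python bindings. The model path is hypothetical; it assumes you have already downloaded a 4-bit quantized GGUF file:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(
    model_path="models/llama-2-7b.Q4_K_M.gguf",  # hypothetical local path to a quantized model
    n_ctx=2048,        # context window size
    n_gpu_layers=-1,   # offload all layers to a GPU if one is available; set 0 for CPU-only
)

out = llm("Q: In one sentence, what is a GGUF file? A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"].strip())
```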
ONNX Format
ONNX (Open Neural Network Exchange) is an open format for representing machine learning models, including LLMs. It allows models to be transferred between different frameworks and tools.
The main difference between ONNX and formats like GGUF is specialization. ONNX is a general-purpose interchange format: exported models typically keep their original FP32 or FP16 weights, and while ONNX Runtime does support quantization, the format is not built around the aggressive, LLM-specific 4-bit schemes that GGUF and EXL2 use. As a result, a plain ONNX export usually consumes more memory and compute for local LLM inference.
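As an illustration of the interchange idea, Hugging Face's Optimum library can export a Hub model to ONNX and run it with ONNX Runtime. The sketch below uses a deliberately tiny model (gpt2) so it runs anywhere; the class names reflect my understanding of Optimum's API and are worth verifying against its documentation:

```python
# pip install "optimum[onnxruntime]" transformers
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model_id = "gpt2"  # tiny model, used only to keep the example lightweight
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = ORTModelForCausalLM.from_pretrained(model_id, export=True)  # convert to ONNX on the fly

inputs = tokenizer("ONNX makes it possible to move models between frameworks because",
                   return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```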
Types of Large Language Models
Modern Large Language Models (LLMs) share their foundation in Google’s 2017 transformer architecture, but have evolved into distinct variants through different training methods, architectures, and optimization approaches. This architectural diversity isn’t just about capabilities — it directly impacts how models can be deployed in production. The choice of model architecture influences the entire technology stack, from required hardware (GPU vs. CPU optimization) to serving frameworks (like vLLM or TGI), and even affects which optimization techniques can be applied.
Organizations must consider these deployment implications alongside model performance when selecting an LLM for production use, as the architecture effectively determines both the operational costs and technical requirements of the entire inference pipeline.
Decoder-only Transformers
- Architecture: Process text sequentially by predicting the next token based on previous context
- Key Features:
— Autoregressive generation
— Efficient for text generation tasks
— Generally larger parameter counts
- Examples (a short generation sketch follows this list):
— GPT-4 (OpenAI)
— DeepSeek-R1 (specialized in reasoning and code)
— Qwen2 (Alibaba)
— Llama-3.1 (Meta, focuses on efficiency)
— Mistral (optimized with grouped-query attention)
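The autoregressive loop is easy to see with the Hugging Face transformers library. A minimal sketch using a small decoder-only model (gpt2, chosen only so the example runs on any machine):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # small decoder-only model, used purely for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Decoder-only models generate text one token at a time, which means"
inputs = tokenizer(prompt, return_tensors="pt")

# generate() repeatedly predicts the next token and appends it to the running context.
output_ids = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.9)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```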
Encoder-Decoder Models
- Architecture: Process entire input sequence before generating output
- Key Features:
— Bidirectional context understanding
— Well-suited for transformation tasks
— Natural handling of structured outputs
- Examples (a short sketch follows this list):
— T5 (Google)
— BART (Facebook)
— Flan-T5 (Google)
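A minimal sketch with a small encoder-decoder model (google/flan-t5-small, chosen only to keep the example light):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "google/flan-t5-small"  # small encoder-decoder model, used purely for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# The encoder reads the entire input; the decoder then generates the output sequence.
inputs = tokenizer("Translate to German: The model reads the whole input before answering.",
                   return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```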
Sentence Transformers
- Architecture: Specialized in creating dense vector representations of text
- Key Features:
— Optimized for semantic similarity
— Smaller and more efficient
— Focus on embedding generation
- Examples (a short embedding sketch follows this list):
— all-MiniLM (Microsoft)
— SBERT
— SimCSE
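A minimal sketch with the sentence-transformers library (all-MiniLM-L6-v2 is one of the smaller, widely used embedding models):

```python
from sentence_transformers import SentenceTransformer, util  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")   # compact embedding model
sentences = [
    "How do I run a large language model on my laptop?",
    "A guide to local LLM inference on consumer hardware",
    "The best pasta recipes for a quick dinner",
]
embeddings = model.encode(sentences)               # one dense vector per sentence

# Semantically related sentences score a higher cosine similarity than unrelated ones.
print(util.cos_sim(embeddings[0], embeddings[1]).item())
print(util.cos_sim(embeddings[0], embeddings[2]).item())
```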
Hybrid/Specialized Models
- Architecture: Combine multiple approaches or focus on specific domains
- Key Features:
— Task-specific optimizations
— Often smaller but more focused
— Balance between performance and efficiency
- Examples:
— Gemma-2 (Google’s efficient hybrid approach)
— Phi-3 (Microsoft’s compact but powerful model)
— Code Llama (Meta, specialized for programming)
Inference Techniques
In recent years, several frameworks and techniques have been developed to optimize the local inference of large language models. These tools aim to make it possible to run LLMs on consumer-grade hardware, such as a single GPU or even a CPU, by employing various optimization techniques. Let’s take a look at some of the most prominent ones.
vLLM
vLLM is an open-source, high-throughput inference and serving engine for LLaMA-family and other transformer-based models. It leverages NVIDIA’s CUDA platform (with growing support for other backends such as AMD ROCm) to parallelize computation across the thousands of cores in a GPU.
vLLM employs several key techniques to maximize performance:
- PagedAttention: vLLM manages the attention key/value cache in fixed-size blocks, much like virtual-memory pages, which sharply reduces memory fragmentation and lets far more concurrent requests fit in VRAM.
- Continuous batching: incoming requests are added to and removed from the running batch on the fly, keeping the GPU busy instead of waiting for the slowest sequence in a static batch.
- Optimized kernels and reduced precision: fused CUDA kernels and FP16/BF16 (or optionally quantized) weights increase throughput without a meaningful loss in accuracy.
These optimizations allow vLLM to achieve very high inference speeds on consumer GPUs. It’s a good choice when you have a powerful GPU and need the fastest possible inference.
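A minimal sketch of vLLM's offline Python API; the model id is just an example, and any Hugging Face causal LM that fits your GPU will do:

```python
from vllm import LLM, SamplingParams  # pip install vllm

# Example model id; swap in any causal LM that fits in your GPU's memory.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(
    ["Explain in one paragraph why paging the KV cache helps GPU inference."],
    params,
)
print(outputs[0].outputs[0].text)
```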
TGI
TGI (Text Generation Inference) is Hugging Face’s open-source serving toolkit for large language models. It wraps models from the Hugging Face Hub in a production-ready HTTP server with GPU acceleration, token streaming, and observability built in.
TGI’s key features include tensor parallelism for splitting a model across multiple GPUs, continuous batching of incoming requests, and support for quantized weights (for example GPTQ, AWQ, and bitsandbytes), all without requiring changes to the model code.
TGI is a good choice when you want to serve an open-weight model behind a simple API with solid performance, whether on a single GPU or sharded across several, especially if you already work within the Hugging Face ecosystem.
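TGI is usually started from its Docker image and then queried over HTTP. Assuming a server is already running locally on port 8080 (a deployment-specific detail), a minimal client sketch against its /generate endpoint looks like this:

```python
import requests

# Assumes a TGI server is already running locally and listening on port 8080.
resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Explain tensor parallelism in one sentence:",
        "parameters": {"max_new_tokens": 64, "temperature": 0.7},
    },
    timeout=60,
)
print(resp.json()["generated_text"])
```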
TensorRT LLM
TensorRT LLM is an optimization toolkit developed by NVIDIA to accelerate LLM inference on NVIDIA GPUs. It leverages the TensorRT library, which is a high-performance deep learning inference optimizer and runtime.
TensorRT LLM takes a trained model, typically a Hugging Face or PyTorch checkpoint, and compiles it into an optimized TensorRT engine for the target NVIDIA GPU, applying techniques like:
- Layer fusion: TensorRT fuses compatible layers in the model graph to reduce the number of GPU kernel launches.
- Precision calibration: TensorRT can automatically calibrate the model to use reduced precision (e.g., FP16 or INT8) where possible, without sacrificing accuracy.
- Kernel auto-tuning: TensorRT automatically tunes the GPU kernels for the specific model and hardware to maximize performance.
TensorRT LLM is a good choice when you’re deploying a model on NVIDIA hardware and want an easy way to get the best possible performance.
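Recent TensorRT LLM releases also ship a high-level Python "LLM API" that builds the engine and runs generation in a few lines. The sketch below follows that API as I understand it; the class names and the example model id are assumptions worth verifying against NVIDIA's current documentation:

```python
# Sketch of TensorRT-LLM's high-level LLM API (verify names against the current docs).
from tensorrt_llm import LLM, SamplingParams

# Example model id; the optimized engine is built for your specific GPU on first load.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
params = SamplingParams(max_tokens=64, temperature=0.7)

outputs = llm.generate(["Explain kernel auto-tuning in one sentence."], params)
print(outputs[0].outputs[0].text)
```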
Triton VLLM
Triton VLLM pairs NVIDIA’s Triton Inference Server with vLLM as its execution engine for large language models. Triton is an open-source inference serving platform that lets you deploy models from many frameworks (TensorFlow, PyTorch, ONNX, and dedicated LLM backends) as a scalable, performant web service.
Triton VLLM extends Triton with capabilities specifically designed for large language models:
- Multi-GPU inference: the vLLM engine can shard a model across several GPUs with tensor parallelism, allowing you to serve models larger than a single GPU’s memory capacity.
- Continuous batching: requests that arrive while others are in flight are merged into the running batch, improving throughput without padding every sequence to the maximum length.
- Production serving features: Triton adds HTTP/gRPC endpoints, metrics, and concurrent model execution on top of the vLLM engine.
Triton VLLM is a good choice when you need to serve a large language model as a web service, especially if you expect a high volume of requests and have multiple GPUs available.
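Once the server is running, clients talk to it over HTTP or gRPC. The sketch below assumes Triton's generate extension is enabled and that the model was registered under the name vllm_model; both are deployment-specific assumptions:

```python
import requests

# Assumes a local Triton server with the vLLM backend and a model named "vllm_model".
resp = requests.post(
    "http://localhost:8000/v2/models/vllm_model/generate",
    json={
        "text_input": "Summarize what continuous batching does:",
        "parameters": {"stream": False, "max_tokens": 64},
    },
    timeout=60,
)
print(resp.json()["text_output"])
```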
DeepSpeed MII
DeepSpeed MII (Model Implementations for Inference) is a Microsoft-developed library for fast, low-cost model inference. It’s part of the larger DeepSpeed project, which provides a variety of tools for training and inference of deep learning models.
DeepSpeed MII employs several techniques to enable efficient inference of large models:
- Tensor parallelism: DeepSpeed MII can split a model across multiple GPUs, similar to Triton VLLM. This allows it to handle models larger than a single GPU’s memory.
- ZeRO-Inference: a technique that keeps model weights in CPU (or even NVMe) memory and streams them to the GPU as they are needed. This allows very large models to run on a single GPU, at the cost of some added latency.
- Operator optimization: DeepSpeed MII includes optimized implementations of common operators used in transformer models, such as attention and feedforward layers.
DeepSpeed MII is a good choice when you’re working with very large models and need to parallelize across multiple GPUs or optimize for limited GPU memory.
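A minimal sketch of MII's pipeline API; the model id is just an example, and the exact call signature is worth checking against the DeepSpeed-MII documentation:

```python
import mii  # pip install deepspeed-mii

# Example model id; MII loads it with its optimized inference kernels.
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
responses = pipe(["Explain ZeRO-Inference offloading in one sentence."], max_new_tokens=64)
print(responses)
```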
ollama
ollama is an open-source tool built on top of llama.cpp that takes a different approach to LLM inference: instead of exposing low-level optimization knobs, it focuses on making local models easy to run, bundling model downloads, a command-line interface, and a local REST API into a single application for macOS, Linux, and Windows.
Models are distributed through Ollama’s library as pre-quantized GGUF files (4-bit variants are the default for most models), and a simple Modelfile format lets you customize system prompts, parameters, and adapters on top of a base model.
At inference time, Ollama hands execution to llama.cpp, which runs the quantized weights efficiently on the CPU and automatically offloads layers to a GPU when one is available (Metal on Apple silicon, CUDA on NVIDIA, ROCm on AMD).
Ollama also takes care of the practical details around local inference, such as:
- Model management: pulling, updating, and removing models with simple commands (ollama pull, ollama run, ollama rm).
- A local API: a REST endpoint (on port 11434 by default) plus an OpenAI-compatible endpoint, so existing applications and client libraries can talk to local models.
Ollama is a good choice when you want the simplest way to run a language model locally, whether on a laptop CPU, Apple silicon, or a single consumer GPU, and don’t need a heavily tuned serving stack.
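A minimal sketch of calling Ollama's local REST API. It assumes the Ollama service is running and that a model has already been pulled (for example with `ollama pull llama3.1`):

```python
import requests

# Assumes the Ollama service is running locally and the model has been pulled beforehand.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1",
        "prompt": "In one sentence, what is quantization?",
        "stream": False,   # return a single JSON object instead of a token stream
    },
    timeout=120,
)
print(resp.json()["response"])
```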
llama.cpp server
Like Ollama, llama.cpp targets commodity hardware, running models on CPUs and optionally offloading layers to a GPU. The project includes a lightweight HTTP server (llama-server) built directly on the llama.cpp inference engine, which turns a local GGUF model into a web service suitable for small production deployments.
Key features of llama.cpp server include:
- OpenAI-compatible API: chat-completion and completion endpoints that many existing clients and SDKs can use without modification.
- Parallel request handling: multiple processing “slots” serve requests concurrently, with continuous batching to keep the hardware busy.
- Partial GPU offload: any number of model layers can be placed on a GPU, trading VRAM for speed on machines with modest graphics cards.
- Prompt (KV-cache) reuse: previously computed prompt prefixes can be reused across requests to reduce latency.
Under the hood, the server inherits llama.cpp’s performance work (a minimal client sketch follows this list):
- Memory-mapped (mmap) model loading, so weights are paged in on demand and can be shared between processes
- Hand-optimized CPU kernels using SIMD instructions (AVX2/AVX-512 on x86, NEON on ARM)
- Quantized GGUF weights that shrink both the memory footprint and the bandwidth required per token
- Batched decoding that amortizes per-token overhead across concurrent requests
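Assuming a server has been started with something like `llama-server -m model.gguf --port 8080` (binary name and flags vary slightly between llama.cpp versions), a minimal client sketch against its OpenAI-compatible endpoint looks like this:

```python
import requests

# Assumes llama.cpp's HTTP server is running locally with its OpenAI-compatible API on port 8080.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "local-model",  # the server answers with whichever GGUF it was started with
        "messages": [{"role": "user", "content": "Give one tip for running LLMs on a CPU."}],
        "max_tokens": 128,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```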
Practical Recommendations
With all these options available, how do you choose the right tool for your specific use case? Here are some general recommendations:
1. If you have a powerful NVIDIA GPU and need the fastest possible inference, use vLLM. Its PagedAttention memory management and continuous batching deliver excellent throughput on NVIDIA hardware.
2. If you have a model that’s too large to fit on a single GPU, use TGI or DeepSpeed MII. TGI can shard the model across several GPUs with tensor parallelism, while DeepSpeed MII can additionally offload weights to CPU memory via ZeRO-Inference, so you still get GPU acceleration for the parts that fit.
3. If you’re deploying a model on NVIDIA hardware and want an easy way to get the best performance, use TensorRT LLM. It will automatically optimize your model for your specific GPU.
4. If you have powerful NVIDIA GPUs and need to serve a large model as a web service, use Triton VLLM. It will allow you to scale your model across multiple GPUs and handle a high volume of requests.
5. If you need to run a large model on CPU-only hardware, use Ollama or llama.cpp directly. Quantized GGUF models and llama.cpp’s optimized CPU kernels will give you the best performance on commodity machines.
6. If you want a lightweight, self-hosted web service on modest hardware, use llama.cpp server. It combines llama.cpp’s efficient quantized inference with an OpenAI-compatible API and parallel request handling for low-latency local deployments.
7. If you’re working with extremely large models (100B+ parameters), consider combining techniques: tensor parallelism across multiple GPUs (via TGI, vLLM, or DeepSpeed MII) together with aggressive quantization, and ZeRO-Inference-style CPU offloading when GPU memory still falls short.
Conclusion
Running large language model inference locally is becoming increasingly viable thanks to a wave of new optimization techniques and frameworks. By leveraging the parallel processing power of GPUs and using clever memory management and model compression techniques, it’s now possible to run models with tens of billions of parameters on consumer hardware. As these tools continue to evolve, local inference will become ever more powerful and accessible, opening up new possibilities for privacy-preserving, low-latency, and offline applications of large language models.