Model Parallelism
A technique for training or running AI models that are too large for a single GPU by splitting the model across multiple GPUs.
Model parallelism is a strategy for training and running AI models too large to fit on a single GPU by dividing the model across multiple GPUs or machines. Each GPU holds only a portion of the model's parameters, and the devices coordinate to process each input together.
Why model parallelism is necessary
Modern large language models have billions or even trillions of parameters. A model with 70 billion parameters in 16-bit precision requires about 140 GB of memory just for the weights, far exceeding the memory of any single GPU (typically 40-80 GB for high-end models). Training requires even more memory for gradients, optimizer states, and activations. Without parallelism, training or running these models would be impossible.
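That arithmetic can be sketched directly. The numbers below are illustrative assumptions (16-bit weights and gradients, an Adam-style optimizer keeping two fp32 states per parameter); exact totals vary with the optimizer and precision recipe:

```python
# Back-of-envelope memory estimate for a 70-billion-parameter model.
# Assumptions (illustrative): 2-byte (fp16/bf16) weights and gradients,
# Adam-style optimizer keeping two fp32 states (8 bytes) per parameter.
params = 70e9

weights_gb = params * 2 / 1e9   # 140 GB just for the weights
grads_gb   = params * 2 / 1e9   # gradients in the same precision
optim_gb   = params * 8 / 1e9   # optimizer first/second moments in fp32

print(f"inference weights: {weights_gb:.0f} GB")                          # 140 GB
print(f"training total:    {weights_gb + grads_gb + optim_gb:.0f} GB")    # 840 GB, before activations
```

Even the inference-only figure exceeds any single GPU, and the training figure is several times larger still.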
Types of model parallelism
- Tensor parallelism: Splits individual layers across GPUs. A single matrix multiplication is divided so each GPU computes a portion and they share results. This allows very large layers to be distributed.
- Pipeline parallelism: Assigns different layers to different GPUs. Data flows through a "pipeline": GPU 1 processes layers 1-10, GPU 2 processes layers 11-20, and so on. Like an assembly line, multiple batches can be in flight simultaneously.
- Sequence parallelism: Splits the input sequence across GPUs, with each processing a portion of the tokens.
- Expert parallelism: In mixture-of-experts models, different "expert" sub-networks are placed on different GPUs.
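Tensor parallelism is the easiest of these to demonstrate. The sketch below simulates a column-split matrix multiply on CPU with NumPy; the two weight shards stand in for two GPUs, and concatenating the partial outputs plays the role of the gather communication step. Shapes are arbitrary toy sizes:

```python
import numpy as np

# Simulated tensor (column) parallelism: split a linear layer's weight
# matrix column-wise across two "devices", compute each shard locally,
# then gather the partial outputs back into the full result.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))       # activations: batch of 4, hidden size 8
W = rng.standard_normal((8, 6))       # full weight matrix of one layer

W0, W1 = np.split(W, 2, axis=1)       # each "GPU" holds half the columns
Y0 = X @ W0                           # partial output on "GPU 0"
Y1 = X @ W1                           # partial output on "GPU 1"
Y = np.concatenate([Y0, Y1], axis=1)  # gather reassembles the full output

assert np.allclose(Y, X @ W)          # identical to the unsplit computation
```

The split changes where the arithmetic happens, not its result; on real hardware the gather is a collective communication over a fast interconnect, which is why this traffic can become a bottleneck.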
Model parallelism vs data parallelism
Data parallelism keeps a complete copy of the model on each GPU and splits the training data across them. Each GPU processes different examples simultaneously. Model parallelism splits the model itself. In practice, large-scale training uses both: the model is split across groups of GPUs (model parallelism) and multiple such groups process different data batches simultaneously (data parallelism).
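The distinction can be made concrete with a toy model in plain Python (all numbers illustrative):

```python
# Toy 1-D "model": a list of four weights.
weights = [0.5, -1.2, 3.0, 0.7]

# Data parallelism: every replica holds ALL the weights; each replica
# processes a different batch, and the gradients are averaged (an
# all-reduce) so the replicas stay in sync.
grads_replica0 = [0.1, 0.2, -0.1, 0.0]   # gradients from batch 0
grads_replica1 = [0.3, 0.0, 0.1, 0.2]    # gradients from batch 1
avg_grads = [(a + b) / 2 for a, b in zip(grads_replica0, grads_replica1)]

# Model parallelism: each device holds only a SHARD of the weights,
# so no single device ever stores the full model.
shard_gpu0, shard_gpu1 = weights[:2], weights[2:]

print(avg_grads)                 # [0.2, 0.1, 0.0, 0.1]
print(shard_gpu0, shard_gpu1)    # [0.5, -1.2] [3.0, 0.7]
```

Combining the two gives the grid used at scale: each model-parallel group of GPUs holds one sharded copy, and the groups act as data-parallel replicas of each other.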
Engineering challenges
Model parallelism introduces significant complexity. GPUs must communicate constantly to share intermediate results, and this communication can become a bottleneck. Balancing work across GPUs to avoid idle time requires careful design. Frameworks like Megatron-LM, DeepSpeed, and FSDP have made these techniques more accessible, but large-scale distributed training remains an engineering challenge.
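One concrete source of idle time is the pipeline "bubble" at the start and end of each batch, when some stages have nothing to work on. A common first-order estimate, from the GPipe line of work, is that with p pipeline stages and m micro-batches the idle fraction is roughly (p - 1) / (m + p - 1):

```python
# First-order pipeline "bubble" estimate (GPipe-style approximation):
# the fraction of time pipeline stages sit idle while the pipeline
# fills and drains.
def bubble_fraction(stages: int, microbatches: int) -> float:
    return (stages - 1) / (microbatches + stages - 1)

print(bubble_fraction(4, 4))     # 3/7  ~ 0.43: nearly half the time idle
print(bubble_fraction(4, 32))    # 3/35 ~ 0.09: more micro-batches shrink the bubble
```

This is why pipeline schedules chop each batch into many micro-batches: the deeper the pipeline, the more micro-batches are needed to keep every GPU busy.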
Why This Matters
Model parallelism explains how companies train and run the massive AI models that power modern applications. Understanding it helps you appreciate the infrastructure costs behind AI services and why hardware access is a competitive advantage in the AI industry.