Model Parallelism
A technique for training or running AI models that are too large for a single GPU by splitting the model across multiple GPUs.
Model parallelism is a strategy for training and running AI models too large to fit on a single GPU by dividing the model across multiple GPUs or machines. Each GPU holds only a portion of the model's parameters, and the devices coordinate to process each input together.
Why model parallelism is necessary
Modern large language models have billions or even trillions of parameters. A model with 70 billion parameters in 16-bit precision requires about 140 GB of memory just for the weights, far exceeding the memory of any single GPU (typically 40-80 GB for high-end models). Training requires even more memory for gradients, optimizer states, and activations. Without parallelism, training or running these models would be impossible.
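That arithmetic can be sketched directly. The numbers below are illustrative assumptions (16-bit weights and gradients, an Adam-style optimizer keeping two fp32 states per parameter); exact totals vary with the optimizer and precision recipe:

```python
# Back-of-envelope memory estimate for a 70-billion-parameter model.
# Assumptions (illustrative): 2-byte (fp16/bf16) weights and gradients,
# Adam-style optimizer keeping two fp32 states (8 bytes) per parameter.
params = 70e9

weights_gb = params * 2 / 1e9   # 140 GB just for the weights
grads_gb   = params * 2 / 1e9   # gradients in the same precision
optim_gb   = params * 8 / 1e9   # optimizer first/second moments in fp32

print(f"inference weights: {weights_gb:.0f} GB")                          # 140 GB
print(f"training total:    {weights_gb + grads_gb + optim_gb:.0f} GB")    # 840 GB, before activations
```

Even the inference-only figure exceeds any single GPU, and the training figure is several times larger still.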
Types of model parallelism
- Tensor parallelism: Splits individual layers across GPUs. A single matrix multiplication is divided so each GPU computes a portion and they share results. This allows very large layers to be distributed.
- Pipeline parallelism: Assigns different layers to different GPUs. Data flows through a "pipeline": GPU 1 processes layers 1-10, GPU 2 processes layers 11-20, and so on. Like an assembly line, multiple batches can be in flight simultaneously.
- Sequence parallelism: Splits the input sequence across GPUs, with each processing a portion of the tokens.
- Expert parallelism: In mixture-of-experts models, different "expert" sub-networks are placed on different GPUs.
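Tensor parallelism is the easiest of these to demonstrate. The sketch below simulates a column-split matrix multiply on CPU with NumPy; the two weight shards stand in for two GPUs, and concatenating the partial outputs plays the role of the gather communication step. Shapes are arbitrary toy sizes:

```python
import numpy as np

# Simulated tensor (column) parallelism: split a linear layer's weight
# matrix column-wise across two "devices", compute each shard locally,
# then gather the partial outputs back into the full result.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))       # activations: batch of 4, hidden size 8
W = rng.standard_normal((8, 6))       # full weight matrix of one layer

W0, W1 = np.split(W, 2, axis=1)       # each "GPU" holds half the columns
Y0 = X @ W0                           # partial output on "GPU 0"
Y1 = X @ W1                           # partial output on "GPU 1"
Y = np.concatenate([Y0, Y1], axis=1)  # gather reassembles the full output

assert np.allclose(Y, X @ W)          # identical to the unsplit computation
```

The split changes where the arithmetic happens, not its result; on real hardware the gather is a collective communication over a fast interconnect, which is why this traffic can become a bottleneck.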
Model parallelism vs data parallelism
Data parallelism keeps a complete copy of the model on each GPU and splits the training data across them. Each GPU processes different examples simultaneously. Model parallelism splits the model itself. In practice, large-scale training uses both: the model is split across groups of GPUs (model parallelism) and multiple such groups process different data batches simultaneously (data parallelism).
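The distinction can be made concrete with a toy model in plain Python (all numbers illustrative):

```python
# Toy 1-D "model": a list of four weights.
weights = [0.5, -1.2, 3.0, 0.7]

# Data parallelism: every replica holds ALL the weights; each replica
# processes a different batch, and the gradients are averaged (an
# all-reduce) so the replicas stay in sync.
grads_replica0 = [0.1, 0.2, -0.1, 0.0]   # gradients from batch 0
grads_replica1 = [0.3, 0.0, 0.1, 0.2]    # gradients from batch 1
avg_grads = [(a + b) / 2 for a, b in zip(grads_replica0, grads_replica1)]

# Model parallelism: each device holds only a SHARD of the weights,
# so no single device ever stores the full model.
shard_gpu0, shard_gpu1 = weights[:2], weights[2:]

print(avg_grads)                 # [0.2, 0.1, 0.0, 0.1]
print(shard_gpu0, shard_gpu1)    # [0.5, -1.2] [3.0, 0.7]
```

Combining the two gives the grid used at scale: each model-parallel group of GPUs holds one sharded copy, and the groups act as data-parallel replicas of each other.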
Engineering challenges
Model parallelism introduces significant complexity. GPUs must communicate constantly to share intermediate results, and this communication can become a bottleneck. Balancing work across GPUs to avoid idle time requires careful design. Frameworks like Megatron-LM, DeepSpeed, and FSDP have made these techniques more accessible, but large-scale distributed training remains an engineering challenge.
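One concrete source of idle time is the pipeline "bubble" at the start and end of each batch, when some stages have nothing to work on. A common first-order estimate, from the GPipe line of work, is that with p pipeline stages and m micro-batches the idle fraction is roughly (p - 1) / (m + p - 1):

```python
# First-order pipeline "bubble" estimate (GPipe-style approximation):
# the fraction of time pipeline stages sit idle while the pipeline
# fills and drains.
def bubble_fraction(stages: int, microbatches: int) -> float:
    return (stages - 1) / (microbatches + stages - 1)

print(bubble_fraction(4, 4))     # 3/7  ~ 0.43: nearly half the time idle
print(bubble_fraction(4, 32))    # 3/35 ~ 0.09: more micro-batches shrink the bubble
```

This is why pipeline schedules chop each batch into many micro-batches: the deeper the pipeline, the more micro-batches are needed to keep every GPU busy.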
Why This Matters
Model parallelism explains how companies train and run the massive AI models that power modern applications. Understanding it helps you appreciate the infrastructure costs behind AI services and why hardware access is a competitive advantage in the AI industry.