As large language models continue to transform everything from research workflows to production pipelines, choosing the right fine-tuning framework can make all the difference in performance, resource utilization, and ease of experimentation. In our previous article, we showed how fine-tuning bridges the gap between a model’s general knowledge and specialized requirements—unlocking higher accuracy, alignment, and domain relevance.
Axolotl, Unsloth, and Torchtune each offer distinct advantages, from Unsloth’s impressive single-GPU speedups and memory savings to Axolotl’s flexibility and Torchtune’s tight integration with the PyTorch ecosystem. In this article, we’ll explore how each toolkit handles key considerations like training throughput, VRAM efficiency, final model quality, ease of setup, and multi-GPU scaling—helping you decide which best meets your specific fine-tuning needs.
Users can also rent GPUs from Hyperbolic to run their own fine-tuning experiments.
Axolotl
Axolotl is a free, open-source toolkit that lets users scale and customize LLMs across many model families. Supported architectures include Llama, Mistral, Qwen, Gemma, Microsoft Phi, Falcon, MPT, Cerebras, XGen, RWKV, and EleutherAI's Pythia, among other Hugging Face-compatible models.
Axolotl delivers solid performance for fine-tuning LLMs but tends to be slightly slower compared to Torchtune and Unsloth due to its additional abstraction layers that wrap Hugging Face Transformers. It uses best practices like FlashAttention, gradient checkpointing, and memory-efficient training defaults.
New Model Support and Feature Updates
Axolotl’s latest releases (v0.8.x in 2025) added support for Meta’s Llama 3 and the new Llama 4 models (including Llama 4 multimodal). It can fine-tune popular open models like Mistral 7B, Falcon, Pythia, Google’s Gemma 3 series, Microsoft Phi-2/Phi-3, and more. Critically, Axolotl introduced sequence parallelism via Ring FlashAttention to enable long-context fine-tuning: sequences can be distributed across GPUs, allowing near-linear scaling of context length without running out of memory on a single device. This means Axolotl can handle training with very large context windows (e.g. 32k or beyond) by splitting each sequence across multiple GPUs, complementing its existing multi-GPU strategies (FSDP and DeepSpeed ZeRO). Axolotl also rolled out a multimodal fine-tuning beta, with built-in recipes for vision-and-language models like LLaVA-1.5, Mistral Small 3.1 (vision), Mllama, Pixtral, and Gemma 3 Vision.
Other notable new features include support for the REX learning rate scheduler (for potentially faster convergence), Cut Cross-Entropy (CCE) loss for certain model types (reducing loss-computation memory and improving training stability on models like Cohere or Gemma), and Liger kernel support for more efficient fine-tuning of models such as Gemma 3. The Axolotl CLI and config system have also been refined: it still uses simple YAML configs and now supports custom tokenizer settings (reserved tokens) and launching a distributed vLLM server for faster data generation during RLHF loops. In short, Axolotl’s ecosystem coverage is extensive: if a new open-source LLM appears (e.g. Llama 4 or Qwen2.5), Axolotl is usually quick to offer a config or support patch for it.
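For concreteness, here is a minimal sketch of driving such a run from Python: a QLoRA fine-tune with long-context sequence parallelism enabled. The YAML keys mirror Axolotl's documented config options, but exact names and defaults shift between releases, so treat the example configs in the Axolotl repo as the source of truth.

```python
# Sketch of an Axolotl QLoRA run driven from Python. Assumes a recent
# `pip install axolotl` plus PyYAML; keys follow Axolotl's documented
# config options but may differ by version.
import subprocess
import yaml

config = {
    "base_model": "NousResearch/Meta-Llama-3-8B",  # any HF-compatible checkpoint
    "datasets": [{"path": "tatsu-lab/alpaca", "type": "alpaca"}],
    "adapter": "qlora",                 # parameter-efficient fine-tuning
    "load_in_4bit": True,
    "sequence_len": 32768,              # long-context run
    "sample_packing": True,
    "flash_attention": True,
    "gradient_checkpointing": True,
    "sequence_parallel_degree": 4,      # split each sequence across 4 GPUs (v0.8+)
    "output_dir": "./outputs/llama3-qlora",
}

with open("config.yml", "w") as f:
    yaml.safe_dump(config, f)

# The axolotl CLI wraps accelerate/torchrun for (distributed) launches.
subprocess.run(["axolotl", "train", "config.yml"], check=True)
```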
Axolotl’s "Cookbooks" offer templates such as…
Process Reward Models: PRMs are a type of reward model trained on step-by-step supervision datasets, where each reasoning step is labeled for correctness. They are especially useful as verifiers to improve language model outputs at inference time, and can also provide fine-grained rewards for reinforcement learning-based fine-tuning.
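To make the supervision format concrete, here is a hypothetical step-labeled record; the field names are illustrative, not a fixed Axolotl schema.

```python
# Hypothetical step-supervised record for PRM training: each intermediate
# reasoning step carries its own correctness label (+1 correct, -1 incorrect),
# rather than a single label for the final answer.
prm_example = {
    "prompt": "Natalia sold 48 clips in April and half as many in May. "
              "How many clips did she sell in total?",
    "steps": [
        "In May she sold 48 / 2 = 24 clips.",  # correct step
        "Total = 48 + 24 = 72 clips.",         # correct step
    ],
    "step_labels": [1, 1],  # a flawed trace would mix in -1 labels
}
```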
Multi-GPU and Distributed Training Capabilities
Multi-GPU training is a core strength of Axolotl. It supports DeepSpeed ZeRO-2/3, Fully Sharded Data Parallel (FSDP), and is working to adopt FSDP v2. The addition of sequence parallelism allows it to handle very long contexts efficiently. Many users have successfully fine-tuned models up to 65B/70B parameters using Axolotl on multiple A100 GPUs.
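As a rough sketch of what a multi-GPU launch looks like with the config from above: the deepspeed_configs/ JSON files ship with the Axolotl repository, and the accelerate-based entry point follows Axolotl's documented usage.

```python
# Sketch: the same config.yml scaled out with DeepSpeed ZeRO-3. The
# deepspeed_configs/ JSONs come with the Axolotl repo; adjust the path
# to wherever yours live.
import subprocess

subprocess.run([
    "accelerate", "launch", "-m", "axolotl.cli.train", "config.yml",
    "--deepspeed", "deepspeed_configs/zero3.json",  # shard params/optimizer state
], check=True)
```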
Model Quality, Evaluation, and Inference
Axolotl preserves fine-tuned model quality by relying on standard optimizers and well-tested training techniques. Fine-tuned models can be exported to Hugging Face formats and deployed with inference engines like vLLM, and Axolotl integrates seamlessly with Weights & Biases for training telemetry and evaluation.
Unsloth
Performance Benchmarks: Throughput & VRAM Efficiency
Unsloth is a leader in fine-tuning speed and VRAM efficiency, delivering 2–5x faster training and up to 80% less VRAM use than standard FlashAttention 2 baselines. Benchmarks on RTX 4090 GPUs show Unsloth running 24% faster than Torchtune even with PyTorch compile optimizations enabled. Unsloth’s premium tiers advertise up to 10x faster training on a single GPU and up to 30x faster across multiple GPUs versus FlashAttention 2, and the library supports NVIDIA GPUs from the Tesla T4 to the H100, with portability to AMD and Intel GPUs.
The Unsloth team frequently publishes tutorials (e.g. fine-tuning Llama 3 8B on Colab, fine-tuning Gemma 3, etc.) demonstrating these integrations. A hallmark of Unsloth’s updates is maintaining its speed advantage: the open-source version still promises ~2x speed and big VRAM savings on a single GPU, while the premium “Pro” version reportedly achieves ~10x faster training on one GPU and up to 30x on multi-GPU clusters, along with 90% memory reduction (versus FA2). This Pro edition unlocks multi-GPU and multi-node support (previously a limitation) and even boasts “up to +30% accuracy” in some scenarios. On the feature side, Unsloth introduced “ultra-low-precision” dynamic quantization (down to 1.58-bit) for certain models to push memory usage to the absolute minimum, a technique useful for inference or for adapter training without losing much quality.
New Model Support and Feature Updates
Unsloth also added support for Rank-Stabilized LoRA (rsLoRA) and LoftQ, advanced fine-tuning techniques that improve LoRA training stability and integrate quantization into training. Unsloth has beta support for long-context reasoning as well: one update showcased training a model for long-context tasks (the “Long-context GRPO” update) to address the challenges of very large context windows. Multimodal training is on the radar too: Unsloth demonstrated fine-tuning Llama 3.2 Vision models (incorporating image inputs).
Despite these cutting-edge features, Unsloth remains straightforward for beginners: it provides Colab notebooks and a high-level API (the FastLanguageModel wrapper) so that fine-tuning can be done in a few lines of code or via config. (In fact, Unsloth’s GitHub repo includes one-click notebooks and even guides on exporting results to GGUF or integrating with Ollama for deployment.) In sum, Unsloth’s recent updates solidify its role as the go-to solution when you need to fine-tune fast (and now, at scale) on the newest LLMs.
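A typical QLoRA setup with the open-source package looks roughly like the sketch below. The model name, dataset, and hyperparameters are illustrative, and TRL's SFTTrainer API shifts between versions (newer releases rename tokenizer to processing_class), so check Unsloth's current notebooks for exact recommendations.

```python
# Minimal Unsloth QLoRA sketch: load a 4-bit model, attach LoRA adapters,
# and train with TRL's SFTTrainer. Assumes `pip install unsloth trl datasets`.
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",  # pre-quantized 4-bit checkpoint
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters; rank/alpha values are illustrative defaults.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_rslora=False,  # flip on for rank-stabilized LoRA
)

# Flatten instruction/output columns into a single "text" field.
dataset = load_dataset("yahma/alpaca-cleaned", split="train")
dataset = dataset.map(lambda ex: {
    "text": f"### Instruction:\n{ex['instruction']}\n\n### Response:\n{ex['output']}"
})

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,  # newer TRL versions use `processing_class`
    train_dataset=dataset,
    args=SFTConfig(
        dataset_text_field="text",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=60,
        output_dir="outputs",
    ),
)
trainer.train()
```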
Model Quality, Evaluation, and Inference
Unsloth has expanded support for Meta’s new Llama 4 models, allowing users to fine-tune Llama 4 Scout (17B active parameters, 16 experts) and Llama 4 Maverick (17B active parameters, 128 experts), which Meta reports outperform Llama 3 and rival leading models like GPT-4o and DeepSeek V3 in reasoning and coding. Both models are distilled from the upcoming Llama 4 Behemoth (288B active parameters, 16 experts), which Meta says surpasses GPT-4.5 and Claude 3.7 Sonnet on STEM benchmarks. Once 4-bit support lands in bitsandbytes, Unsloth enables Llama 4 Scout fine-tuning on a single 80GB H100, making fine-tuning 1.5x faster, using 50% less VRAM, and extending context length 8x compared to FlashAttention 2 setups. Unsloth has uploaded dynamic 4-bit and 16-bit versions of Llama 4 to Hugging Face for immediate use, and now supports full fine-tuning, 8-bit training, pretraining, all transformer-style models (Mixtral and other MoE architectures, Cohere, etc.), and any training algorithm, including GRPO with vision-language models (VLMs). Performance tests on Alpaca datasets further validate Unsloth’s speed and memory savings.
Unsloth has gained major traction, with backing from Microsoft’s M12 and GitHub’s Open Source Fund. Its users include Microsoft, NVIDIA, Meta, NASA, HP, VMware, and Intel. Hugging Face’s TRL library references Unsloth in its documentation, and Unsloth remains one of the fastest-growing open-source projects in AI fine-tuning.
Resources
Fine-tuning Guide by Unsloth: https://docs.unsloth.ai/get-started/fine-tuning-guide
Hugging Face Reasoning Course: https://huggingface.co/reasoning-course
Tutorial: Train your own Reasoning model with GRPO: https://docs.unsloth.ai/basics/reasoning-grpo-and-rl/tutorial-train-your-own-reasoning-model-with-grpo
Torchtune
As the official PyTorch library for LLM fine-tuning, Torchtune has rapidly incorporated new model families and features. In February 2025, Torchtune officially introduced multi-node training, enabling full fine-tuning across multiple nodes to support larger batch sizes and models.
By September 2024 it had added support for Llama 3.2 (including the 1B, 3B, and 11B Vision variants); by October/November 2024 it onboarded Alibaba’s Qwen2.5 models and Google’s Gemma 2. Come December 2024, Torchtune could handle Llama 3.3 70B fine-tuning out of the box, and by early 2025 it naturally extended to Llama 4 and related new releases (given PyTorch’s close involvement with Meta models).
Torchtune’s design philosophy is to stay extensible and “recipe”-driven: it provides reference YAML configs and Python recipes for various fine-tuning methods. Recent versions (v0.4.0 and beyond) introduced stable support for activation offloading (paging activations to CPU memory to train larger models on smaller GPUs) and multimodal QLoRA training, enabling parameter-efficient fine-tuning of vision-language models. As mentioned above, a major milestone was multi-node training support: you can now run Torchtune across multiple servers for full fine-tunes of giant models.
Under the hood, this leverages PyTorch’s distributed training (likely via FSDP or DDP with sharded optimizers) to scale up batch sizes and model sizes seamlessly. In terms of fine-tuning techniques, Torchtune has kept pace with research: it supports Supervised Fine-Tuning (SFT), numerous RLHF algorithms, knowledge distillation, and even quantization-aware training (QAT) – all within its unified interface. This means you can fine-tune a model, distill it, apply DPO, and quantize it, using Torchtune’s components in a pipeline.
Despite being feature-rich, Torchtune remains closer to the metal: it’s essentially “just PyTorch,” so advanced users can dive into the code or customize models freely. The CLI (the tune command) and config files make common tasks simple, but Torchtune’s minimal abstractions mean you might spend a bit more time tweaking code for custom behaviors. The benefit is a very transparent training process and easy debugging (which many researchers and engineers appreciate).
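A typical session stays close to that CLI. The sketch below drives it from Python for illustration; the recipe and config names follow torchtune's published examples, but run `tune ls` to see what your installed version actually ships.

```python
# Sketch of the Torchtune CLI flow, wrapped in Python for illustration.
import subprocess

# 1. Fetch weights from the Hugging Face Hub (a token is needed for gated models).
subprocess.run(["tune", "download", "meta-llama/Llama-3.2-1B-Instruct",
                "--output-dir", "/tmp/Llama-3.2-1B-Instruct"], check=True)

# 2. Launch a single-device LoRA recipe, overriding config values inline.
#    For multi-GPU, swap in the lora_finetune_distributed recipe and a
#    matching distributed config.
subprocess.run(["tune", "run", "lora_finetune_single_device",
                "--config", "llama3_2/1B_lora_single_device",
                "batch_size=4", "epochs=1"], check=True)
```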
Performance Benchmarks: Throughput & VRAM Efficiency
Torchtune provides excellent training throughput using PyTorch 2.x features such as torch.compile, fused operations, and FlashAttention. Its pure-PyTorch design ensures minimal overhead and broad compatibility, including support for both NVIDIA and AMD GPUs.
New Model Support and Feature Updates
Torchtune has kept up with model releases, supporting Llama 3.2, Qwen2.5, Gemma 2, Llama 3.3 70B, and Llama 4. New features include activation offloading to allow training larger models on limited GPUs, multimodal QLoRA training for vision-language models, and support for supervised fine-tuning (SFT), DPO, PPO, GRPO, knowledge distillation, and quantization-aware training (QAT).
Model Quality, Evaluation, and Inference
Torchtune fine-tuned models maintain high quality, and quantization-aware training (QAT) makes it possible to produce small, efficient models without compromising performance. Torchtune models can easily be exported to Hugging Face formats or ONNX, or kept in native PyTorch.
Conclusion
By 2025, Axolotl, Unsloth, and Torchtune have all matured into powerful LLM fine-tuning frameworks. Axolotl remains a strong choice for beginners and multi-GPU setups thanks to its community-driven defaults and rapid support for new models. Unsloth dominates in single-GPU scenarios with its unprecedented speed and memory efficiency, pushing the limits of what’s possible with limited hardware. Torchtune, on the other hand, offers the most seamless integration with PyTorch’s evolving ecosystem, scaling cleanly to large multi-node setups while maintaining flexibility and extensibility for researchers and developers.
Each framework now supports LoRA/QLoRA, RLHF, multimodal fine-tuning, and long-context training, with their differences primarily revolving around specialization: Axolotl for usability, Unsloth for efficiency, and Torchtune for deep customization and scalability.
Resources
Weights & Biases - Guide to Fine Tuning (Tutorial Course) https://wandb.ai/site/solutions/llm-fine-tuning/
Weights & Biases - https://wandb.ai/augmxnt/train-bench/reports/torchtune-vs-axolotl-vs-unsloth-Trainer-Performance-Comparison--Vmlldzo4MzU3NTAx
Axolotl GitHub Repository - https://github.com/OpenAccess-AI-Collective/axolotl
Unsloth Documentation - https://unsloth.ai/
Torchtune Official Documentation (PyTorch) - https://pytorch.org/torchtune/
Hugging Face Transformers Library - https://huggingface.co/docs/transformers/index
Hugging Face TRL (Transformer Reinforcement Learning) - https://huggingface.co/docs/trl/index
LLaMA-Factory GitHub - https://github.com/hiyouga/LLaMA-Factory
SWIFT Paper on Arxiv - https://arxiv.org/abs/2408.05517
Entry Point AI Platform - https://www.entrypointai.com/
SuperAnnotate LLM Fine-Tuning Blog - https://www.superannotate.com/blog/llm-fine-tuning

About Hyperbolic
Hyperbolic is democratizing AI by delivering a complete open ecosystem of AI infrastructure, services, and models. Through coordinating a decentralized network of global GPUs and leveraging proprietary verification technology, developers and researchers have access to reliable, scalable, and affordable compute as well as the latest open-source models.
Founded by award-winning Math and AI researchers from UC Berkeley and the University of Washington, Hyperbolic is committed to creating a future where AI technology is universally accessible, verified, and collectively governed.
Website | X | Discord | LinkedIn | YouTube | GitHub | Documentation