Handling Multi-Terabyte LLM Checkpoints // Simon Karasik // #228

MLOps.community - A podcast by Demetrios

Join us at our first in-person conference on June 25 all about AI Quality: https://www.aiqualityconference.com

Simon Karasik is a proactive and curious ML Engineer with 5 years of experience. He has developed and deployed ML models at web and big scale for Ads and Tax.

Huge thank you to Nebius AI for sponsoring this episode. Nebius AI - https://nebius.ai/

MLOps podcast #228 with Simon Karasik, Machine Learning Engineer at Nebius AI, Handling Multi-Terabyte LLM Checkpoints.

// Abstract
The talk provides a gentle introduction to LLM checkpointing: why it is hard and how big the checkpoints are. It covers various tips and tricks for saving and loading multi-terabyte checkpoints, as well as the selection of cloud storage options for checkpointing.

// Bio
Full-stack Machine Learning Engineer, currently working on infrastructure for LLM training, with previous experience in ML for Ads, Speech, and Tax.

// MLOps Jobs board
https://mlops.pallet.xyz/jobs

// MLOps Swag/Merch
https://mlops-community.myshopify.com/

// Related Links

--------------- ✌️Connect With Us ✌️ -------------
Join our slack community: https://go.mlops.community/slack
Follow us on Twitter: @mlopscommunity
Sign up for the next meetup: https://go.mlops.community/register
Catch all episodes, blogs, newsletters, and more: https://mlops.community/
Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/
Connect with Simon on LinkedIn: https://www.linkedin.com/in/simon-karasik/

Timestamps:
[00:00] Simon's preferred beverage
[01:23] Takeaways
[04:22] Simon's tech background
[08:42] Zombie models garbage collection
[10:52] The road to LLMs
[15:09] Trained models Simon worked on
[16:26] LLM Checkpoints
[20:36] Confidence in AI Training
[22:07] Different Checkpoints
[25:06] Checkpoint parts
[29:05] Slurm vs Kubernetes
[30:43] Storage choices lessons
[36:02] Paramount components for setup
[37:13] Argo workflows
[39:49] Kubernetes node troubleshooting
[42:35] Cloud virtual machines have pre-installed mentoring
[45:41] Fine-tuning
[48:16] Storage, networking, and complexity in network design
[50:56] Start simple before advanced; consider model needs
[53:58] Join us at our first in-person conference on June 25 all about AI Quality
