Cost/Performance Optimization with LLMs [Panel]

MLOps.community - A podcast by Demetrios Brinkmann

Sign up for the next LLM in production conference here: https://go.mlops.community/LLMinprod
Watch all the talks from the first conference: https://go.mlops.community/llmconfpart1

// Abstract
In this panel discussion, the panelists explore the cost of running large language models (LLMs) and potential solutions. They discuss the benefits of bringing LLMs in-house, such as latency optimization and greater control, and examine optimization methods such as structured pruning and knowledge distillation. OctoML's platform is mentioned as a tool for automatically deploying custom models and selecting the most appropriate hardware for them. Overall, the discussion offers insights into the challenges of managing LLMs and strategies for overcoming them.

// Bio

Lina Weichbrodt
Lina is a pragmatic freelancer and machine learning consultant who likes to solve business problems end-to-end and make machine learning, or a simple, fast heuristic, work in the real world. In her spare time, Lina likes to exchange ideas with other people on implementing best practices in machine learning; talk to her on the Machine Learning Ops Slack: shorturl.at/swxIN.

Luis Ceze
Luis Ceze is Co-Founder and CEO of OctoML, which enables businesses to seamlessly deploy ML models to production while making the most of their hardware. OctoML is backed by Tiger Global, Addition, Amplify Partners, and Madrona Venture Group. Ceze is the Lazowska Professor in the Paul G. Allen School of Computer Science and Engineering at the University of Washington, where he has taught for 15 years. He co-directs the Systems and Architectures for Machine Learning lab (sampl.ai), which co-authored Apache TVM, a leading open-source ML stack for performance and portability that is used in widely deployed AI applications.
Luis is also co-director of the Molecular Information Systems Lab (misl.bio), which has led pioneering research at the intersection of computing and biology for IT applications such as DNA data storage. His research has been featured prominently in media outlets including the New York Times, Popular Science, MIT Technology Review, and the Wall Street Journal. Ceze is a Venture Partner at Madrona Venture Group and leads their technical advisory board.

Jared Zoneraich
Co-Founder of PromptLayer, enabling data-driven prompt engineering. Compulsive builder. Jersey native, with a brief stint in California (UC Berkeley '20), now residing in NYC.

Daniel Campos
Hailing from Mexico, Daniel started his NLP journey with a BS in CS from RPI. He then worked at Microsoft on ranking at Bing with LLMs (back when they had 2 commas) and helped build popular datasets such as MSMARCO and TREC Deep Learning. While at Microsoft, he earned his MS in Computational Linguistics from the University of Washington with a focus on curriculum learning for language models. Most recently, he has been pursuing his Ph.D. at the University of Illinois Urbana-Champaign, focusing on efficient inference for LLMs and robust dense retrieval. During his Ph.D., he has worked for companies including Neural Magic, Walmart, Qualtrics, and Mendel.AI, and he now works on bringing LLMs to search at Neeva.

Mario Kostelac
Currently building AI-powered products at Intercom in a small, highly effective team. I roam between practical research and engineering but lean more toward engineering and the challenges of running reliable, safe, and predictable ML systems. You can imagine how fun that is in the LLM era :). Generally interested in the intersection of product and tech, and in building differentiation by solving hard challenges (technical or non-technical). Software engineer turned machine learning engineer 5 years ago.
