MLOps Coffee Sessions #14 Conversation with the Creators of Dask // Hugo Bowne-Anderson and Matthew Rocklin

MLOps.community - A podcast by Demetrios Brinkmann

Dask

What is it?
Parallelism for analytics.

What is parallelism?
Doing a lot at once by splitting tasks into smaller subtasks that can be processed in parallel (at the same time).
Distributing work across multiple machines and then combining the results.
Helpful for CPU-bound work: doing a bunch of calculations on the CPU. The rate at which a process progresses is limited by the speed of the CPU.

Concurrency?
Similar, but things don’t have to happen at the same time; they can happen asynchronously and can overlap.
Shared state.
Helpful for I/O-bound work: networking, reading from disk, etc. The rate at which a process progresses is limited by the speed of the I/O subsystem.

Multi-core vs distributed
Multi-core is a single processor with 2 or more cores that can cooperate through threads (multithreading).
Distributed is across multiple nodes communicating via HTTP or RPC.

Why is this hard?
Python has its challenges due to the GIL (global interpreter lock); many other languages don't have this constraint.
Shared state can lead to potential race conditions, deadlocks, etc.
Coordinating work across the machines.

For analytics?
Calculating some statistics on a large dataset can be tricky if it can’t fit in memory.

// Show Notes
Coiled Cloud: https://cloud.coiled.io/
Coiled Launch Announcement: https://medium.com/coiled-hq/coiled-dask-for-everyone-everywhere-376f5de0eff4
OSS article: https://www.forbes.com/sites/glennsolomon/2020/09/15/monetizing-open-source-business-models-that-generate-billions/#2862e47234fd
Amish barn raising: https://www.youtube.com/watch?v=y1CPO4R8o5M
Message Passing Interface: https://en.wikipedia.org/wiki/Message_Passing_Interface

----------- Connect With Us ✌️-------------
Join our Slack community: https://go.mlops.community/slack
Follow us on Twitter: @mlopscommunity
Sign up for the next meetup: https://go.mlops.community/register
Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/
Connect with David on LinkedIn: https://www.linkedin.com/in/aponteanalytics/
Connect with Matthew on LinkedIn: https://www.linkedin.com/in/matthew-rocklin-461b4323/

Timestamps:
0:00 - Intro to Matthew Rocklin and Hugo Bowne-Anderson
0:37 - Matthew Rocklin's Background
1:17 - Hugo Bowne-Anderson's Background
3:47 - Where did that inspiration come from?
10:04 - Is there a close relationship between Best Practices and Tooling or are these two separate things?
11:27 - Why is Data Literacy important with Coiled?
14:46 - How do you think about the balance between enabling Data Science to have a lot of powerful compute?
17:05 - Machine Learning as a space for tracking best practices experimentation
19:32 - What makes Data Science so difficult?
24:07 - How can a for-profit company complement Open Source Software (OSS)?
29:40 - Amazon becoming a competitor with your own open-source technology (?)
32:50 - How do you encourage more people to contribute and ensure quality?
34:58 - Do you see Coiled operating within the Dask ecosystem?
37:30 - What is Dask?
39:19 - What should people know about parallelism?
41:28 - Why is it so hard to put things back together?
41:34 - Why does Python need a whole new tool to enable that? Or maybe some other tools as well?
44:44 - Dynamic Task Scheduling as being useful to Data Scientists
47:15 - Why is reliability in particular important in Data Science?
52:27 - What's in store for Dask?
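
A rough sketch of the split-and-combine idea from the parallelism notes above, using only the Python standard library. It is an illustration, not how Dask itself works: the function names (`chunk_stats`, `split_into_chunks`, `parallel_mean`) and the chunk size are made up for this example. A thread pool is used only to show the shape of the pattern; because of the GIL, threads will not speed up CPU-bound pure-Python work like this, which is part of why tools that spread work across processes or machines exist.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_stats(chunk):
    """Return the partial (sum, count) for one chunk of numbers."""
    return sum(chunk), len(chunk)


def split_into_chunks(values, chunk_size):
    """Yield consecutive slices of at most chunk_size items."""
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]


def parallel_mean(values, chunk_size=1000, max_workers=4):
    """Compute a mean by splitting the data, processing chunks, and combining."""
    chunks = split_into_chunks(values, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each chunk is an independent subtask that yields a small partial result.
        partials = list(pool.map(chunk_stats, chunks))
    # Combine step: only the partial sums and counts are needed here.
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    if count == 0:
        raise ValueError("mean of empty data")
    return total / count


data = list(range(1, 10001))
print(parallel_mean(data))  # prints 5000.5
```

The combine step only ever sees one (sum, count) pair per chunk, so the same approach could in principle be applied to data read piece by piece from disk when the full dataset does not fit in memory.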
