Media processing workflows are inherently complex, often requiring extensive state management and continuous updates to ensure media encodes remain current. At Netflix, we have developed the Plato media workflow platform, a key component of our larger Cosmos media processing system. This talk will delve into how we process millions of media workflow events daily, ensuring durability and scalability while enhancing the developer experience.
Enhancing RPC Durability in Media Workflows
Cosmos is a microservices-based platform in which each processing component is implemented as a microservice or a serverless function. Workflows therefore have to make remote procedure calls (RPCs) asynchronously to execute tasks at scale. Plato’s approach of handling these calls with message-passing techniques makes otherwise flaky RPCs more durable and reliable. Our users can thus build on a resilient RPC client foundation that mitigates the impact of potential failures on workflow continuity.
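To make the message-passing idea concrete, here is a minimal sketch in Python. It is not Plato's actual client: `DurableRpcClient`, `CallMessage`, and `make_flaky_service` are hypothetical names, and an in-memory deque stands in for what would need to be a durable message queue. Each call is recorded as a message and redelivered until the handler succeeds or an attempt limit is reached.

```python
import collections
from dataclasses import dataclass
from typing import Any, Callable, Deque, Dict, List


@dataclass
class CallMessage:
    """A request recorded as a message rather than sent directly."""
    call_id: str
    payload: Dict[str, Any]
    attempts: int = 0


class DurableRpcClient:
    """Hypothetical client: queues calls and redelivers them until they succeed."""

    def __init__(self, handler: Callable[[Dict[str, Any]], Any], max_attempts: int = 5):
        self._handler = handler  # stands in for the remote service
        self._max_attempts = max_attempts
        # In-memory stand-in; a real system would need a persistent queue here.
        self._queue: Deque[CallMessage] = collections.deque()
        self.results: Dict[str, Any] = {}
        self.dead_letters: List[CallMessage] = []

    def call(self, call_id: str, payload: Dict[str, Any]) -> None:
        """Record the request; the caller does not block on the remote service."""
        self._queue.append(CallMessage(call_id, payload))

    def drain(self) -> None:
        """Deliver queued messages, requeueing any that fail."""
        while self._queue:
            msg = self._queue.popleft()
            msg.attempts += 1
            try:
                self.results[msg.call_id] = self._handler(msg.payload)
            except Exception:
                if msg.attempts < self._max_attempts:
                    self._queue.append(msg)  # redeliver later
                else:
                    self.dead_letters.append(msg)  # give up, keep for inspection


def make_flaky_service(failures_before_success: int) -> Callable[[Dict[str, Any]], str]:
    """Build a fake service that fails a fixed number of times, then succeeds."""
    state = {"remaining": failures_before_success}

    def service(payload: Dict[str, Any]) -> str:
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise ConnectionError("transient failure")
        return f"encoded:{payload['title']}"

    return service
```

In this sketch the caller only enqueues work. Because a failed delivery puts the message back on the queue, a transient failure in the service delays the result without losing the request.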
Scaling to Millions of Workflow Events
Media processing work is bursty: the demand for producing encodes often exceeds the available compute resources. To address this, Plato incorporates features like priority-based task queues, execution avoidance, and a combination of dynamic and static graph execution models. Together, these features enable us to process millions of workflow events daily. We will present real-world scenarios that show how these features allow Plato to scale up efficiently and execute those events durably.
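As a rough illustration of two of these features, the sketch below combines a priority-based task queue with a simple form of execution avoidance. It assumes that execution avoidance means skipping a task whose result already exists for identical inputs; that reading, along with the names `PriorityScheduler` and `fingerprint`, is an assumption made for this example and not a description of Plato's implementation.

```python
import hashlib
import heapq
import itertools
import json
from typing import Any, Dict, List, Tuple


def fingerprint(inputs: Dict[str, Any]) -> str:
    """Stable hash of a task's inputs, used as the lookup key for prior results."""
    canonical = json.dumps(inputs, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class PriorityScheduler:
    """Hypothetical scheduler: lower priority number runs first."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, Dict[str, Any]]] = []
        # Tie-breaker: keeps submission order within a priority level and
        # guarantees the input dicts themselves are never compared.
        self._counter = itertools.count()
        self._completed: Dict[str, str] = {}  # fingerprint -> stored result
        self.executed: List[str] = []
        self.skipped: List[str] = []

    def submit(self, priority: int, inputs: Dict[str, Any]) -> None:
        heapq.heappush(self._heap, (priority, next(self._counter), inputs))

    def run(self) -> None:
        while self._heap:
            _, _, inputs = heapq.heappop(self._heap)
            key = fingerprint(inputs)
            if key in self._completed:
                # Execution avoidance: identical inputs already produced a result.
                self.skipped.append(inputs["title"])
                continue
            self._completed[key] = f"encode-of-{inputs['title']}"
            self.executed.append(inputs["title"])
```

Submitting a low-priority catalog refresh twice and a high-priority new release once would, under these assumptions, run the new release first and execute the refresh only once.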
Prioritizing Developer Experience
While ensuring durability is crucial for our users, it cannot come at the cost of developer experience. The Plato platform allows users to seamlessly bring their own strongly typed data models. This ensures that workflow execution state can be stored and retrieved reliably, lets users test workflows against strong contracts, and lowers the barrier to entry by making the platform easier to use. We will highlight case studies that demonstrate how Plato provides a good developer experience and discuss some of the open challenges we are working on.
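The following sketch shows one way a user-defined, strongly typed model could round-trip through stored workflow state. It uses only Python's standard `dataclasses` and `json` modules; `EncodeRequest`, `save_state`, and `load_state` are hypothetical names for illustration and do not reflect Plato's data model API.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class EncodeRequest:
    """Hypothetical user-defined model for a workflow's input."""
    title_id: str
    codec: str
    bitrates_kbps: List[int]

    def __post_init__(self) -> None:
        # A simple contract check enforced whenever the model is constructed.
        if not self.bitrates_kbps:
            raise ValueError("at least one bitrate is required")


def save_state(request: EncodeRequest) -> str:
    """Serialize the typed model to a JSON string for storage."""
    return json.dumps(asdict(request), sort_keys=True)


def load_state(raw: str) -> EncodeRequest:
    """Rebuild the typed model; missing or unknown fields raise TypeError."""
    return EncodeRequest(**json.loads(raw))
```

Because the stored state is rebuilt through the model's constructor, a payload that no longer matches the model fails at load time, which is the kind of contract a workflow test can assert on.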
Conclusion
This talk will provide an overview of how Netflix implements durable execution to process media encodes at scale. Attendees will gain insights into the challenges Netflix faces and the techniques it uses in the media processing space, with practical examples from the Plato platform that highlight our approach to durability, scalability, and developer experience.
References
- For an overview of the underlying technologies and design principles of the Cosmos platform, please refer to our blog post here.