Scaling Live Streaming for 62.2 Crore Concurrent Viewers: The Tech Behind India vs Pakistan Match
Extensive Preparation and Audits:
- Planning starts months in advance.
- Service owners define breaking points for their systems (front end, CDN, load balancer, back end, databases).
- Partners undergo audits to ensure their infrastructure and mitigations are ready.
Front-End Optimisation:
- Code freezes are implemented well in advance for back-end systems, while front-end teams continue to ship new features during the tournament.
- Feature flags are used extensively to quickly turn off features on specific platforms if issues arise (e.g., increased crash rates).
- Simulations using tools like Charles mimic potential API failures (5XX errors, latency, DNS failures) to ensure a smooth user experience.
- Graceful degradation is implemented to minimise customer impact.
- Core features (P0, the live stream) and ads are prioritised.
- Non-critical features (P1, P2) can fail without displaying error messages to users.
- Exponential back-off strategies are implemented for API retries to prevent system overload.
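The retry behaviour above can be sketched as exponential back-off with full jitter; the function name and delay parameters here are illustrative, not from the source:

```python
import random
import time

def call_with_retries(fn, max_retries=5, base=0.5, cap=30.0):
    """Retry fn() on failure, waiting up to base * 2**attempt seconds
    (capped at `cap`), with full jitter so millions of clients do not
    retry in lock-step and overload the back end."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the failure
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

The jitter matters at this scale: without it, clients that failed at the same moment all retry at the same moment, turning one spike into a train of spikes.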
Source Feed and Encoding:
The video feed originates at the venue (stadium), where multiple cameras are connected to a Production Control Room (PCR). The director in the PCR selects which camera and angle to show. The feed is passed through a contribution encoder, which compresses it before it is sent to the cloud. Within the cloud ecosystem, a distribution encoder transforms the video into formats such as HLS and DASH, which are compatible with different players and devices.

Orchestration:
An orchestrator manages the entire workflow, from contribution encoding to cloud infrastructure. It controls which endpoint content is pushed to, the configuration of the distribution encoder, and which CDN endpoints are used. The orchestrator generates playback URLs and pushes them to the content management system, which the client apps interact with for browsing.

CDN Optimisation:
To scale effectively, the system uses a content delivery network (CDN). The CDN caches video segments, which are typically four to six seconds long. The client player pulls a master manifest, which references multiple child manifests, each containing a list of segments; the player then requests those segments from the CDN. Engineering effort is focused on tuning CDN configurations and choosing the right Time To Live (TTL) to balance cache hits against staleness.
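An illustrative HLS master manifest (the paths, bandwidths and resolutions are invented for this sketch): each `EXT-X-STREAM-INF` entry points at a child manifest for one quality layer, and each child manifest lists the four-to-six-second segments the player fetches from the CDN.

```
#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=640x360
360p/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2500000,RESOLUTION=1280x720
720p/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=5000000,RESOLUTION=1920x1080
1080p/index.m3u8
```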
Adaptive Bitrate Streaming:
Adaptive bitrate streaming ensures a smooth viewing experience. The player measures the user's bandwidth and adjusts video quality accordingly; if download speeds drop, it switches to a lower-quality layer. The server can also degrade deliberately by limiting the number of layers accessible to users.
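A minimal sketch of the client-side layer choice; the ladder values and the headroom factor are assumptions, not figures from the source:

```python
def pick_layer(measured_kbps, ladder_kbps, headroom=0.8):
    """Choose the highest bitrate layer that fits within a fraction
    (`headroom`) of the measured bandwidth, leaving slack for network
    jitter; fall back to the lowest layer when even that doesn't fit."""
    budget = measured_kbps * headroom
    eligible = [kbps for kbps in ladder_kbps if kbps <= budget]
    return max(eligible) if eligible else min(ladder_kbps)
```

Server-side degradation then corresponds to shrinking `ladder_kbps` for some cohort of users, so no client can request the most expensive layers.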
Capacity Planning:
Capacity planning is a complex exercise that starts well in advance. It involves working with providers to ensure sufficient resources, including compute, RAM, disk and network, and requires modelling user traffic and understanding growth patterns.

Challenges Specific to the Region:
- Mobility: Given India's high mobile usage, the system must handle devices constantly moving and switching between networks.
- Battery Life: Battery consumption is minimised by optimising background processing, screen brightness, volume levels and video profiles.
- Custom Scaling Systems: Ordinary auto-scaling cannot respond to the sudden surges caused by events such as innings breaks. Instead, custom scaling systems keyed to a concurrency metric are used; models translate concurrency into the expected traffic on each service.
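The concurrency-to-traffic translation can be sketched as below; the per-user request rates, per-instance capacity and safety buffer are hypothetical numbers, not the source's model:

```python
import math

def capacity_targets(concurrency, rps_per_user, rps_per_instance, buffer=0.3):
    """Translate a platform-wide concurrency number into instance counts:
    expected RPS per service = concurrency * that service's per-user rate,
    padded by a safety buffer and divided by one instance's capacity."""
    targets = {}
    for service, per_user in rps_per_user.items():
        expected_rps = concurrency * per_user
        targets[service] = math.ceil(
            expected_rps * (1 + buffer) / rps_per_instance[service]
        )
    return targets
```

Scaling can then be driven proactively from the live concurrency curve, rather than reactively from CPU metrics that lag a surge.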
Importance of Planning and Drills:
"Game days" simulate actual match conditions to test the system's ability to scale. These drills involve generating synthetic traffic and observing how the systems respond.

Monitoring and Metrics:
Detailed measurements are crucial for identifying problems. Leading indicators, such as buffer time and play failure rate, are prioritised so issues are detected before users complain.
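A sketch of how those two leading indicators might be aggregated from player heartbeat reports; the field names are invented for illustration:

```python
def session_health(sessions):
    """Compute rebuffer ratio (buffering time / watch time) and play
    failure rate across a batch of player session reports."""
    watch = sum(s["watch_s"] for s in sessions)
    buffering = sum(s["buffer_s"] for s in sessions)
    failures = sum(1 for s in sessions if s["failed"])
    return {
        "rebuffer_ratio": buffering / watch if watch else 0.0,
        "play_failure_rate": failures / len(sessions) if sessions else 0.0,
    }
```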
Trade-offs and Decision Making:
Engineers had to trade latency against smoothness, settling on a sweet spot that worked for the majority of users. Request volume also mattered: increasing the frequency of manifest calls increases the compute cycles spent on the CDN.

Degradation Frameworks:
Degradation frameworks were in place to reduce data collection, or change the interval at which data is collected, during high traffic.

Finite Cloud Resources:
Cloud resources are not infinite; there is a limit to how much capacity can be added in a given region. This requires careful capacity planning and coordination with CDN providers.

Understanding System Limitations:
Engineers must understand the limitations of their systems and plan for potential failures: consider everything that can fail and take steps to mitigate those risks. Detailed measurements are crucial for identifying the root cause of problems.

Evolving Systems:
The live streaming landscape constantly evolves, so it is important to stay up to date with the latest technologies and trends. Continuous learning and experimentation are essential for building and maintaining systems at this scale.
Asynchronous Processing:
- Kafka is used for messaging, with careful planning for throughput, producer/consumer rates and partition counts.
- Non-critical data processing can be deferred until after the match.
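Partition counts fall out of the throughput maths; a back-of-the-envelope sketch, with illustrative MB/s figures rather than the team's real numbers:

```python
import math

def min_partitions(target_mb_s, per_producer_mb_s, per_consumer_mb_s):
    """A partition is Kafka's unit of parallelism: you need enough
    partitions both to absorb the write rate and to let a consumer
    group keep up with the read rate."""
    return max(
        math.ceil(target_mb_s / per_producer_mb_s),
        math.ceil(target_mb_s / per_consumer_mb_s),
    )
```

Sizing this up front matters because repartitioning a hot topic mid-tournament is disruptive; planning for peak throughput avoids it.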
Ad Insertion:
- Server-side ad insertion is used, with personalisation achieved through cohort-based targeting.