How we prepared to scale for "La Copa del Rey"
Before starting with this post, you should know that a soccer match between Futbol Club Barcelona and Real Madrid is one of the biggest events in Spain nowadays. Even though it isn’t the Champions League or “La Liga”, it is a huge match: these are the two teams with the most fans in Spain, and one of the most followed fixtures in the world because of the media impact these two clubs have generated over the past 5 years.
Two weeks ago we were told that our company had closed an agreement to place advertising in the Camp Nou during the semifinals of “La Copa del Rey”, which was going to be played by the two teams I mentioned earlier. Because of this marketing campaign, we had to prepare our application infrastructure to scale and handle the possible increase in traffic this could bring to our application.
As you can see on our website, if you haven’t done so yet, we are a Video On Demand service, and we can’t afford to fail: our service must not have maintenance windows, and users shouldn’t notice when we make deep changes to our infrastructure. We only get one chance to make a good impression, given the different “free” alternatives in the market.
Our web application and API are evolving into a micro-services oriented application that relies on Amazon AWS. Since we moved to Amazon AWS in 2011, we have tried to apply the best practices recommended by Amazon evangelists (@mza has helped us a lot on this matter), and we’ve also learned from other companies that rely on AWS as well (Instagram, Netflix, 4sq, Quora, etc…).
Wuaki’s web application and API rely 100% on the Ireland region of AWS, which has 3 availability zones (EU-WEST-1a, EU-WEST-1b, EU-WEST-1c). Over the past year we have managed to implement all the tools recommended by Amazon to ensure system availability, continuity and resilience:
- Amazon ELB’s
- Amazon Route53
- Amazon Cloudfront and S3
- Amazon EC2 Multi-AZ deployment
- Amazon RDS Multi-AZ deployment
- Amazon Autoscaling
Preparing ourselves for the match
One of the most important things to keep in mind before scaling your application is that when you set up your Amazon AWS account, Amazon limits everything in your profile by default. That means you can’t create unlimited EC2 instances, Elastic IPs, ELBs, etc…: AWS accounts are mostly limited to 5-10 items on each of their services. If you reach those limits, or if you think they aren’t going to be enough for you, you should send a request explaining your needs and the nature of your use case, and Amazon will help you (this can take several hours).
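As an illustration of why you want to do this early, here is a quick back-of-the-envelope check of whether your current account limit leaves enough headroom for an expected traffic spike. This is a sketch, not our actual tooling; the function name, the 20% safety headroom and the example numbers are assumptions:

```python
def needs_limit_increase(running_instances, account_limit,
                         traffic_multiplier, headroom=1.2):
    """Return True if the projected fleet size (with a safety headroom)
    would exceed the account's EC2 instance limit, meaning you should
    file a limit-increase request with AWS support well in advance."""
    projected = running_instances * traffic_multiplier * headroom
    return projected > account_limit

# e.g. 12 instances today, the default-ish limit of 20, expecting ~10x traffic:
needs_limit_increase(12, 20, 10)   # True -> file the request early
```

The point is simply that the request takes hours (or longer) to process, so the projection has to be done before the event, not during it.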
One key aspect when you want to scale your architecture on Amazon AWS is the use of ELBs. ELBs are the most convenient way to manage load balancing inside the AWS public or private cloud, but they have a downside: if you plan to receive thousands of requests at a specific moment, they won’t work properly unless you ask for a pre-warming of the ELBs involved. This can be done by opening a ticket with the Amazon AWS support team, who will ask you some details about your application.
Last week we asked Amazon to increase our EC2 limit to a maximum of 150 servers and to pre-warm 6 of our ELBs to handle the possible increase in traffic we could face during the event. Everything was solved on their side in one day (they asked me a lot of questions).
In parallel, we configured our autoscaling policies to increase the minimum number of working instances on each layer of our stack, lowering the alarm thresholds that trigger our autoscaling policies up and down. We’ve learned that each instance type on Amazon has different properties: beyond the obvious CPU and memory, the NICs also behave differently depending on the instance type.
Amazon instance types’ NIC limitations
Amazon m1.large instances have a 500 Mbit/s link and m1.xlarge instances have 1000 Mbit/s, for example. You must take into consideration that if you use Amazon EBS, part of that network capacity is consumed by EBS storage traffic, unless you use Provisioned IOPS. Because of this, the NetworkOut metric is the most reliable one when you autoscale.
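To turn a link speed into something you can actually put in a CloudWatch alarm, you need to convert Mbit/s into bytes per measurement period (CloudWatch reports NetworkOut in bytes). A minimal sketch, assuming a 5-minute period and a 75% trigger point (both are our choices for illustration, not AWS defaults):

```python
def network_out_threshold(link_mbits, period_seconds=300, utilization=0.75):
    """Convert a NIC link speed (Mbit/s) into a CloudWatch NetworkOut
    alarm threshold in bytes per period, at a target utilization."""
    bytes_per_second = link_mbits * 1_000_000 / 8   # Mbit/s -> bytes/s
    return int(bytes_per_second * period_seconds * utilization)

# m1.large (~500 Mbit/s), 5-minute CloudWatch period, scale up at 75% of the link:
network_out_threshold(500)    # 14062500000 bytes per 5 minutes
```

Scaling on 75% of the link rather than 100% leaves room for the EBS traffic that shares the same NIC.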
Most of Wuaki.tv’s infrastructure relies on m1.large instances in the core application layer, and m1.xlarge in the database layer using a Multi-AZ deployment, which means that if a Master instance fails, in a matter of seconds Amazon will bring up a backup instance in another availability zone. Our current MySQL stack looks like this.
We also implemented a small contingency plan to avoid overloading our stack. Since last Friday, each user who tries to access Wuaki.TV is redirected to a signup landing page unless their public IP address is a Spanish IP. This was planned to help us in case we received lots of traffic from countries where we aren’t currently available.
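The gating logic itself is trivial once you have a GeoIP lookup of the client’s public IP. A minimal sketch (the function name and landing-page path are hypothetical, not our actual code, and the country code would come from a GeoIP database such as MaxMind’s):

```python
def gate_request(country_code, landing_url="/signup-landing"):
    """Return a redirect target for visitors outside Spain, or None to
    let the request through to the application."""
    if country_code != "ES":
        return landing_url
    return None

gate_request("ES")   # None -> serve the app normally
gate_request("FR")   # "/signup-landing" -> redirect to the landing page
```

In practice you would run this as early in the request path as possible (a middleware or the load balancer layer), so rejected traffic never touches the application servers you are trying to protect.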
During the match
During the match we received almost 10 times more subscription registrations than we get on a daily basis, and we successfully handled the traffic peaks without any issue. Since it was an intense match, people were into the game most of the time and the traffic concentrated at half-time, so we had to ensure that all users could access our application during that break.