Learning from Netflix – Part 2


In the last blog post, I listed the tools and practices introduced by Netflix in its presentation at AWS re:Invent 2013. In this second part of the series, I will attempt to uncover the real learning points that can be derived from these techniques, and their usefulness to any cloud application developer.

1. Using a cloud provider is not the same as using a hosting provider. It requires careful planning, process, and engineering effort over a period of time. Without all this, an organisation cannot leverage the full benefits of a cloud service.

2. Having an agile infrastructure alone cannot solve problems if your developers have to perform too many rudimentary operations to use it. That also leads to another problem – giving direct access to your developers and not being able to manage that access effectively. The AWS Admin Console is good from an Operations point of view (read: System Admin, DevOps). One choice could be creating development user roles in IAM, but the developers still have to live with the Ops view of the entire infrastructure. At a certain level of cloud usage, organisations may want to build higher abstractions (read: AWS Elastic Beanstalk style) over the existing functionality to reduce developer effort. Netflix built Asgard as a useful abstraction over the AWS infrastructure to provide powerful, easily consumable capabilities to its developers, thereby empowering them. Workflows and approvals could also be built in to provide a baseline for moderating access to the AWS infrastructure.
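To make this concrete, here is a minimal sketch of what a restricted developer role could look like, using boto3. The group name, policy name and the allowed actions are my own assumptions for illustration, not Netflix's actual setup:

```python
import json
import boto3  # assumes AWS credentials are configured in the environment

iam = boto3.client("iam")

# Hypothetical policy: developers may inspect and reboot instances and read
# CloudWatch metrics, but cannot terminate instances or touch IAM itself.
developer_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:Describe*",
                "ec2:RebootInstances",
                "cloudwatch:GetMetricStatistics",
            ],
            "Resource": "*",
        }
    ],
}

iam.create_group(GroupName="developers")
iam.put_group_policy(
    GroupName="developers",
    PolicyName="developer-limited-access",
    PolicyDocument=json.dumps(developer_policy),
)
```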

3. Avoiding infrastructure sprawl is also one of the important needs for an organisation dealing with cloud services. The elastic nature of the infrastructure, over a period of time, tends to create resources that are no longer required or used. Dealing with this manually means either having approval and expiration windows, or creating your own “garbage collectors”. Netflix created its own cloud garbage collector with Janitor Monkey. But the secret sauce is a service called Edda. Edda records historical data about each AWS resource via the Describe API calls, and makes it available for retrieval via search and as an input to Janitor Monkey for disposing of old resources that no one uses. Building an engine that records historical data about AWS resources and provides an easily consumable interface to identify old resources is the first part of the puzzle. The second part is to use this information to automatically delete these resources when they are no longer needed.
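To illustrate the first part of that puzzle, here is a toy sketch (not Edda itself) of recording timestamped state for EC2 instances via the Describe API, using boto3. Edda does this at a much larger scale, persists the data, and adds indexing and a query interface on top:

```python
import datetime
import json
import boto3

ec2 = boto3.client("ec2")
history = {}  # in-memory stand-in for Edda's persistent, indexed store

def record_instance_states():
    """Poll DescribeInstances and append a timestamped snapshot per instance."""
    now = datetime.datetime.utcnow().isoformat()
    for reservation in ec2.describe_instances()["Reservations"]:
        for instance in reservation["Instances"]:
            snapshot = {
                "time": now,
                "state": instance["State"]["Name"],
                "tags": instance.get("Tags", []),
            }
            history.setdefault(instance["InstanceId"], []).append(snapshot)

record_instance_states()
print(json.dumps(history, indent=2))
```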

4. Dealing with multiple environments and an ever-growing infrastructure requires deep discipline in how teams use the resources and perform day-to-day operations. Also, identifying who performed what operation on which resource is essential for overall system monitoring, to identify fault situations and perform effective resolutions in time. The Netflix way of handling this is by introducing a Bastion machine as an intermediary / jump-off point to access the EC2 instances. This allows Netflix to moderate access to the EC2 instances, implement an audit trail of operations performed, and also enforce security policies.
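This access pattern is easy to script as well. Below is a rough sketch using paramiko that reaches a private EC2 instance only through a bastion; the hostnames, users and key path are placeholders:

```python
import os
import paramiko

KEY = os.path.expanduser("~/.ssh/id_rsa")  # placeholder key path

# 1. SSH to the bastion host first.
bastion = paramiko.SSHClient()
bastion.set_missing_host_key_policy(paramiko.AutoAddPolicy())
bastion.connect("bastion.example.com", username="ops", key_filename=KEY)

# 2. Open a tunnelled channel from the bastion to the private instance.
channel = bastion.get_transport().open_channel(
    "direct-tcpip", ("10.0.1.15", 22), ("127.0.0.1", 0)
)

# 3. SSH to the private instance over that channel; it is never reached directly.
target = paramiko.SSHClient()
target.set_missing_host_key_policy(paramiko.AutoAddPolicy())
target.connect("10.0.1.15", username="ec2-user", key_filename=KEY, sock=channel)

_, stdout, _ = target.exec_command("uptime")
print(stdout.read().decode())
```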

5. As-a-Service business models tend to provide organisations with a flexible Opex model, but if not managed well, they can soon create more problems than solutions. One such issue is the unwarranted growth in utilisation, and thereby the total accumulated cost, of cloud resources. It is also paramount for an organisation to be able to slice and dice the costs incurred across multiple divisions and projects if a common AWS account is used. Visibility into how the organisation uses AWS resources, and the costs incurred during the course of operation, can be very helpful for effective charge-back and for reducing wastage. Netflix open sourced a tool by the name ICE (https://github.com/Netflix/ice) to provide visibility into the utilisation of AWS resources across the organisation, especially through the lens of operational expense.
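ICE works off the detailed billing files AWS publishes to S3. A toy version of the same idea – aggregating line-item costs by a project tag – could look like the following; the column names ("user:Project", "UnBlendedCost") are assumptions about the billing CSV layout:

```python
import csv
from collections import defaultdict

def cost_by_project(billing_csv_path):
    """Sum line-item costs per project tag from a detailed billing CSV."""
    totals = defaultdict(float)
    with open(billing_csv_path, newline="") as f:
        for row in csv.DictReader(f):
            project = row.get("user:Project") or "untagged"            # assumed tag column
            totals[project] += float(row.get("UnBlendedCost", 0) or 0)  # assumed cost column
    return dict(totals)

for project, cost in sorted(cost_by_project("detailed-billing.csv").items()):
    print(f"{project}: ${cost:,.2f}")
```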

6. One of the general rules of cloud-native development and deployment is not to spend time on recovery – essentially, replacement supersedes recovery. Replacement can only win if the time and money needed to replace are dramatically less than those needed to recover. EC2, like every other cloud environment, is volatile: things can go wrong, and EC2 instances may vanish into thin air. Hence, the rule of the game requires organisations to spend time on automation. Automation includes fast replacement, with minimal time spent on each replacement. One way to achieve this is by ensuring that a minimal number of steps are performed after instantiation of EC2 instances. This is possible by creating packaged AMIs that are baked with the installation and configuration of the required application stack. Alternatively, using services like AWS CloudFormation and integrated tools like Chef, the entire process of bootstrapping the application can be scripted. If the time and cost of bootstrapping via scripts exceed your threshold, then baked AMIs are a good choice. Baked AMIs, however, have drawbacks if frequent changes need to be made to the application installation and configuration. In my experience, a balance of baked AMIs and bootstrapping scripts provides a good alternative. Netflix, through the OSS tool Aminator, allows baking of AMIs. A one-time effort of creating these AMIs leads to faster instantiation of EC2 resources. This can also be used with CloudFormation to fully automate the infrastructure and application provisioning.
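The trade-off shows up directly in how an instance is launched. A small sketch with boto3 (AMI IDs, instance type and the user-data script are placeholders): the baked path starts a pre-configured image with nothing left to do, while the bootstrap path hands a script to cloud-init at first boot:

```python
import boto3

ec2 = boto3.client("ec2")

# Option A: baked AMI – the application stack is already installed in the image,
# so the instance is ready almost as soon as it boots.
ec2.run_instances(ImageId="ami-12345678",  # placeholder: pre-baked application AMI
                  InstanceType="m3.medium", MinCount=1, MaxCount=1)

# Option B: base AMI + bootstrap – cloud-init runs this script on first boot.
bootstrap = """#!/bin/bash
yum install -y java-1.7.0-openjdk
curl -o /opt/app.jar https://artifacts.example.com/app.jar
java -jar /opt/app.jar &
"""
ec2.run_instances(ImageId="ami-87654321",  # placeholder: plain base AMI
                  InstanceType="m3.medium", MinCount=1, MaxCount=1,
                  UserData=bootstrap)
```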

7. Netflix has provided good insight into how it leveraged SOA to accelerate its cloud strategy. Eureka from Netflix is a good fit in the overall SOA infrastructure: it provides a service registry for loosely coupled services and dynamic discovery of dependencies. Services can look up other remote services via Eureka and in return get useful meta-data about them. Eureka also helps short-circuit the connection between co-located (in the same zone/region) services: services in the same zone can talk to each other rather than to their distant counterparts located in other zones / regions. Eureka also helps record the overall health of individual services, allowing dynamic discovery of healthy service alternatives and thereby increasing the fault-tolerance capability of the system as a whole.
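As a sketch of what such a lookup can look like against Eureka's REST interface (the host, port and application name are placeholders, and the exact response shape depends on the Eureka version):

```python
import requests

EUREKA = "http://eureka.example.com:8080/eureka/v2"   # placeholder endpoint

def find_up_instances(app_name):
    """Ask Eureka for all registered instances of an application that report UP."""
    resp = requests.get(f"{EUREKA}/apps/{app_name}",
                        headers={"Accept": "application/json"})
    resp.raise_for_status()
    instances = resp.json()["application"]["instance"]
    # Each entry carries meta-data such as IP address, port and status.
    return [i for i in instances if i.get("status") == "UP"]

for inst in find_up_instances("RECOMMENDATION-SERVICE"):
    print(inst["ipAddr"], inst.get("status"))
```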

8. One of my favourite applications in the Netflix OSS toolset is Edda. Although I briefly touched upon the service in previous points, I still want to elaborate on the learning from this tool. Edda, as described in the last blog post, is a service that records historical data about each and every AWS resource used by the overall system. Through continuous tracking of state information about each AWS resource, it creates an index of state history, and thereby allows one to identify the changes that have gone into a resource over a period of time. The possibilities for this kind of tool are limitless. Not only does it create a version history for all the cloud assets / resources an organisation uses, it also allows search over that history, enabling queries like “what changed in this resource over the last few days?” or “when was this property set to this value?” All this helps in resolving complicated configuration problems, and can be used to perform analytics on how a cloud resource changes over time. The output of such analytics can then be used to drive better system design and more effective use of AWS resources. Look here for more :- https://github.com/Netflix/edda/wiki#dynamic-querying
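As a sketch of the kind of query this enables – the endpoint and matrix arguments below follow the wiki linked above, but treat the exact URLs as illustrative:

```python
import requests

EDDA = "http://edda.example.com:8080/api/v2"   # placeholder Edda endpoint

# All instance ids Edda currently knows about.
current = requests.get(f"{EDDA}/view/instances").json()

# Full history of a single instance, including records that no longer exist,
# expanded into complete documents ("what changed over time?").
history = requests.get(f"{EDDA}/view/instances/i-0123456789;_since=0;_expand").json()

print(len(current), "instances tracked")
print(history)
```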

9. I got introduced to the “Circuit Breaker pattern” from the book Release It! (http://pragprog.com/book/mnee/release-it). Michael Nygard provides a useful abstraction to contain further degradation of a system by failing fast. For instance, a service consumer calling a remote API is prone to many exceptional conditions like timeouts, service unavailability, etc. Having to manage each and every such scenario across all layers of your code is the first hurdle a developer has to go through. The second hurdle is to ensure the system does not keep rediscovering the same failure on every separate invocation of the service consumer. Circuit breakers can be configured with a threshold number of failures allowed in such situations. Once the threshold is crossed, the circuit breaker comes into action and returns a logical error without attempting the actual call to the remote API. This allows the system to be reactive and not waste critical resources on retrying failed scenarios. A circuit breaker dashboard can trigger alerts to the operations team, letting them be aware of such scenarios and plan for resolution. The overall system then runs with degraded performance, but without an actual blackout. Netflix created its own version of circuit breakers via the Hystrix project. Together with the Hystrix dashboard, it's an effective tool for the arsenal of a cloud geek.
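A minimal, illustrative circuit breaker in Python is shown below. Hystrix is far richer – thread pools, metrics, fallbacks, a dashboard – but the core fail-fast idea fits in a few lines:

```python
import time

class CircuitBreaker:
    """Fail fast after `threshold` consecutive failures; retry after `reset_after` seconds."""

    def __init__(self, threshold=5, reset_after=30):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after:
                # Circuit is open: return a logical error without the remote call.
                raise RuntimeError("circuit open – failing fast")
            self.opened_at = None          # half-open: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.time()   # trip the breaker
            raise
        self.failures = 0                      # success: reset the failure count
        return result

# Usage: wrap any flaky remote call.
# breaker = CircuitBreaker(threshold=3, reset_after=10)
# data = breaker.call(requests.get, "http://remote-api.example.com/items")
```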

Learning from Netflix – Part 1

I finally found some time to go over this awesome introduction to Netflix OSS.

Netflix OSS has emerged as a comprehensive platform for any organisation using AWS for its business. It has also inspired alternate cloud (read: private cloud) enthusiasts. Netflix OSS is a bold vision of how applications should be architected and designed at cloud scale, and is nothing short of a standard in its field.

I present here a summary of the tools and techniques discussed in Netflix OSS and Netflix's cloud-scale engineering :-
  1. Netflix uses 4 different AWS accounts – one each for the Dev/Test/Build, Production, Audit and Archive environments.
    1. The engineering team uses the Dev, Test and Build environments hosted under AWS Account #1
    2. Builds are promoted to the Production environment, which is hosted under AWS Account #2
    3. Continuous backups of the Production environment are taken and hosted under AWS Account #3, used for archiving. Usually the backup includes RDS and Cassandra backups
    4. A separate Account #4 is used for infrastructure dedicated to auditing purposes, e.g. SOX compliance. Keeping this infrastructure under a separate account isolates the developers, so they can keep releasing their functionality with minimal intervention from auditing requirements.
    5. Every weekend, Production data is moved over to the Test environment, so that QA can start using the data the very next week for test activities
  2. Two-factor authentication and IAM roles are standard for all access to AWS infrastructure (read: Good Practices for IAM)
  3. Security Groups are configured on each Resource ensuring the right ingress and egress filters.
  4. Netflix uses a Bastion machine that works as a jump-off server for everybody in the engineering team to access EC2 instances. The Bastion machine is given exclusive access to the AWS infrastructure. This means the team cannot access the instances directly; they have to SSH to the Bastion machine to reach the EC2 instances.
    1. The Bastion machine is also used for sideways copying (copying from one EC2 instance to another). This means copying directly between EC2 instances is NOT allowed without going through the Bastion.
    2. Audit Trail and Command history can be enabled for Bastion machine allowing moderation for each user
  5. Answers for AWS founder – Peter Sankauskas has made an awesome contribution to Netflix OSS by providing pre-baked AMIs for each of the Netflix components - http://answersforaws.com/resources/netflixoss/
  6. Tool #1 : Asgard (https://github.com/Netflix/asgard)
    1. Developer focussed AWS console that complements AWS Admin console (which is more Operations focussed)
    2. Used for Red/Black Deployment (http://techblog.netflix.com/2013/08/deploying-netflix-api.html)
  7. All application state (including application sessions) is available in Cassandra and Memcached to allow stateless application behaviour
  8. Tool #2 : ICE (https://github.com/Netflix/ice)
    1. Visualizes detailed costs and usage of AWS Resources
    2. Billing reports are sent to Managers for detailed analysis of utilisation per Project / Division
    3. Ensures effective use of AWS resources as per Budget allocated
  9. Build pipeline is managed via Jenkins
    1. Uses Cloudbees for Continuous Delivery
    2. Whenever a developer pushes code to a repo, CloudBees runs all the test cases associated with that code revision, verifying the sanity of the build
  10. Tool #3 : Aminator (https://github.com/Netflix/aminator)
    1. Used for baking custom AMIs that can be used for spinning up Application instances
    2. Takes a Base AMI as an input to the tool
    3. It uses an EBS volume that is mounted on a test EC2 instance while baking. The mounted partition is then chroot-ed, and all the relevant packages (as per the requirement of the application) are installed on this volume.
    4. This Volume is then used for creating the Custom AMI
  11. Tool #4: Eureka (https://github.com/Netflix/eureka)
    1. Acts as a Service Registry
    2. Used to help discovery of Services and related meta-data
    3. All services in an SOA environment register with Eureka, which helps it know the following :-
      1. When the Service boots up
      2. When the Service is ready for accepting requests
      3. All the meta-data about the Service – like IP Address, Zones where the service is running
      4. Health information of each Service
    4. Allows for in-memory lookup of information about services
    5. Every 30 seconds, registered services have to update Eureka with their meta-data (see the registration / heartbeat sketch after this list)
    6. Interesting use-case #1:- Service A needs to call Service B, and for that it uses Eureka to identify Service B's endpoint. The meta-data for Service B can include the IP addresses of its endpoints. Service A can use this IP address to call Service B directly without performing DNS resolution.
    7. Interesting use-case #2:- Service A can identify the whereabouts of Service B's closest endpoint. This allows for intra-zone calls instead of cross-zone / cross-region calls between Service A and Service B.
  12. Tool #5: Edda (https://github.com/Netflix/edda)
    1. Records the historical data about state of each AWS resource
    2. Keeps data for windows like the last 30 days / 15 days / 7 days / 3 days / yesterday / 12 hours, etc.
    3. Edda continuously issues Describe API calls for each AWS resource
    4. Timestamps the Information retrieved from Describe API calls
    5. Netflix Tool – Janitor Monkey can use this information to identify stale resources that can be released
    6. The information is indexed and allows for Search
  13. Tool #6: Archaius (https://github.com/Netflix/archaius)
    1. Archaius is a universal Property Console to set and read Properties
    2. Used to setup Properties across the entire Infrastructure
    3. Properties can also be bound to a particular AWS region
      1. For example:- suppose we want our application, hosted across multiple AWS regions, to use a particular value for a property based on the origin of the request. If the application's request comes from Europe, Archaius will give a different value of the property compared to the other regions
    4. Performs global configuration management
    5. Uses SimpleDB / DynamoDB
    6. Uses Cassandra within Netflix – for multi-region support
  14. Tool #7: Priam (https://github.com/Netflix/Priam)
    1. Performs deployment automation for Cassandra
    2. Self-organises the Cassandra cluster
    3. Priam is deployed on a Tomcat server on each Cassandra instance
    4. It helps manage a global Cassandra cluster
  15. Tool #8: Astyanax (https://github.com/Netflix/astyanax)
    1. Cassandra Client
    2. Has a range of Astyanax recipes
      1. For eg:- replicating the multi-part write functionality available in Amazon S3
  16. Tool #9: EVCache (https://github.com/Netflix/EVCache)
    1. Extends the functionality of Memcache
    2. Allows low latency data access
    3. Automates replication of data in different AWS zones
    4. When reads are issued, it gives you the data from the same Zone as the requestor
    5. Used for storing session state
  17. Tool #10: Denominator Library (https://github.com/Netflix/denominator)
    1. Library that provides a common interface to AWS Route 53, UltraDNS, DynECT DNS, etc.
    2. Programmatic access to various DNS functions
    3. Used for building Resilience in the DNS infrastructure
  18. Tool #11: Zuul (https://github.com/Netflix/zuul)
    1. Routing layer for any requests to the service
    2. Groovy scripts can be configured as Pre, Route, and Post filters for API requests
    3. Allows quick changes to be configured on API to test new behaviours
  19. Tool #12: Ribbon (https://github.com/Netflix/ribbon)
    1. Internal Request routing
    2. It allows for common client–server communication
    3. Also encapsulates common expected behaviour for errors and timeouts
    4. Built-in load balancer that can be configured to be zone-aware
      1. Takes into account the health of service dependencies across zones
    5. It can also use the Eureka service for discovery
  20. Tool #13: Karyon (https://github.com/Netflix/karyon)
    1. Server Container that provides base implementation of common service abstractions of Netflix OSS :-
      1. Capability to connect to Eureka
      2. Capability to connect to Archaius
      3. Provides hooks for Monkey testing (Refer to Simian Army toolkit of Netflix OSS tools)
      4. Monitoring hooks for health and status information
      5. Exports JMX information
    2. Allows the developer to be productive by avoiding repeated setup of development infrastructure for common Netflix OSS components
  21. Tool #14: Simian Army (https://github.com/Netflix/SimianArmy)
    1. Chaos Monkey – kills instances randomly and can be configured
      1. Ensures new instances can be spawned to allow for replacement
      2. Ensures architecture is resilient to failures
    2. Janitor Monkey – works with Edda to clean unused resources on AWS
      1. Reduces money spent on AWS by deleting resources which are not being used
    3. Conformity Monkey – detects discrepancies across configuration and infrastructure on AWS
      1. For e.g.:- is the AMI used for each instance the same or different?
      2. By alerting on discrepancies, DevOps can enforce standardisation
  22. Tool #15: Hystrix Circuit Breaker (https://github.com/Netflix/Hystrix)
    1. Allows for Failing fast and recovering fast
    2. Encapsulates the scenarios where failures happen while working with non-trusted dependencies
    3. A third-party call can be wrapped with a Hystrix wrapper so that all the exceptional scenarios are handled
    4. Also encapsulates Circuit breakers (http://en.wikipedia.org/wiki/Circuit_breaker_design_pattern), and toggles ON/OFF for each Circuit Breaker
  23. Tool #16: Hystrix Dashboard using Turbine (http://techblog.netflix.com/2012/12/hystrix-dashboard-and-turbine.html)
    1. Provides visual information about Circuit Breakers
    2. Identifies whether a service is healthy or not
  24. Tool #17: Blitz4J (https://github.com/Netflix/blitz4j)
    1. Non-blocking Logging
    2. Built on top of Log4j
    3. Allows for Isolation of application threads from Logging threads
  25. Tool #18: GCViz (https://github.com/Netflix/gcviz)
    1. Garbage collection visualizer
    2. Allows for identifying changes in GC behaviour when a particular JVM configuration is used
  26. Tool #19: Pytheas (https://github.com/Netflix/pytheas)
    1. Tooling framework
    2. Allows creation of Dashboards, Data Insights
    3. Archaius tool was built out of Pytheas framework
  27. Tool #20: RxJava (https://github.com/Netflix/RxJava)
    1. Functional Reactive framework
    2. Allows for building observables when dealing with multi-threaded systems
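As referenced in the Eureka notes above (Tool #4), each service instance registers itself and then heartbeats every 30 seconds. A rough Python sketch of that loop against Eureka's REST interface – the endpoint, payload shape and instance details are illustrative placeholders:

```python
import time
import requests

EUREKA = "http://eureka.example.com:8080/eureka/v2"   # placeholder endpoint
APP, INSTANCE = "RECOMMENDATION-SERVICE", "i-0123456789"

# 1. Register the instance with its meta-data (minimal, illustrative payload).
registration = {
    "instance": {
        "hostName": INSTANCE,
        "app": APP,
        "ipAddr": "10.0.1.15",
        "status": "UP",
        "port": {"$": 7001, "@enabled": "true"},
        "dataCenterInfo": {"name": "Amazon"},
    }
}
requests.post(f"{EUREKA}/apps/{APP}", json=registration)

# 2. Heartbeat every 30 seconds so Eureka keeps the instance registered.
while True:
    requests.put(f"{EUREKA}/apps/{APP}/{INSTANCE}")
    time.sleep(30)
```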

This is indeed great learning about tools and techniques, and I hope to use some of them in my project to get real hands-on experience with Netflix OSS. Next target – architectural learning from Netflix. That is coming soon.

 


From Archive – My talk at Business Technology Summit 2010

Riscy Ciscy : Learnings from a Mixed Breed Cloud from vivek juneja on Vimeo.

I found this video dug out from my archive. It is a short talk I gave at the Business Technology Summit 2010 in Bangalore. The talk described a project I had completed a short while earlier, in early 2010: building a mixed-breed cloud infrastructure for a large telecom provider. Mixed breed essentially means a mix of architecture choices – x86 and Sun SPARC in this case.

Scalable Cloud Backend for Mobile Developers

Mobile apps do not work in isolation. They need to work with a scalable backend to allow a customised experience for each user. Apps that need to provide storage, personalisation, integration with other services, and notifications usually require powerful backend features. Mobile app developers usually do not have experience with server-side backend implementation, and this becomes a bottleneck when building great mobile apps. Mobile platforms that allow easy integration of a scalable backend with mobile apps are on the rise. In this talk, we will look at how mobile developers can build a strong backend foundation for their next big mobile app. The talk will cover code examples and patterns that developers can readily apply to mobile apps.

BTW, Parse, as mentioned in the video, has been acquired by Facebook. It indicates that PaaS has arrived and is going to be a platform of choice.

The presentation follows :-

Notes from Flipkart Slash-N 2013 Conference

I was able to attend the Flipkart Slash-N 2013 Conference last week (1st Feb 2013). It was an interesting conference, with a lot of emphasis on lessons learned by Flipkart over the years. The speakers were good overall, and some of the invited panelists were fantastic.

Although it was tough to keep up with the constant flow of information, I was able to jot down notable stuff from some sessions.

Sharad Sharma – Keynote

1. Organizations turning Strategic Capabilities into Commoditized offerings

2. Commoditization is a boon to organizations. It helps in building an eco-system
- See Android, Hadoop

3. Two exciting books recommended
- Dealing with Darwin
- The Only Sustainable Edge

4. Search is key to Flipkart and other e-commerce companies

Panel Discussion – Data: Stores, Processing and Trends – Dr. Pramod Varma, Joydeep Sen Sarma, Ashok Banerjee, Utkarsh, Regunath Balasubramanian, Anshum Gupta

1. How do we select the Data Processing and Storage Platform ?

2. Selecting HBASE for scale (put it at the edge of the app, near the user)

3. HIVE – built for Analytics and Reliability

4. Selecting NoSQL – consider how frequently schema changes happen

5. Selecting between storing Aggregate vs. Exact information
- Sometimes we do not need accurate information to be available to the end user; an aggregate is good enough.

6. Check out storage companies
- http://www.nimblestorage.com/

7. Data de-duplication is a key concern

Flipkart – Lessons learned

1. Prioritize Read vs. Write Queries
- Critical and Non-critical Read and Writes
- Separately scale each type of Data infrastructure (Read scales separately from the Write)
- Find Slow Queries
- Can come from Front-end or Analytics system
- Separate the data infrastructure (separate Read Query infrastructure)

2. Isolate Databases

3. Logging can cause latency – too much instrumentation can be a problem

4. SOA – too many connections to services cause contention and race conditions

5. UI is highly configurable. Configuration is stored in Config DB.
- Config Cache was built
- Agent model that sits on the Web Server and provides a proxy for Cache hits
- This was a bespoke solution
- Did not use Memcache as it would have required PHP to make a separate Network call

6. Deployment was using Ring pattern (stage wise deployment)

7. GearMan – a framework that allows for asynchronous inter-service communication (similar to the agent model built by Flipkart)

8. Flipkart was caching fully constructed pages for a few minutes
- Only for high-traffic pages – like the homepage
- But lots of changes happen every few minutes

9. Flipkart cached PHP serialized product pages

10. Found that caching impacts A/B testing heavily

11. Explored Thrift Serialization

12. Cache modification was done by separate processes driven by cache change notifications. This was built to introduce asynchronous behaviour and avoid blocking calls to the caching system
- The cache change notification is serviced by their front-end layer

13. Cache was sharded

14. Evaluate Membase and CouchBase

15. Cache TTL invalidation – the hardest problem to solve
- made the TTL infinite
- on-demand partial invalidation is done by the cache modification layer (see the sketch after this list)
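As referenced in point 15, here is a rough sketch of what "infinite TTL plus on-demand partial invalidation" can look like; the in-process dictionary stands in for Flipkart's sharded cache tier, and the naming is my own:

```python
class ProductPageCache:
    """Entries never expire on their own; the cache-modification layer
    invalidates or refreshes them when the underlying product changes."""

    def __init__(self, render_page):
        self.render_page = render_page   # expensive page construction
        self.store = {}                  # stand-in for a sharded cache tier

    def get(self, product_id):
        if product_id not in self.store:             # miss: build once, keep forever
            self.store[product_id] = self.render_page(product_id)
        return self.store[product_id]

    def on_product_changed(self, product_id):
        """Called by the (asynchronous) cache-change notification path."""
        self.store.pop(product_id, None)              # partial, on-demand invalidation

# cache = ProductPageCache(render_page=build_product_html)
# cache.on_product_changed("SKU123")   # triggered when catalog data changes
```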

IaaS giving way to “Data Science as a Service”

As organisations start embracing the public infrastructure cloud for their critical data and application needs, Data Science SaaS will become more practical to use. The biggest challenge for Data Science SaaS is to have customers expose / store their data to / on a multi-tenant public platform. Increasingly, public cloud services are becoming a compelling ground for enterprises. This means that, along with applications, data also becomes a good prospect to be moved to these platforms.
This is driven by the low data-access latency requirements of apps in the cloud and by the increasing confidence in shared public cloud services.
Once the data is out of the enterprise data center, inter-cloud service integration becomes easier. For eg:- imagine an AWS EC2 infrastructure running your enterprise application, with data lying on S3 / RDS. To extract business intelligence and inferences from this data, it is practical to use an existing public SaaS that can work on this data, rather than using in-house analytics infrastructure. Greenfield apps having their genesis on the public cloud are already candidates for Data Science services (aka Analytics as a Service).
The future of Data Science SaaS looks promising. Startups like Datameer and ClearStory are doing pioneering work on this.
More on this at :-

http://gigaom.com/2012/07/05/want-to-ditch-your-data-scientists-heres-are-7-startups-that-can-help/