The real cost of Change: Perspectives on adopting the DevOps, Microservices and Containers bandwagon


This happens quite often in the life of an organisation. New plateaus emerge, technologies become outdated, and the growing ball of mud becomes too big to endure. Technology teams become fascinated with the prospect of playing with shiny new things: some desperately try to add new skills to their resumes on the excuse of making “radical” changes, while others stand dumbstruck at the plethora of choices in front of them. Half-baked understanding leads to shallow inferences, which often get propagated to the “real” decision makers. Meeting after meeting, often mixed with a hint of “consultants” and “industry pundits”, becomes the immediate action item. Google searches for adoption stories and blog posts keep humming in the background as the gardeners of technology hunt for information. Conferences and meetups become an important part of the monthly ritual, with some preferring to digest the video recordings over physical interaction. Graphs emerge, some inflated by the hype cycle, and some hand-crafted with an elementary understanding of basic statistics.

For an organisation that is torn between re-inventing itself as a technology company and preserving its heritage, this journey into the unknown is becoming more difficult than the actual process of adoption. The journey represents the chaotic endeavour of working out the implications of evolving technologies and making them relevant to the actual business. The situation is better in organisations that have embraced technology as a key asset, investing heavily in gathering enterprising minds with the depth and breadth to sustain this unpredictable journey. Others have to settle for the ordinary: people who are stuck in their ways, fighting change as if it were a virus infecting their existence. The rest are still in the midst of transformation, carrying battle scars from using technology to operate their businesses, yet reluctant to acknowledge the real problem. It is for them that this transformation, aided by the constant reminder of their ball of mud and the damage it is doing, is the most complex.

Microservices present themselves as Service Orientation done right. DevOps presents itself as Agile done right. Containers present themselves as virtualisation done right. Amid the growing stories and rising expectations from these transformation-enabling practices, organisations are increasingly finding it hard to identify the real cost of this transformation. That cost lies in acknowledging the change these practices will have on people, tools, process, culture and organisational politics. Most organisations are easily swayed by the ever-increasing benefits of adopting these practices, without due diligence on the after-effects if the practices are not embraced completely. This is in part quite similar to the accounts of misplaced Agile adoption. The poorly understood Agile manifesto remains a mere mention on the presentation slides that technology teams show to the business. In retrospect, most Agile horror stories emerge from organisations whose teams practiced some form of the methodology with poor attention to continuous deployment, continuous testing, continuous integration, continuous release, backlog grooming, and well-formed user stories. Teams with a poor implementation of Agile often get stuck with the “religious” rituals – stand-up meetings, retrospectives, backlogs, burndown charts among others – and treat these rituals as the only basis for adopting Agile practices. These stories should be acknowledged, and treated as a reference for how good intentions, with misplaced expectations and understanding, can wreak havoc on an organisation.

The usual way of pitching this transformation-enabling trio – Microservices, DevOps and Containers – is to recount the benefits they bring when modernizing existing products and teams. Monoliths will be broken into Microservices, allowing a finer granularity that leads to coherent sets of functions, each independent of the rest of the herd. DevOps is the way each team will build and operate these Microservices, giving each team a shared definition of success across the diverse set of roles within it. This will lead to ownership, meaning people in these teams will care for their work, and the exposure of running and operating the services in production will lead to better accountability and discipline. Containers will become the way these Microservices are packaged, deployed and released on infrastructure. This will lead to better infrastructure utilisation, and simplify the way a change moves from a developer’s machine to the production environment.

While all these benefits do matter, the field of view of a technology team in such an organisation usually misses the hidden, and sometimes under-accounted, concerns that will emerge from this change. These concerns provide a basis for leveraging the practices of Microservices, DevOps and Containers as a whole, rather than in parts. Can Microservices be adopted without DevOps? Can Containers be adopted without considering Microservices? These questions are fundamentally more insightful than the mere adoption of the practices. To follow up, one could probe even further: what happens if I choose to build Microservices without using Containers and without leveraging DevOps?

To identify the real cost of this transformation, it is relevant to talk about Complex Systems and how they function. A Complex system is a system of systems – much like the human mind. In his seminal book “The Society of Mind”, the late, extraordinary AI pioneer Marvin Minsky provides a view of the mind structured as a society. The society contains independent, service-driven functions, or members. Each member is unaware of the implications of its effort and function. Yet accomplishing something as complex as picking up a cup of tea to drink requires the hierarchical execution of numerous such functions. Minsky theorizes that the execution of these independent functions constitutes the state of mind. So the mind itself manifests in the various ways these functions work together, each oblivious of its impact on the greater self. Like the mind, a system of systems cannot function just by being there. It needs ways to manage the growing complexity of tens, hundreds, thousands and more such functions. It needs techniques to self-evolve through the continuous evolution of those functions, each of which is easy to reason about and easy to evolve. Away from the mind, a large system composed of many systems, each of which could be a Microservice, evolves continuously. Its evolution is marked by the individual evolution of each of the many systems it is composed of. Each such system is changing, and each needs a way to move its changes into operation reliably and in a timely manner, without disturbing the overall state of the large system.

A large system, like a banking application, when composed of Microservices may have hundreds of such systems. Each system is a collection of one or more Microservices. At the Microservice level, each system is simple and independent. The banking system at large, however, is complex. Compare this to an earlier iteration of the banking application, which could have been a monolith. It also had many systems, but they were neither independent nor simple. Even with the best design practices, the systems inside the monolith would have shared state and shared execution. Partitioning these individual systems out of the monolith and running them on individual machines (physical or virtual) means having that many machines available in your arsenal. A monolith may require only a few machines to run and operate, but with the partitioning, the individual systems require independent machines. These individual systems now need independent coverage for high availability, resilience and fault tolerance. In the case of a monolith, all such coverage would have been for just one big system: a Complex system in the form of a monolith requires a single, focussed effort for scaling, availability, fault tolerance and operational activities. The fallacies of distributed computing now apply to each of these independent systems.

A Complex system built with Microservices has more moving parts. The individual components of this system need individual attention and operation; the individual attributes apply to each member inside the Complex system. Effectively, we have now multiplied our problems by the number of individual systems or services that exist within the Complex system. The evolution of such a Complex system depends upon the many evolutions that happen among its many members. If a particular behavior of the Complex system requires change across many internal systems or services, this means a combined effort of changes across all participating services. One slow-moving change in any participating service means a slow-changing Complex system. Each participating service now needs a way to move fast, and needs access to capabilities that let it produce change faster. This puts stress on changing the dynamics of each team that runs an individual system.

A study of Complex systems cannot be complete without mentioning testing. Testing a Complex system is harder when it is composed of many smaller systems like Microservices, each of which is independent and constantly evolving. When a Complex system grows even bigger, with hundreds of individual services, the only way to reason about problems is by teasing out failures in various parts of the system and observing how the individual systems behave. This is in itself a complex activity, and requires new practices, tools and processes to be useful. A Complex system, hence, is like an organism: it’s messy, chaotic, continuously evolving, and requires radical ways of understanding and making changes. Planning a change on such a system cannot happen with the traditional mindset of control from the ivory tower.

The real cost of adopting Microservices, DevOps and Containers is acknowledging the increasing complexity of a system based on these practices. The end system is a Complex being that is fast, agile and continuously evolving. Beneath its covers, however, is the mess and chaos of many participating services that function independently, each oblivious of the overall goal of the Complex system. At the service level it is easy to reason about problems and concerns, but at the Complex system level this is a completely new game. Existing practices will not scale to such a system; instead they will increase the cost of sustaining it. The chaos, if managed from the ivory tower, will look like a mess, and may disappoint the power hoarders who like to know and control everything. The way to arrive at the real cost of change is to question the existing practices, processes and culture in an organisation in the light of this new Complex system. How will your organisation change when you have to deal with this Complex being? That is the question that merits more insight, and that will provide better answers to influence the success of these practices.

Notes: http://highscalability.com/blog/2015/4/27/how-can-we-build-better-complex-systems-containers-microserv.html


Creating a Good Developer Experience


One common aspect of building in, and for, a Service Oriented system is the idea of “Standardizing the Developer Experience”.

What is Developer Experience?

A Developer in a Service Oriented project has to deal with multiple concerns. Each such concern is vital to the success of the project, and essential to managing the chaos in a Distributed system.

Let us look at some of these concerns :-

1. How does a Developer decide on the size of a Service?

2. How does a Developer ensure that she is building a consistent set of API interfaces to her service?

3. How does a Developer provide API mocks or stubs for other developers to use when consuming her service?

4. How does a Developer provide consistent documentation for the Service APIs that she developed?

5. How does a Developer help prevent failures for others as a result of using the Service she built?

6. How does a Developer ensure she is providing health-check capability in the Service?

7. How does a Developer ensure the service is ready for instrumentation and monitoring?

8. How does a Developer know how to publish her service?

9. How is a Developer able to provide consistent test cases and test coverage for the service?

10. How does a Developer get access to test the service she developed along with the other dependent services?

11. How is a Developer able to deploy the Service anywhere, be it a private data center or a public Cloud service?

12. How is a Developer able to consistently provide Authentication and Authorization capability in her service?

13. How does a Developer manage and support the variety of Consumers / API Clients?

14. How can a Developer reliably make a change to code and ensure that this change does not affect other dependent services?

15. How can a Developer ensure that a change can be moved to the Production environment reliably, without major human interference? :)

16. How does a Developer ensure that the Service she built is easy to debug?

17. How does a Developer know how much capacity the Service needs to run?

18. How does a Developer know how to scale the services horizontally and independently?

19. How does a Developer reliably use the good patterns for a Distributed system, like Circuit Breakers and Timeouts?

All of the above concerns are essential to creating a good Developer Experience. A good Developer Experience ensures better code quality, reliable and stress-free changes to a system, and independent teams.

As we develop the POC for demonstrating Service Design and Domain Isolation, it is essential for us to also show that this isn’t possible without creating a good Developer Experience.

The goal is not just to create an experience, but to standardize it.

What do I mean by Standardizing?

Standardizing means making all of these concerns part of a standard template. For example, we know about email templates: they are used to create consistent emails for all recipients. Similarly, standard templates for the Developer Experience will avoid mismatches amongst developers, and will reduce the burden of managing all of these concerns for a Developer. This ensures that the Developer can quickly and reliably focus on building the actual logic, instead of managing this mess. A rough sketch of what such a template might wire up follows.
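
To make this concrete, here is a minimal sketch, in Python with Flask, of what such a service template might bundle. This is purely illustrative: the factory name create_service, the “X-Correlation-Id” header and the in-memory metrics counter are my own assumptions, not a prescription.

```python
# A hypothetical starter template: every new service begins from this
# factory, which wires up cross-cutting concerns (health check,
# correlation ids, crude metrics) so a team only adds business routes.
import logging
import uuid
from collections import Counter

from flask import Flask, g, jsonify, request

metrics = Counter()  # illustrative in-memory request counter


def create_service(name):
    app = Flask(name)
    logging.basicConfig(level=logging.INFO)

    @app.before_request
    def tag_request():
        # Reuse the caller's correlation id, or mint one at the edge.
        g.correlation_id = request.headers.get("X-Correlation-Id",
                                               str(uuid.uuid4()))
        metrics[request.path] += 1

    @app.route("/health")
    def health():
        # Load balancers and monitors probe this endpoint.
        return jsonify(status="UP")

    @app.route("/metrics")
    def show_metrics():
        return jsonify(dict(metrics))

    return app


# A team then adds only its business logic:
app = create_service("orders")


@app.route("/orders/<order_id>")
def get_order(order_id):
    return jsonify(id=order_id, correlation_id=g.correlation_id)
```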

The good part of this template is that it can be shared with anyone. Internal teams or partners – anyone can have access to this template and build reliable code with certainty.

But a good Developer Experience cannot prevent bad code :) It can help good developers become better, and build the consistent, standardized system required for a successful Service Oriented product.

And creating a good Developer Experience is NOT expensive or time consuming. It is easy with the wealth of tools available. All we need to do is build this template from the various tools available and bring them all together.


Revolution in a Large Enterprise


In a few weeks, I will be consulting for a large organization that is in the midst of a revolution. A revolution that strives to induce new energy and innovation into the organization while dealing with its legacy and traditional mindsets. In fact, the major revelation in all of this is the idea that the organization is no longer a traditional brick-and-mortar company. Instead, like its predecessors and competitors, it is destined to become a Platform company. It is destined to become a Technology company with deep roots in traditional business. It is finally destined to treat IT not just as a cost center or a support function, but as a key ingredient of its future.

The bigger challenge surrounding this revolution is maintaining a delicate balance between the current short-term targets and the bigger, bolder bets it needs to make in the times ahead.

For a while now, the majority of activities surrounding this revolution have been tactical rather than strategic. Partly this is because of the pressing need to get the house in order, and to hit targets that have missed their timelines multiple times.

Although the organization understands its position in the market, it still has miles to go to gain a significant presence in the new business areas into which it wishes to expand.

I, for the most part, will concentrate my energy on creating an engine of software and technology innovation that provides a standard boilerplate for all existing and new projects.

As I plan to own these activities, I have been looking around for ideas, inspiration and practices. Partly, this will help me pile up my arsenal for how I would approach solving this technical riddle for the organization.

My research and curiosity lead me down multiple lanes, each with its own ideology and practices.

One of my first stops was the Lean Startup movement by Eric Ries, inspired by Steve Blank’s Customer Development methodology. I borrowed some ideas around Lean Startup, and the MVP (Minimum Viable Product), in previous projects, and found them deeply useful for reducing noise and creating essential focus when building projects. However, a common trail of disappointment I faced was the constant withdrawal of people from adopting these ideas in the mainstream enterprise, primarily due to fear of the unknown or new. Where people did get interested, they were deeply disappointed as they faced practical challenges and resistance from other teams.

I found interesting work by Telefonica in adopting the Lean Startup ideology to create very useful and innovative new services for its customers. The Lean Enterprise book by Trevor Owens and Obie Fernandez was also a fantastic investment for me, helping me navigate the particulars of what it takes to adopt Lean Startup in large organizations. I resonated with the entrepreneurial framework the book introduces for innovating and making enterprises go Lean.

My next stop was IDEO, and its Human Centred Design Thinking. I was introduced to this design-led thinking a couple of years back, thanks to a project I did while participating in the Acumen-IDEO course with some friends. My experiences with the course and the project were extremely exhilarating, and I found it to be an excellent framework for innovating and creating demonstrable results in a short amount of time. Although the focus of that course was Social Innovation, the learning from the project hinted that these methods are just as useful for solving problems in a large enterprise.

Steve Blank, however, had a very interesting take on the commonalities and differences between Lean Startup and Human Centred Design Thinking on his blog. He argued that both processes are indeed useful for large enterprises, but that Lean Startup is more about getting there first and then iterating, while Design Thinking is more about getting it right.

I then looked at some essential readings that my friends and mentors recommended. The Goal by Eliyahu Goldratt, and The Phoenix Project by Gene Kim, Kevin Behr and George Spafford, were my first stops. Both are highly recommended readings, and I kept going back to them at each step of my research. I have a laundry list of things to do thanks to these two books. I am still amazed at how relevant and similar Operations Research and Lean Manufacturing are to modern IT and technology development. I still have multiple bookmarks and to-dos pending, even after going over the books a couple of times.

After having looked at ideas from Lean Startup in the enterprise, IDEO’s Human Centred Design Thinking, DevOps and Lean Manufacturing, I focussed on my other favorite topics – Cognitive Psychology and Behavioural Economics.

Cognitive Psychology and Behavioural Economics have been good casual reading for me for some time, thanks to writers like Steven Pinker, Dan Ariely and Daniel Kahneman. One aspect that stuck with me is how deeply we underestimate and undervalue the study of human psychology as an essential ingredient in creating a technology-rich and innovative organization. After making some notes on the science behind how we think, perceive and decide, it was quite evident that the more we understand about what’s inside, the better we can build things and structures outside. More than ever, it gave me some interesting ideas on asking the right questions, introspecting, and creating opportunities for the adoption of new technologies. I had, in the past, found that teams have considerable trouble taking risks, making bold bets and having faith in doing something innovative. One of my mentors rightly said that the problem in most cases is not the technology, but the humans who develop and use it.

After spending weeks and months on this, I still do not have an exact answer to my original quest. Though I am aware the answers will not be evident immediately, I am glad to have raised some key questions, and to have pointers to experiment with and evaluate.

In this quest, I have created a new GitHub project that will in part be a collection of my notes, questions and inferences. The other part will be the toolsets I build and use as I work towards creating a Technology Platform for this organization.


What to look out for when building Microservices


Sam Newman has a great presentation, currently listed on Vimeo, about “Practical considerations for Microservices Architectures”. It is from the 2014 edition of the JavaZone conference in Oslo. I found the talk valuable for first-movers who are planning, or are in the middle of, building or transitioning to a Microservices-based architecture.

I wanted to put together the important points covered in the talk as a checklist for myself when building Microservices, and hopefully this is useful for others as well.

A Summary of all the important points covered in the talk :-

  1. To understand what composes a Microservice, it’s important to know about the Bounded Context as a way to define the Service Boundary. Eric Evans has fantastic coverage of this in his book “Domain-Driven Design”.
  2. Microservices have to do with more than just technology and a new architecture style. They have to do with how organizations and teams are formed. A knowledge of Conway’s Law and its implications is useful for understanding the “people” part of this paradigm shift.
  3. The main goal of Microservices is to improve the speed of innovation by allowing heterogeneous technology choices and building in agility.
  4. It’s important to standardize the gaps between the services, instead of worrying about what goes inside a service. Common things to standardize between the services :-
      1. Interfaces – REST over JSON, for example.
      2. Monitoring – Application, Infrastructure
      3. Deployment and Testing
      4. Safety practices – making sure a service does not fail others in a system
  5. If everybody “owns” a service, then “nobody” owns the service. Make independent teams that are accountable for services; shared responsibility leads to reduced accountability. Assign a set of services to a team, let them own it, and let them be responsible for the decisions to build, operate and maintain it.
  6. For shared services, it is advisable to rotate ownership between teams, like a “rotating custodian”. These kinds of services should have a clear custodian model.
  7. Strong coupling, as always, is bad. It’s advisable to avoid shared databases and shared serialization protocols for communicating across services. Instead, use lightweight, resource-oriented, open protocols like REST.
  8. A good tip for breaking down existing functionality into services: “Separate databases before separating services”. This is a good rule of thumb when isolating services, and a good model when breaking down monolithic systems.
  9. When thinking about designing the service behaviour and interface definition, a good piece of advice is to adopt a “Consumer first” approach – think through the various types of consumers who would use the service, and what they would use it for. Planning for API documentation and a developer test console is also greatly advised. Tools like Swagger are useful in this context.
  10. Monitoring is an essential need, not a last-minute thing, in the Microservices world. Investing in monitoring across all layers and constituents of a Microservices architecture is as important as developing the services in the first place. A good start is to think about how monitoring information could be collected, aggregated and visualized. Sam recommends the Logstash and Kibana stack from the Open Source world for log monitoring and analysis. Yammer Metrics or Netflix’s Servo is a good pick if you are interested in program counters. Graphite with statsd also stood out as a good pick when thinking about monitoring.
  11. Synthetic transactions, or semantic monitoring, are an interesting way to check the health of a production system. This could mean, for example, having tests that periodically create, cancel and return an order in an e-commerce system to check that everything is working. We need to be careful in picking these end-to-end tests to ensure we don’t do anything destructive while verifying the flow from time to time. Another useful tool here is Mountebank, which allows quick development of service stubs, useful when a service needs to be tested in isolation.
  12. Using a correlation id that is generated and passed along to all upstream and downstream systems via logs is a good step towards easier debugging of issues. Together with metrics in the logs via Yammer Metrics, it’s a potent combination for devising call graphs when diagnosing service issues and latencies (a sketch of this follows the list).
  13. Allowing teams to build their services independently needs a standard way of deployment to enable fast production rollouts. One recommendation, therefore, is to evaluate toolsets that abstract away underlying deployment differences, like Packer from HashiCorp. Container-based deployment using technologies like Docker is also commonplace.
  14. Independent teams building Microservices frequently run into problems when they need to test a service without breaking other dependent service consumers: a change in a service should not break upstream or downstream systems. Hence the concept of Consumer-Driven Contracts is recommended, wherein service consumers specify their expectations as “tests”. These tests are run as “unit tests” when building the service. Tools like Pact can be used to test Consumer-Driven Contracts (see the second sketch after this list).
  15. Reduce the tendency to have a large release scope. This means eliminating the need to release multiple services together, as it breeds coupling. Instead, try as much as possible to release services independently. The idea is to not let change build up: release one at a time, as often as possible.
  16. In a multi-service environment, cascading failures are a big risk. They can lead to conditions where one service’s failure brings down the entire system. In most cases, the system should survive service outages by working in a degraded mode or with partial failure. There are multiple ways to reduce cascading failures, mostly originating from the book “Release It!” by Michael Nygard, including patterns like “Bulkheads”, “Circuit Breakers” and “Timeouts”. Hystrix from Netflix is an interesting implementation of the Circuit Breaker pattern and is widely used.
  17. In a Microservices-based system, the onus is on moving fast with strong ownership. Allowing teams to independently own services and be accountable for their operation requires a degree of discipline and coordination. The essential idea is for these teams to leverage polyglot technologies and techniques to deliver services with standard interfaces, monitoring and fail-safe mechanisms. A common approach is to build a Service Template as a boilerplate: a self-contained, good starter project for any service team. The Service Template encapsulates the common essentials for any service – monitoring, metrics, API tools, fail-safes like circuit breakers, and deployment, among others. This gives independent teams the easiest way to do the right thing – build polyglot services in a standardized, non-chaotic manner. Netflix’s Karyon and DropWizard are good examples of Service Templates.
  18. When starting with Microservice orientation, it’s important to focus on “How many services are too many?” instead of “How big or small is the Microservice?”. It’s preferable to start gradually with a small set of services and ramp up as confidence improves. More services at the initial stage means more moving parts to manage, which may hamper progress.
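
On point 12, here is a minimal sketch of correlation-id propagation, assuming a Flask-based service; the header name “X-Correlation-Id” is a common convention I am assuming, not something the talk prescribes.

```python
# A sketch of generating and propagating a correlation id so that log
# lines from different services can be joined into a single call graph.
import logging
import uuid

from flask import Flask, g, jsonify, request

app = Flask(__name__)
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("orders")


@app.before_request
def read_or_mint_correlation_id():
    # Reuse the id sent by the upstream caller; mint one only at the edge.
    g.correlation_id = request.headers.get("X-Correlation-Id",
                                           str(uuid.uuid4()))


@app.after_request
def echo_correlation_id(response):
    # Echo the id back, and forward it on any outbound calls, so every
    # hop logs the same value.
    response.headers["X-Correlation-Id"] = g.correlation_id
    return response


@app.route("/orders/<order_id>")
def get_order(order_id):
    log.info("fetching order %s correlation_id=%s",
             order_id, g.correlation_id)
    return jsonify(id=order_id)
```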
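And on point 14, a hedged sketch of a consumer-driven contract using the pact-python library; the service names and the /orders endpoint are invented for illustration, and the exact API may vary between pact-python versions.

```python
# Consumer-side contract test: the consumer declares what it expects
# from the provider; the resulting pact file is later verified against
# the real provider build.
import atexit
import unittest

import requests
from pact import Consumer, Provider

pact = Consumer("OrderUI").has_pact_with(Provider("OrderService"))
pact.start_service()  # starts the local Pact mock service
atexit.register(pact.stop_service)


class GetOrderContract(unittest.TestCase):
    def test_get_existing_order(self):
        expected = {"id": "42", "status": "CREATED"}
        (pact
         .given("order 42 exists")
         .upon_receiving("a request for order 42")
         .with_request("GET", "/orders/42")
         .will_respond_with(200, body=expected))

        with pact:  # verifies the declared interaction actually happened
            response = requests.get(pact.uri + "/orders/42")

        self.assertEqual(response.json(), expected)


if __name__ == "__main__":
    unittest.main()
```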

I think most of the ideas are very practical and definitely a must in the arsenal of a Microservices practitioner.

The upcoming book by Sam Newman, titled “Building Microservices”, should also be a great read, based on the content he has already covered in the aforementioned talk.


EC2 Deep Dive Notes


I recently chanced upon an interesting high-level coverage of the insights that go into making EC2 instances performant. The coverage was in a talk titled “Amazon EC2 Instances Deep Dive” at AWS re:Invent 2014, and was well articulated by the speakers.

I would like to share a couple of points that stood out for me. That said, going through the video is the recommended option, and a good use of 40 minutes of your week.

  1. It’s important to understand the difference between PV and HVM. In short, PV refers to a modified operating system that is hypervisor-aware, thus reducing the cost of translating system calls into hypervisor calls. HVM refers to hardware-assisted virtualization, supported by the Intel VT extensions; in HVM, the operating system need not be modified. The AWS engineers recommend PV-HVM AMIs over plain PV AMIs.
  2. It’s important to understand the difference between the new generation and class of EC2 instances. The new generation instances include c3, i2 and t2. t2 is among the cheapest instances in the EC2 family.
      • t2 leverages CPU Credits. A CPU Credit provides full CPU core performance for a minute. It’s important to understand how t2 uses CPU Credits, and to be able to monitor this parameter for greater performance.
      • It’s also interesting to understand burstable performance, and which workloads can thrive on it. t2 instances provide burstable performance.
      • i2 instances provide high I/O, especially for NoSQL databases.
        1. The AWS engineers recommend a 3.8.0+ kernel version to leverage the high I/O rate.
        2. The new Amazon Linux AMIs already run 3.8.0+.
        3. Issue TRIM using fstrim -a for “SSD”-backed i2 instances if using new kernels.
        4. For older kernels or Windows, reserve 10% of the disk using a partition and never touch it. This is referred to as “over-provisioning”.
        5. TRIM and over-provisioning avoid garbage collection and improve I/O performance.
      • c3 instances provide high compute with Enhanced Networking (SR-IOV).
  3. PV-HVM is better than PV-based Xen modes, so using PV-HVM AMIs is recommended.
  4. Use TSC as the clocksource instead of the Xen pvclock by modifying kernel parameters.

Here is the video that covers it all, and is my pick for the week :-

Till then, happy optimizing!


Go beyond just being Cloud Ready: Be Cloud Native

I recently got a chance to author a write-up for The New Stack about building Cloud Native applications. There is so much happening around Cloud services and building Cloud Native applications that not a single day goes by without a mention on the internet.

In the article, I have talked about some patterns and technologies that are useful to consider as individuals and organizations embark on building applications on cloud platforms.

Here is the link to the article :-

http://thenewstack.io/reactive-frameworks-microservices-docker-and-other-necessities-for-scalable-cloud-native-applications/

I will be writing more on this soon, with practical tips around technology stacks and open source projects that could be helpful.


Learning from Netflix – Part 2


In the last blog post, I listed the tools and practices introduced by Netflix in their presentation at AWS re:Invent 2013. In this second part of the series, I will attempt to uncover the real learning pointers that can be derived from these techniques, and their usefulness to any Cloud application developer.

1. Using a Cloud provider is not the same as using a hosting provider. It requires delicate planning, process and engineering effort over a period of time. Without this, an organisation cannot leverage all the benefits of a Cloud service.

2. Having an Agile infrastructure alone cannot solve problems if your developers have to perform too many rudimentary operations to use it. That also leads to another problem – giving direct access to your developers and not being able to manage it effectively. The AWS Admin Console is good from an Operations point of view (read: System Admin, DevOps). One choice could be creating development user roles in IAM, but the developers would still have to live with the Ops view of the entire infrastructure. At a certain level of usage of cloud services, organisations may want to build higher abstractions (read: AWS Beanstalk style) over the existing functionality to reduce the effort for developers. Netflix built Asgard as a useful abstraction over the AWS infrastructure to provide powerful and easily consumable capabilities to its developers, thereby empowering them. Workflows and approvals could also be built in as a lightweight mechanism for moderating access to the AWS infrastructure.

3. Avoiding infrastructure sprawl is also one of the important needs for an organisation dealing with Cloud services. The elastic nature of the infrastructure tends, over time, to create resources that are no longer required or used. Dealing with this manually means either having approval and expiration windows, or creating your own “garbage collectors”. Netflix created its own cloud garbage collector with Janitor Monkey. But the secret sauce is the service called Edda. Edda records historical data about each AWS resource via the Describe API calls, and makes it available for retrieval via search, and as an input to Janitor Monkey for disposing of old resources that no one uses. Building an engine that records historical data about AWS resources and provides an easily consumable interface for identifying old resources is the first part of the puzzle. The second part is using this information to automatically delete those resources when they are not needed.

4. Dealing with multiple environments and ever-growing infrastructure requires deep discipline in how teams use resources and perform day-to-day operations. Identifying who performed what operation on which resource is also essential to overall system monitoring, to identify fault situations and perform effective resolutions in time. The Netflix way of handling this is to introduce a Bastion machine as an intermediary / jump-off point for accessing the EC2 instances. This allows Netflix to moderate access to the EC2 instances, implement an audit trail of the operations performed, and also implement security policies.

5. As-a-Service business models tend to provide organisations with a flexible Opex model, but if not managed well, they can soon create more problems than solutions. One such issue is the unwarranted increase in utilisation, and thereby in the total accumulated cost, of cloud resources. It is also paramount for an organisation to be able to slice and dice the costs incurred across multiple divisions and projects if a common AWS account is used. Visualising how the organisation uses AWS resources, and the costs incurred in the course of operation, can be very helpful for effective charge-back, and even for reducing wastage. Netflix open sourced a tool by the name of Ice (https://github.com/Netflix/ice) to provide visibility into the utilisation of AWS resources across the organisation, especially through the lens of operational expense.

6. One of the general rules of cloud-native development and deployment is not to spend time on recovery: essentially, replacing supersedes recovering. Replacing can only win if the time and money needed to replace are dramatically less than those needed to recover. The cloud, and EC2, is a volatile environment, like every other cloud provider’s: things can go wrong, and EC2 instances may vanish into thin air. Hence the rule of the game requires an organisation to spend time on automation. Automation means fast replacement, with minimal time spent on each replacement. One way to achieve this is to ensure that a minimal number of steps are performed after the instantiation of an EC2 instance. This is possible by creating packaged AMIs that are baked with the installation and configuration of the required application stack. Alternatively, using services like AWS CloudFormation and integrated tools like Chef, the entire process of bootstrapping the application can be scripted. If the time and cost of bootstrapping via scripts crosses a threshold, baked AMIs are the better choice. Baked AMIs, however, have drawbacks if frequent changes need to be made to the application installation and configuration. In my experience, a balance of baked AMIs and bootstrapping scripts provides a good alternative. Netflix, through the OSS tool Aminator, allows the baking of AMIs. A one-time effort of creating these AMIs leads to faster instantiation of EC2 resources. This can also be used with CloudFormation to fully automate infrastructure and application provisioning.

7. Netflix has provided good insight into how it leveraged SOA to accelerate its Cloud strategy. Eureka from Netflix is a good fit in the overall SOA infrastructure: it provides a Service Registry for loosely coupled services and dynamic discovery of dependencies. Services can look up other remote services via Eureka and in return get useful metadata about them. Eureka also helps short-circuit the connection between co-located (same zone/region) services: services in the same zone can talk to each other rather than to their distant counterparts located in other zones or regions. Eureka also records the overall health of individual services, allowing dynamic discovery of healthy service alternatives and thereby increasing the fault tolerance of the system as a whole. A quick sketch of the lookup side follows.
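
As an illustration, here is a hedged sketch in Python against Eureka’s REST interface; the host and the application name are hypothetical, and the response handling follows Eureka’s documented v2 JSON shape.

```python
# Looking up healthy instances of a service registered in Eureka.
import requests

EUREKA = "http://eureka.internal:8080/eureka/v2"  # hypothetical host


def healthy_instances(app_id):
    resp = requests.get(f"{EUREKA}/apps/{app_id}",
                        headers={"Accept": "application/json"},
                        timeout=5)
    resp.raise_for_status()
    instances = resp.json()["application"]["instance"]
    if isinstance(instances, dict):  # a single instance may not be a list
        instances = [instances]
    # Eureka tracks per-instance status; prefer those reporting UP.
    return [i for i in instances if i.get("status") == "UP"]


for inst in healthy_instances("RECOMMENDATION-SERVICE"):
    print(inst["hostName"], inst["port"]["$"])
```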

8. One of my favourite applications in the Netflix OSS toolset is Edda. Although I briefly touched upon the service in previous points, I still want to elaborate on the learning from this tool. Edda, as described in the last blog post, is a service that records historical data about each and every AWS resource used by the overall system. Through continuous tracking of state information about each AWS resource, it creates an index of state history, allowing one to identify the changes that have gone into a resource over time. The possibilities for this kind of tool are limitless. Not only does it create a version history for all the cloud assets / resources an organisation uses, it allows search over them, enabling queries like “what changed in this resource over the last few days?” or “when was this property set to this value?”. All this helps in resolving complicated configuration problems, and can be used to perform analytics on how a Cloud resource changes over time. The output of that analytics can then be used for better system design and more effective use of AWS resources. A sketch of such a query follows. Look here for more: https://github.com/Netflix/edda/wiki#dynamic-querying
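
To give a feel for the querying, here is a hedged sketch; the Edda host and the instance id are made up, and the matrix-style arguments (_since, _expand) follow the dynamic-querying examples on the wiki linked above.

```python
# Asking Edda: "what did this instance look like over the last 24 hours?"
import time

import requests

EDDA = "http://edda.internal:8080/api/v2"  # hypothetical host
INSTANCE = "i-0123456789abcdef0"           # hypothetical instance id

since_ms = int((time.time() - 24 * 3600) * 1000)  # Edda timestamps are in ms
url = f"{EDDA}/view/instances/{INSTANCE};_since={since_ms};_expand"

history = requests.get(url, timeout=10).json()
print(f"{len(history)} recorded states in the last 24 hours")
```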

9. I was introduced to the “Circuit Breaker pattern” by the book Release It! (http://pragprog.com/book/mnee/release-it). Michael Nygard provides a useful abstraction for containing further degradation of a system by failing fast. For instance, a service consumer calling a remote API is prone to many exceptional conditions, like timeouts and service unavailability. Having to manage every such scenario across all layers of your code is the first hurdle a developer has to clear. The second is to ensure the system does not keep rediscovering the failure on every separate invocation of the service consumer. Circuit breakers can be configured with a threshold for the number of failures tolerated in such situations. If the threshold is crossed, the circuit breaker comes into action and returns a logical error without attempting the actual call to the remote API. This allows the system to be reactive and not waste critical resources on retrying failed scenarios. A circuit breaker dashboard can trigger alerts to the operations team, letting them be aware of such scenarios and plan resolutions. The overall system meanwhile runs with degraded performance, without an actual blackout. Netflix created its own implementation of circuit breakers in the Hystrix project. Together with the Hystrix dashboard, it’s an effective tool in the arsenal of a Cloud geek. A minimal sketch of the pattern follows.
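
To make the mechanics concrete, here is a minimal circuit breaker sketched in Python. The thresholds, timings and names are illustrative; this is the bare pattern, not how Hystrix implements it.

```python
# A bare-bones circuit breaker: after enough consecutive failures it
# "opens" and fails fast, then lets a trial call through after a cooldown.
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Open: fail fast without touching the remote service.
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call

        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise

        self.failures = 0  # a success closes the circuit again
        return result


# Usage (illustrative): wrap any remote call.
#   breaker = CircuitBreaker()
#   breaker.call(requests.get, "http://orders.internal/orders/42", timeout=2)
```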
