{"id":16797,"date":"2022-03-07T20:56:56","date_gmt":"2022-03-07T15:26:56","guid":{"rendered":"https:\/\/coforge.site\/cigniti\/blog\/?p=16797"},"modified":"2024-06-07T14:45:59","modified_gmt":"2024-06-07T09:15:59","slug":"guide-chaos-engineering","status":"publish","type":"post","link":"https:\/\/coforge.site\/cigniti\/blog\/guide-chaos-engineering\/","title":{"rendered":"A Practical Guide to Chaos Engineering"},"content":{"rendered":"<p>Modern systems are built on a large scale and operated in a distributed manner. With scale comes complexity, and there are so many ways these large-scale distributed systems can fail. Modern systems built on cloud technologies and microservices architecture have a lot of dependencies on the internet, infrastructure, and services that you do not have control over. Cloud infrastructure can fail for many reasons.<\/p>\n<table data-tablestyle=\"MsoTableGrid\" data-tablelook=\"1184\" aria-rowcount=\"3\">\n<tbody>\n<tr aria-rowindex=\"1\">\n<td data-celllook=\"4369\">\n<ul>\n<li data-leveltext=\"\uf0b7\" data-font=\"Symbol\" data-listid=\"14\" aria-setsize=\"-1\" data-aria-posinset=\"1\" data-aria-level=\"1\"><span data-contrast=\"auto\">Power Outages<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/li>\n<\/ul>\n<\/td>\n<td data-celllook=\"4369\">\n<ul>\n<li data-leveltext=\"\uf0b7\" data-font=\"Symbol\" data-listid=\"14\" aria-setsize=\"-1\" data-aria-posinset=\"1\" data-aria-level=\"1\"><span data-contrast=\"auto\">Unexpected surge in user traffic<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/li>\n<\/ul>\n<\/td>\n<\/tr>\n<tr aria-rowindex=\"2\">\n<td data-celllook=\"4369\">\n<ul>\n<li data-leveltext=\"\uf0b7\" data-font=\"Symbol\" data-listid=\"14\" aria-setsize=\"-1\" data-aria-posinset=\"1\" data-aria-level=\"1\"><span data-contrast=\"auto\">Natural Disasters<\/span><span 
data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/li>\n<\/ul>\n<\/td>\n<td data-celllook=\"4369\">\n<ul>\n<li data-leveltext=\"\uf0b7\" data-font=\"Symbol\" data-listid=\"14\" aria-setsize=\"-1\" data-aria-posinset=\"1\" data-aria-level=\"1\"><span data-contrast=\"auto\">Cyber-attacks like <\/span><i><span data-contrast=\"auto\">DDoS<\/span><\/i><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/li>\n<\/ul>\n<\/td>\n<\/tr>\n<tr aria-rowindex=\"3\">\n<td data-celllook=\"4369\">\n<ul>\n<li data-leveltext=\"\uf0b7\" data-font=\"Symbol\" data-listid=\"14\" aria-setsize=\"-1\" data-aria-posinset=\"1\" data-aria-level=\"1\"><span data-contrast=\"auto\">Hardware Complications<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/li>\n<\/ul>\n<\/td>\n<td data-celllook=\"4369\">\n<ul>\n<li data-leveltext=\"\uf0b7\" data-font=\"Symbol\" data-listid=\"14\" aria-setsize=\"-1\" data-aria-posinset=\"1\" data-aria-level=\"1\"><span data-contrast=\"auto\">Exhausted Resource\u00a0<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559740&quot;:259}\">\u00a0<\/span><\/li>\n<\/ul>\n<p><i><span data-contrast=\"auto\">(low memory, high CPU, low bandwidth etc)<\/span><\/i><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559685&quot;:720,&quot;335559740&quot;:259}\">\u00a0<\/span><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>We cannot control or avoid failures in distributed systems. However, we can control the impact radius of the failure and optimize the time to recover and restore the systems. 
This can be achieved only by exercising as many failures as possible in the test lab, thus achieving confidence in the system\u2019s resilience.<\/p>\n<h2>Why Chaos Engineering?<\/h2>\n<p>Chaos Engineering is the discipline of experimenting with distributed systems to build confidence in the system\u2019s capability to withstand turbulent conditions in production. Chaos Testing is the deliberate injection of faults or failures into your infrastructure in a controlled manner to test the system\u2019s ability to respond during a failure. This is an effective method to practice, prepare, and prevent or minimize downtime and outages before they occur.<\/p>\n<p>Chaos <a href=\"https:\/\/coforge.site\/cigniti\/blog\/5-effective-ways-to-get-more-out-of-medical-device-testing\/\">testing is one of the effective ways<\/a> to validate a system\u2019s resilience by running failure experiments or fault injections.<\/p>\n<h2>What is an Experiment?<\/h2>\n<p>An experiment is a planned fault injection in a controlled manner. Experiments vary based on the architecture of the systems under test. 
However, in a distributed system and microservices architecture deployed on the cloud, below are the most common fault injections that must be exercised.<\/p>\n<ul>\n<li><strong>Shutdown<\/strong> the compute engines randomly in an availability zone (or data center)<\/li>\n<li><strong>Outage<\/strong> of an entire region or availability zone.<\/li>\n<li><strong>Resource<\/strong> exhaustion: High CPU, Low Memory, Heavy Disk Usage<\/li>\n<li><strong>Data<\/strong> Service Failure &#8211; Partially deleting a stream of records\/messages across multiple instances to recreate a database-dependent issue.<\/li>\n<li><strong>Network<\/strong> &#8211; Inject latency between services for a select percentage of traffic over a predetermined period.<\/li>\n<li><strong>Code insertion<\/strong>: Adding instructions to the target program and allowing fault injection to occur prior to certain instructions.<\/li>\n<\/ul>\n<p><strong>Principles of Chaos Testing<\/strong><\/p>\n<p><img decoding=\"async\" class=\"alignnone wp-image-16800 size-medium\" src=\"https:\/\/coforge.site\/cigniti\/blog\/wp-content\/uploads\/principles-of-chaos-testing-555x80.png\" alt=\"Principles of Chaos Testing\" width=\"555\" height=\"80\" srcset=\"https:\/\/coforge.site\/cigniti\/blog\/wp-content\/uploads\/principles-of-chaos-testing-555x80.png 555w, https:\/\/coforge.site\/cigniti\/blog\/wp-content\/uploads\/principles-of-chaos-testing-768x111.png 768w, https:\/\/coforge.site\/cigniti\/blog\/wp-content\/uploads\/principles-of-chaos-testing-833x120.png 833w, https:\/\/coforge.site\/cigniti\/blog\/wp-content\/uploads\/principles-of-chaos-testing-600x87.png 600w, https:\/\/coforge.site\/cigniti\/blog\/wp-content\/uploads\/principles-of-chaos-testing.png 1413w\" sizes=\"(max-width: 555px) 100vw, 555px\" \/><\/p>\n<ol>\n<li><strong>Define the system\u2019s normal behavior:<\/strong> The steady state can be defined as some measurable output like overall throughput, error rates, or the latency of a system that 
indicates normal behavior. Normal behavior here means behavior that is both acceptable and expected, and this normal state of the system should be treated as the steady state.<\/li>\n<li><strong>Hypothesize about the steady state:<\/strong> The hypothesis states the expected outcome of the experiment. It should be in line with the objective of Chaos Engineering: \u201c<em>the events injected into the system will not result in a change from the steady state of the target system.<\/em>\u201d<\/li>\n<li><strong>Design and run experiments:<\/strong> Identify all the possible failure scenarios in the infrastructure, design failure experiments, run them in a controlled manner, and ensure there is a back-out plan for every failure experiment. If a back-out plan is unknown, identify the path to <a href=\"https:\/\/coforge.site\/cigniti\/blog\/software-qa-for-ehr-electronic-health-record-systems\/\">systems recovery and record<\/a> the procedures during the recovery.<\/li>\n<li><strong>Analyze test results:<\/strong> Verify whether the hypothesis was correct or whether there was a change to the system\u2019s expected steady-state behavior. 
Identify whether there was any <a href=\"https:\/\/coforge.site\/cigniti\/blog\/user-experience-experience-engineering-business-impact\/\">impact on service continuity or user experience<\/a>, and whether the service is resilient to the failures injected.<\/li>\n<\/ol>\n<p><strong>Test tools comparison<\/strong><\/p>\n<p><strong><img decoding=\"async\" class=\"alignnone wp-image-16801 size-medium\" src=\"https:\/\/coforge.site\/cigniti\/blog\/wp-content\/uploads\/Test-tools-comparison-530x300.png\" alt=\"Test tools comparison\" width=\"530\" height=\"300\" srcset=\"https:\/\/coforge.site\/cigniti\/blog\/wp-content\/uploads\/Test-tools-comparison-530x300.png 530w, https:\/\/coforge.site\/cigniti\/blog\/wp-content\/uploads\/Test-tools-comparison-768x434.png 768w, https:\/\/coforge.site\/cigniti\/blog\/wp-content\/uploads\/Test-tools-comparison-774x438.png 774w, https:\/\/coforge.site\/cigniti\/blog\/wp-content\/uploads\/Test-tools-comparison-600x339.png 600w, https:\/\/coforge.site\/cigniti\/blog\/wp-content\/uploads\/Test-tools-comparison.png 1987w\" sizes=\"(max-width: 530px) 100vw, 530px\" \/><\/strong><\/p>\n<h2>Best Practices<\/h2>\n<p>Smaller blast radius<strong>:<\/strong> Begin with small experiments to surface the unknowns and learn about them. Scale out the experiments only once you have gained confidence. Start with a single compute engine, a container, or a microservice to reduce the potential side effects.<\/p>\n<p><a href=\"https:\/\/coforge.site\/cigniti\/blog\/types-of-performance-testing\/\">Test tool selection<strong>:<\/strong><strong> Perform<\/strong><\/a><strong> a study of the test tools available<\/strong>. Compare the available features against the time and effort required to build your own tools. We recommend not picking tools that perform random experiments, because the outcome becomes difficult to measure. 
Use test tools that perform thoughtful, planned, controlled, safe, and secure experiments.<\/p>\n<p>Exercise first in lower environments<strong>:<\/strong> To gain confidence in the tests, start with a staging or development environment. Once the <a href=\"https:\/\/coforge.site\/cigniti\/blog\/manage-your-test-environment-better-using-service-virtualization\/\">tests in these environments<\/a> are complete, move up to production.<\/p>\n<p>Roll Back &amp; Abort planning<strong>:<\/strong> <a href=\"https:\/\/coforge.site\/cigniti\/blog\/regression-testing-strategy-for-business-growth\/\">Ensure effective<\/a> planning is in place to abort any experiment immediately and revert the system or service to its normal state. If an experiment causes a severe outage, track it carefully and analyze it to avoid it happening again. If these plans are missing or cannot be run, perform a root cause analysis to learn more about the outage.<\/p>\n<p><strong>Path to achieve maturity of Chaos Testing:<\/strong><\/p>\n<p><strong><img decoding=\"async\" class=\"alignnone wp-image-16802 size-medium\" src=\"https:\/\/coforge.site\/cigniti\/blog\/wp-content\/uploads\/Path-to-achieve-maturity-of-Chaos-Testing-555x261.png\" alt=\"maturity of Chaos Testing\" width=\"555\" height=\"261\" srcset=\"https:\/\/coforge.site\/cigniti\/blog\/wp-content\/uploads\/Path-to-achieve-maturity-of-Chaos-Testing-555x261.png 555w, https:\/\/coforge.site\/cigniti\/blog\/wp-content\/uploads\/Path-to-achieve-maturity-of-Chaos-Testing-768x361.png 768w, https:\/\/coforge.site\/cigniti\/blog\/wp-content\/uploads\/Path-to-achieve-maturity-of-Chaos-Testing-833x391.png 833w, https:\/\/coforge.site\/cigniti\/blog\/wp-content\/uploads\/Path-to-achieve-maturity-of-Chaos-Testing-600x282.png 600w, https:\/\/coforge.site\/cigniti\/blog\/wp-content\/uploads\/Path-to-achieve-maturity-of-Chaos-Testing.png 2007w\" sizes=\"(max-width: 555px) 100vw, 555px\" \/><\/strong><\/p>\n<h2>Conclusion<\/h2>\n<p>No 
system is safe from failure or outage. Cloud infrastructure platforms should not be trusted blindly. Every major cloud infrastructure provider reported at least one outage in each quarter. You cannot control the failures or outages. You can only control the impact on your customers, employees, partners, and reputation by exercising failures as <a href=\"https:\/\/coforge.site\/cigniti\/blog\/need-hour-security-testing-test-often-test-right\/\">often as possible in the test<\/a> lab, thus identifying the path to your systems&#8217; recovery.<\/p>\n<p>Enterprises building distributed systems must exercise Chaos Engineering as part of their resilience strategy. Running chaos tests continuously is one of several things that you can do to improve the resiliency of your applications and infrastructure.<\/p>\n<p>Cigniti has built a dedicated <a href=\"https:\/\/coforge.site\/cigniti\/blog\/10-best-reasons-invest-performance-testing\/\">Performance Testing<\/a> CoE that provides solutions around performance testing &amp; engineering for our global clients. We focus on <a href=\"https:\/\/coforge.site\/cigniti\/blog\/choosing-the-right-performance-test-tools-an-indepth-analysis\/\">performing in-depth analysis<\/a> at the component level, dynamic profiling, capacity evaluation, testing, and reporting to help isolate bottlenecks and provide appropriate recommendations.<\/p>\n<p>Schedule a discussion with our <a href=\"https:\/\/coforge.site\/cigniti\/blog\/resilience-rhythms-embracing-chaos-engineering\/\">Chaos Engineering<\/a> and Testing experts to learn more about Chaos Engineering and testing tools for cloud deployment.<\/p>\n<p>Join us for a Fireside Chat on October 12th, 2023, where we&#8217;ll be accompanied by Northern Trust and Gremlin to discuss the art of <a href=\"https:\/\/coforge.site\/cigniti\/blog\/building-resilient-digital-systems-chaos-engineering\/\">Building Resilient Digital Systems Through Chaos Engineering<\/a>. 
Embrace orchestrated chaos, foster resilience, and be part of our insightful <a href=\"https:\/\/www.cigniti.com\/webinars\/building-resilient-digital-systems-through-chaos-engineering\/?cust_param_01=home-page-bar\" class=\"broken_link\" target=\"_blank\" rel=\"noopener\">Chaos Engineering Fireside Chat dialogue<\/a>.<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Modern systems are built on a large scale and operated in a distributed manner. With scale comes complexity, and there are so many ways these large-scale distributed systems can fail. Modern systems built on cloud technologies and microservices architecture have a lot of dependencies on the internet, infrastructure, and services that you do not have [&hellip;]<\/p>\n","protected":false},"author":20,"featured_media":16812,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[4043],"tags":[3265,3264,4042,1297,305,498,1481],"ppma_author":[4041],"class_list":["post-16797","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-chaos-engineering","tag-chaos-engineering","tag-chaos-testing","tag-chaos-testing-tools","tag-network-penetration-testing","tag-penetration-testing","tag-security-testing","tag-security-testing-services"],"authors":[{"term_id":4041,"user_id":0,"is_guest":1,"slug":"jitendra-nath-lella","display_name":"Jitendra Nath Lella","avatar_url":{"url":"https:\/\/coforge.site\/cigniti\/blog\/wp-content\/uploads\/Jitendra-Nath-Lella.jpg","url2x":"https:\/\/coforge.site\/cigniti\/blog\/wp-content\/uploads\/Jitendra-Nath-Lella.jpg"},"author_category":"","user_url":"","last_name":"","first_name":"","job_title":"","description":"Jitendra Nath Lella is a Senior Architect at Cigniti Technologies and a certified Chaos Engineering practitioner. He has practiced non-functional testing for over 17 years. 
He specializes in building &amp; implementing test strategies for organizations that build or migrate data centres to the cloud. He also has expertise in simulating heavy user load tests of more than 200K users."}],"_links":{"self":[{"href":"https:\/\/coforge.site\/cigniti\/blog\/wp-json\/wp\/v2\/posts\/16797","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/coforge.site\/cigniti\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/coforge.site\/cigniti\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/coforge.site\/cigniti\/blog\/wp-json\/wp\/v2\/users\/20"}],"replies":[{"embeddable":true,"href":"https:\/\/coforge.site\/cigniti\/blog\/wp-json\/wp\/v2\/comments?post=16797"}],"version-history":[{"count":0,"href":"https:\/\/coforge.site\/cigniti\/blog\/wp-json\/wp\/v2\/posts\/16797\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/coforge.site\/cigniti\/blog\/wp-json\/wp\/v2\/media\/16812"}],"wp:attachment":[{"href":"https:\/\/coforge.site\/cigniti\/blog\/wp-json\/wp\/v2\/media?parent=16797"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/coforge.site\/cigniti\/blog\/wp-json\/wp\/v2\/categories?post=16797"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/coforge.site\/cigniti\/blog\/wp-json\/wp\/v2\/tags?post=16797"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/coforge.site\/cigniti\/blog\/wp-json\/wp\/v2\/ppma_author?post=16797"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}