Wednesday, July 3, 2013

WebSphere Test Practices. Part 1 of 3: Challenges.



This next retrospective relates to Kevin’s great post on Automation and Cloud for System Verification Test and is broken into 3 parts: 
1.  WebSphere Test Challenges
2.  WebSphere Test Transformation
3.  How this relates to DevOps and Continuous Delivery 

Similar to Rational System Verification Test, the WebSphere Development Organization found the use of Patterns and Cloud to be of great value.  Kevin’s scenario focused on automating very complicated deployments, which allowed for the execution of test scenarios that otherwise could not have been attempted.  This next story focuses on the elasticity of Cloud to enable high-volume automated test execution, or Continuous Test, as part of the development and build process. 

Overview and scope
The WebSphere organization, feature set, and code base are all relatively large.  From an organizational perspective there were over 600 developers and 200 engineers involved in test and release engineering.  The infrastructure to support this was fairly significant, with the test organization owning and maintaining around 3,000+ cores, 500+ z/OS systems on 10+ LPARs, and 45+ iSeries LPARs.  The continuous test effort now runs over 1.7 million functional test cases every day, plus more than 16 hours of continuous security variations.  25+ OS variations and at least 8 database variations are needed to run a meaningful regression suite.  The delivery process for the product was broken into a sequence of phases: design, development, functional test, system test, performance test, media test …  The point here is that delivering such a large offering required a significant amount of test effort, took a long time, and made process changes challenging. 

Challenges and objectives
What happened within WebSphere is not unique.  Over the initial 5 or so years of development the product moved very quickly, with an ever-increasing number of resources available as success was demonstrated.  This trend, however, reached a tipping point where the cost of maintaining and testing the current feature set competed with our ability to deliver on new customer requirements.   We had reached a point where, regardless of the resources we applied, testing the product took a long time. As often happens, this feeds into itself as you attempt to fit more and more ‘must have’ content into the current release.  It was time to optimize this process … let’s take a slightly closer look at the costs we were absorbing.   

The cost of a regression is exponentially proportional to the time it takes to detect that regression.  This is because it is easy to fix a regression when it is introduced: the change is fresh, does not have other changes built on top of it, and the people involved are available.  Using our waterfall-style delivery process it took us, on average, 3 months to find a regression.  This needed to come down to within a single day.    
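To make that curve concrete, here is a small, purely illustrative sketch in Python.  The base cost and the weekly growth factor are hypothetical numbers chosen for illustration, not measured data from our lab; the point is only to show how quickly a compounding cost model diverges between same-day detection and a 3-month detection lag.

# Toy model (hypothetical numbers, not measured data): assume the cost of
# fixing a regression grows by a constant factor for each week it goes
# undetected, as other changes build on top of it and context is lost.

BASE_COST_HOURS = 2.0       # assumed cost to fix a regression caught the same day
WEEKLY_GROWTH_FACTOR = 1.5  # assumed compounding factor per week undetected

def fix_cost(weeks_undetected: int) -> float:
    """Estimated effort (hours) to fix a regression found after N weeks."""
    return BASE_COST_HOURS * (WEEKLY_GROWTH_FACTOR ** weeks_undetected)

for weeks in (0, 1, 4, 12):  # same day, one week, one month, roughly 3 months
    print(f"detected after {weeks:2d} week(s): ~{fix_cost(weeks):6.1f} hours to fix")

Whatever the real growth factor turns out to be, the gap between catching a regression within a day and catching it after a quarter is what dominates the total cost.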

The time it took us to execute a functional regression of the Application Server was 6 weeks with around 70 Full Time Equivalent employees.  This process consistently bled over into other phases, and over 75% of our Integration or System Verification scenarios would be blocked by a basic functional failure at some point in time.  We had to get to the point where we could execute a functional regression of the application server with little to no human cost, and within hours, not weeks. 

We were hardware constrained.  We had a lot of machines, but try finding one to use.  Though our lab showed only 6% of our infrastructure was in use at any given point in time, it was always assigned. Teams were spending time justifying new hardware requests, overestimating what they needed, and there was a bit of hoarding going on.  We needed self-service access to infrastructure and monitoring to govern misuse. 

What else was costing us time and money?  Organizational boundaries.  We had many organizations responsible for a particular delivery: Development teams, Functional Test teams, System Persona teams, Performance teams, Hardware teams, Test Automation teams … the list goes on and on.  As code transitioned between teams there was a significant cost.  Certain teams became bottlenecks, and often one team had a different set of objectives or incentives than another, so they did not align.  Development would throw code over the wall and see it as ‘Test’s’ job to test it … Test would not gather enough information when things did not work … teams were blocked by a lack of infrastructure … 4 different automation infrastructures were custom built for specific purposes.  These boundaries existed for some good reasons, but we had reached a point where they were slowing us down too much and could not be sustained in a world where resources were shrinking, not growing. 

In the next section we will look at some of the things we did to address these problems. 

1 comment:

  1. Love the bit on "The cost of a regression is exponentially proportional to the time it takes to detect that regression." Any data you collected to support that?
