The movement of IT and systems management to the end-to-end business service value level has been a long time in coming. Yet the need has never been higher. Enterprises and on-demand application providers alike need to predict how systems will behave under a variety of conditions.
Rather than losing control to ever-increasing complexity -- and gaining less and less insight into the root causes of problematic applications and services -- operators must gain the ability to predict and prevent threats to the performance of their applications and services. Firefighting against application performance degradation in a dynamic service-oriented architecture (SOA) just won't cut it.
By adding real-time analytics to their systems management practices, IT operators can determine the normal state of how systems should be performing. Then, by measuring the characteristics of systems under many conditions over time, administrators can gain predictive insights into their entire operations, based on a business services-level of performance and demand. They can stay ahead of complexity, and therefore contain the costs of ongoing high-performance applications delivery.
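As a rough illustration of what "determining the normal state" can mean in practice, here is a minimal sketch that learns a baseline from historical samples of one metric and flags values that stray far from it. It is not Integrien's method, which the discussion does not detail; the function names, the response-time figures, and the three-standard-deviation threshold are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def learn_baseline(samples):
    """Summarize historical samples of a metric as (mean, standard deviation)."""
    return mean(samples), stdev(samples)


def is_abnormal(value, baseline, threshold=3.0):
    """Return True if value lies more than `threshold` standard deviations
    from the learned mean. The threshold of 3.0 is an illustrative assumption."""
    mu, sigma = baseline
    if sigma == 0:
        # A perfectly flat history: any different value counts as abnormal.
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical response times in milliseconds, observed under normal load.
history_ms = [102, 98, 101, 99, 100, 103, 97, 100]
baseline = learn_baseline(history_ms)

print(is_abnormal(101, baseline))  # within the normal range
print(is_abnormal(140, baseline))  # far outside the normal range
```

A real deployment would track many metrics and would let the baseline vary with time of day and business demand, rather than relying on a single fixed mean.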
I recently had a podcast discussion with Mazda Marvasti, the CTO of Integrien Corp., on managing complexity by leveraging probabilistic systems management and remediation. I learned that Integrien's Alive suite uses probabilistic analysis to predict IT systems problems before costly applications outages. Furthermore, I received some details on the next Alive 6.0 release in Q4 of this year.
Read a full transcript of the discussion. Listen to the podcast. Sponsor: Integrien Corp.
Here are some excerpts:
Can you give us some sense of the direction that the major new offerings within the Alive product set will take?
Basically, we have three pillars that the product is based on. First is usability. That's a particular pet peeve of mine. I didn't find any of the applications out there very usable. We have spent a lot of time working with customers and working with different operations groups. ... The second piece is interoperability. The majority of the organizations that we go to already have a whole bunch of systems, whether it be data collection systems, event management systems, or configuration management databases, etc.
Our product absolutely needs to leverage those investments -- and they are leveragable. But even those investments in their silos don’t produce as much benefit to the customer as a product like ours going in there and utilizing all of that data that they have in there, and bringing out the information that’s locked within it.
The third piece is analytics. What we have in the product coming out is scalability to 100,000 servers. We've kind of gone wild on the scalability side, because we are designing for the future. Nobody that I know of right now has that kind of a scale, except maybe Google, but theirs is basically the same thing replicated thousands of times over, which is different from the enterprises we deal with, like banks or health-care organizations.
A single four-processor Xeon box, with Alive installed on it, can run real-time analytics for up to 100,000 devices. That’s the level of scale we're talking about. In terms of analytics, we've got three new pieces coming out, and basically every event we send out is a predictive event. It’s going to tell you this event occurred, and then this other set of events has a certain probability of occurring within a certain timeframe.
Not only that, but then we can match it to what we call our "fingerprinting." Our fingerprinting is a pattern-matching technology that allows us to look at patterns of events and formulate a particular problem. It indicates particular problems, and those become the predictive alerts to other problems.
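To make the idea of a predictive event concrete, here is a small sketch that estimates, from an event history, how often one event is followed by another within a fixed time window. It is only an assumption about how such probabilities could be derived, not a description of Alive's analytics; the event names, timestamps, and the 60-second window are hypothetical.

```python
from collections import defaultdict


def follow_on_probabilities(events, window):
    """Estimate how often event `a` is followed by event `b` within `window`.

    events: list of (timestamp_seconds, event_name), sorted by timestamp.
    Returns {(a, b): fraction of occurrences of a that were followed by b
    within `window` seconds}.
    """
    trigger_counts = defaultdict(int)
    follow_counts = defaultdict(int)
    for i, (t_a, a) in enumerate(events):
        trigger_counts[a] += 1
        seen = set()  # count each follow-on event once per trigger
        for t_b, b in events[i + 1:]:
            if t_b - t_a > window:
                break  # events are sorted, so later ones are out of range too
            if b != a and b not in seen:
                seen.add(b)
                follow_counts[(a, b)] += 1
    return {pair: n / trigger_counts[pair[0]] for pair, n in follow_counts.items()}


# Hypothetical event history: (seconds since start, event name).
history = [
    (0, "db_latency_high"), (40, "queue_backlog"),
    (300, "db_latency_high"), (330, "queue_backlog"), (350, "checkout_timeout"),
    (900, "db_latency_high"),
    (1500, "db_latency_high"), (1520, "queue_backlog"),
]

probs = follow_on_probabilities(history, window=60)
for (a, b), p in sorted(probs.items()):
    print(f"after {a}: {b} within 60s with probability {p:.2f}")
```

In this made-up history, three of the four "db_latency_high" events are followed by "queue_backlog" within a minute, so an alert on the first could carry a 0.75 estimate for the second. A production system would need far more history and a way to discount stale patterns as the environment changes.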
Now, with SOA and virtualization moving into application-development and data-center automation, there is a tremendous amount of complexity in the operations arena. You can’t have the people who used to have the "tribal knowledge" in their head determining where the problems are coming from or what the issues are.
The problems and the complexity have gone beyond the capability of people just sitting there in front of screens of data, trying to make sense out of it. So, as we gained efficiency from application development, we need consistency of performance and availability, but all of this added to the complexity of managing the data center.
That’s how the evolution of the data center went from being totally deterministic, meaning that you knew every variable, could measure it, and had very specific rules telling you if certain things happened, and what they were and what they meant -- all the way to a non-deterministic era, which we are in right now.
Now, you can't possibly know all the variables, and the rules that you come up with today may be invalid tomorrow, all just because of change that has gone on in your environment. So, you cannot use the same techniques that you used 10 or 15 years ago to manage your operations today. Yet that’s what the current tools are doing. They are just more of the same, and that’s not meeting the requirements of the operations center anymore.
I’ve been working on these types of problems for the past 18 years. Since graduate school, I’ve been analyzing data extraction of information from disparate data. I went to work for Ford and General Motors -- really large environments. Back then, it was client-servers and how those environments were being managed. I could see the impending complexity, because I saw the level of pressure that there was on application developers to develop more reusable code and to develop faster with higher quality.
The run book is missing that information. The run book only has the information on how to clean it up after an accident happens.
That’s the missing piece in the operations arena. Part of the challenge for our company is getting the operations folks to start thinking in a different fashion. You can do it a little at a time. It doesn’t have to be a complete shift in one fell swoop, but it does require that change in mentality. Now that I am actually forewarned about something, how do I prevent it, as opposed to cleaning up after it happens?