Tuesday, June 22, 2021

How financial firms are blazing a trail to more predictive and resilient operations come what may

The last few years have certainly highlighted the need for businesses of all kinds to build up their operational resilience. With a rising tide of pandemic waves, high-level cybersecurity incidents, frequent technology failures, and a host of natural disasters -- there’s been plenty to protect against.

As businesses become more digital and dependent upon end-to-end ecosystems of connected services, the responsibility for protecting critical business processes has clearly shifted. It’s no longer just a task for IT and security managers but has become top-of-mind for line-of-business owners, too.

Stay with us now as BriefingsDirect explores new ways that those responsible for business processes specifically in the financial sector are successfully leading the path to avoiding and mitigating the impact and damage from these myriad threats.


To learn more about the latest in rapidly beefing-up operational resilience by bellwether finance companies, BriefingsDirect welcomes Steve Yon, Executive Director of the EY ServiceNow Practice, and Sean Culbert, Financial Services Principal at EY. The discussion is moderated by Dana Gardner, Principal Analyst at Interarbor Solutions.

Here are some excerpts:

Gardner: Sean, how have the risks modern digital businesses face changed over the past decade? Why are financial firms at the vanguard of identifying and heading off these pervasive risks?

Culbert: The category of financial firms forms a broad scope of types. The risks for a consumer bank, for example, are going to be different from the risks for an investment bank or a broker-dealer. But they all have some common threads. Those include the expectation to be always-on, at the edge, and able to get to your data in a reliable and secure way.


There’s also the need for integration across the ecosystem. Unlike product sets before, such as in retail brokerage or insurance, customers expect to be brought together in one cohesive services view. That includes more integration points and more application types.

This all needs to be on the edge and always-on, even as it includes, increasingly, reliance on third-party providers. They need to walk in step with the financial institutions in a way that they can ensure reliability. In certain cases, there’s a learning curve involved, and we’re still coming up that curve.

It remains a shifting set of expectations to the edge. It’s different by category, but the themes of integrated product lines -- and being able to move across those product lines and integrate with third parties -- have certainly created complexity.

Gardner: Steve, when you’re a bank or a financial institution that finds itself in the headlines for bad things, that is immediately damaging for your reputation and your brands. How are banks and other financial organizations trying to be rapid in their response in order to keep out of the headlines?

Interconnected, system-wide security

Yon: It’s not just about having the wrong headline on the front cover of American Banker. As Sean said, the taxonomy of all these services is becoming interrelated. The suppliers tend to leverage the same services.


Products and services tend to cross different firms. The complexity of the financial institution space right now is high. If something starts to falter -- because everything is interconnected -- it could have a systemic effect, which is what we saw several years ago that brought about Dodd-Frank regulations.

So having a good understanding of how to measure and get telemetry on that complex makeup is important, especially in financial institutions. It’s about trust. You need to have confidence in where your money is and how things are going. There’s a certain expectation that must happen. You must deal with that despite mounting complexity. The notion of resiliency is critical to a brand promise -- or customers are going to leave.

One, you should contain your own issues. But the Fed is going to worry about it if it becomes broad because of the nature of how these firms are tied together. It’s increasingly important -- not only from a brand perspective of maintaining trust and confidence with your clients -- but also from a systemic perspective, given what it could do to the economy if you don’t have good reads on what’s going on with support of your critical business services.

Gardner: Sean, the words operational resilience come with a regulatory overtone. But how do you define it?

The operational resilience pyramid

Culbert: We begin with the notion of a service. Resilience is measured, monitored, and managed around the availability, scalability, reliability, and security of that service. Understanding what the service is from an end-to-end perspective, how it enters and exits the institution, is the center of our universe.

Around that we have inbound threats to operational resilience. From the threat side, you want the capability to withstand a robust set of inbound threats. And for us, one of the important things that has changed in the last 10 years is the sophistication and complexity of the threats. And the prevalence of them, quite frankly.


If you look at the four major threat categories we work with -- weather, cyber, geopolitical, and pandemics -- pick any one of those and there has been a significant change in those categories. We have COVID, we have proliferation of very sophisticated cyber attacks that weren’t around 10 years ago, often due to leaks from government institutions. Geopolitically, we’re all aware of tensions, and weather events have become more prevalent. It’s a wide scope of inbound threats.

And on the outbound side, businesses need the capability to not only report on those things, but to make decisions about how to prevent them. There’s a hierarchy in operational resilience. Can you remediate it? Can you fix it? Then, once it’s been detected, how can you minimize the damage? At the top of the pyramid, can you prevent it before it hits?

So, there’s been a broad scope of threats against a broader scope of service assets that need to be managed with remediation. That was the heritage, but now it’s more about detection and prevention.

Gardner: And to be proactive and preventative, operational resilience must be inclusive across the organization. It’s not just one group of people in a back office somewhere. The responsibility has shifted to more people -- and with a different level of ownership.

What’s changed over the past decade in terms of who’s responsible and how you foster a culture of operational resiliency?

Bearing responsibility for services

Culbert: The anchor point is the service. And services are processes: It’s technology, facilities, third parties, and people. The hard-working people in each one of those silos all have their own view of the world -- but the services are owned by the business. What we’ve seen in recognition of that is that the responsibility for sustaining those services falls with the first line of business [the line of business interacting with consumers and vendors at the transaction level].

Yon: There are a couple of ways to look at it. One, as Sean was talking about, the lines of defense and the evolution of risk has been divvied up. The responsibilities have had line-of-sight ownership over certain sets of accountabilities. But you also have triangulation from others needing to inspect and audit those things as well.

The time is right for the new type of solution that we’re talking about now. One, because the nature of the world has gotten more complex. Two, the technology has caught up with those requirements.

The move within the tech stack has been to become more utility-based, service-oriented, and objectified. The capability to get signals on how everything is operating, and its status within that universe of tech, has become a lot easier. And with the technology now being able to integrate across platforms and operate at the service level -- versus at the component level -- it provides a view that would have been very hard to synthesize just a few years ago.

What we’re seeing is a big shot in the arm to the power of what a typical risk resilience compliance team can be exposed to. They can manage their responsibilities at a much greater level.

Before they would have had to develop business continuity strategies and plans to know what to do in the event of a fault or a disruption. And when those things come out, the three-ring binders, the war room gets assembled and people start to figure out what to do. They start running the playbook.


The problem with that is that while they’re running the playbook, the fault has occurred, the destruction has happened, and the clock is ticking for all those impacts. The second-order consequences of the problem are starting to amass with respect to value destruction, brand reputational destruction, as well as whatever customer impacts there might be.

But now, because of technology and moving toward Internet of things (IoT) thinking across assets, people, facilities, and third-party services, technology can self-declare their state. That data can be synthesized to say, “Okay, I can start to pick up a signal that’s telling me that a fault is inbound.” Or something looks like it’s falling out of the control thresholds that they have.

That tech now gives me the capability to get out in front of something. That would be almost unheard-of years ago. The nexus of tech, need, and complexity are all hitting right now. That means we’re moving and pivoting to a new type of solution rising out of the field.

Gardner: You know, many times we’ve seen such trends happen first in finance and then percolate out to the rest of the economy. What’s happened recently with banking supervision, regulations, and principles of operational resilience?

Financial sector leads the way

Yon: There are similar forms of pressure coming from all regulatory-intense industries. Finance is a key one, but there’s also power, utilities, oil, and gas. The trend is happening primarily first in regulatory-intensive industries.

Culbert: A couple years ago, the Bank of England and the Prudential Regulation Authority (PRA) put out a consultation paper that was probably most prescriptive out of the UK. We have the equivalent over here in the US around expectations for operational resiliency. And that just made its way into policy or law. For the most part, on a principles basis, we all share a common philosophy in terms of what’s prudent.

A lot of the major institutions, the ones we deal with, have looked at those major tenets in these policies and have said they will be practiced. And there are four fundamental areas that the institutions must focus on.

One is, can it declare and describe its critical business services? Does it have threshold parameters logic assigned to those services so that it knows how far it can go before it sustains damage across several different categories? Are the assets that support those services known and mapped? Are they in a place where we can point to them and point to the health of them? If there’s an incident, can they collaborate around the sustaining of those assets?

As I said earlier, those assets generally fall into small categories: people, facilities, third parties, and technology. And, finally, do you have the tools in place to keep those services within those tolerance parameters, and do you have other alerting systems to let you know which of the assets may well be failing you if the services are at risk?
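As a rough illustration of those four questions, here is a minimal Python sketch. It assumes a made-up data model: the class names, the `max_outage_minutes` tolerance, and the sample payments service are all hypothetical and are not taken from any regulator's text or from ServiceNow.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Asset:
    name: str
    category: str  # "people", "facilities", "third parties", or "technology"
    healthy: bool


@dataclass
class CriticalService:
    name: str
    max_outage_minutes: int  # hypothetical tolerance threshold for this service
    assets: List[Asset]      # the known, mapped assets that support the service

    def failing_assets(self) -> List[str]:
        """Names of mapped assets currently reporting as unhealthy."""
        return [a.name for a in self.assets if not a.healthy]

    def status(self, outage_minutes: int) -> str:
        """Classify the service against its tolerance and its asset health."""
        if outage_minutes > self.max_outage_minutes:
            return "breached"
        if self.failing_assets():
            return "at risk"
        return "within tolerance"


# Hypothetical example service with one unhealthy third-party asset.
payments = CriticalService(
    name="payments",
    max_outage_minutes=30,
    assets=[
        Asset("core ledger", "technology", healthy=True),
        Asset("card processor", "third parties", healthy=False),
    ],
)
```

In this sketch the service reads as "at risk" as soon as any mapped asset is unhealthy, which is the alerting behavior described above: you learn which asset is failing you before the tolerance is actually exceeded.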

That’s a lay-person, high-level description of the Bank of England policy on operational risks for today’s Financial Management Information Systems (FMIS). Thematically most of the institutions are focusing on those four areas, along with having credible and actionable testing schemes to simulate disruptions on the inbound side.

In the US, Dodd-Frank mandated that institutions declare which of those services could disrupt critical operations and, if those operations were disrupted, could they in turn disrupt the general economy. The operational resilience rules and regulations fall back on that. So, now that you know what they are, can you risk-rate them based on the priorities of the bank and its counterparties? Can you manage them correctly? That’s the letter-of-the-law-type regulation here. In Japan, it’s more credential-based regulation like the Bank of England. It all falls into those common categories.

Gardner: Now that we understand the stakes and imperatives, we also know that the speed of business has only increased. So has the speed of expectations for end consumers. The need to cut time to discovery of the problems and to find root causes also must be as fast as possible.

How should banks and other financial institutions get out in front of this? How do we help organizations move faster to their adoption, transform digitally, and be more resilient to head off problems fast?

Preventative focus increases

Yon: Once there’s clarity around the shift in the goals, knowing it’s not good enough to just be able to know what to do in the event of a fault or a potential disruption, the expectation becomes the proof to regulatory bodies and to your clients that they should trust you. You must prove that you can withstand and absorb that potential disruption without impact to anybody else downstream. Once people get their head around the nature of the expectation-shifting to being a lot more preventative versus reactive, the speeds and feeds by which they’re managing those things become a lot easier to deal with.


Back when I was running the technology at a super-regional bank, you’d get the phone call at 3 a.m. that a critical business service was down. You’d have the tech phone call that people are trying to figure out what happened because they started to notice at the help desk that a number of clients and customers were complaining. The clock had been ticking before 3 a.m. when I got the call. And so, by that time, those clients are upset.

Yet we were spending our time trying to figure out what happened and where. What’s the overall impact? Are there other second-order impacts because of the nature of the issue? Are other services disrupted as well? Again, it gets back to the complexity factor. There are interrelationships between the various components that make up any service. Those services are shared because that’s how it is. People lean on those things -- and that’s the risk you take.

Before, the lack of speed literally killed because you had to figure a lot of those things out while the clock was ticking and the impact was going on. But now, you’re allowing yourself time to figure things out. That’s what we call a decision-support system. You want to alert ahead of time to ensure that you understand the true blast area of what the potential destruction is going to be.

Secondly, can I spin up the right level of communications so that everybody who could be affected knows about it? And thirdly, can I now get the right people on the call -- versus hunting and pecking to determine who has a problem on the fly at 3 a.m.?

The value of speed is that it buys time for firms to deal with an issue intelligently, versus taking a shotgun approach and not truly understanding the nature of the impact until the next day.

Gardner: Sean, it sounds like operational resiliency is something that never stops. It’s an ongoing process. That’s what buys you the time because you’re always trying to anticipate. Is that the right way to look at it?

Culbert: It absolutely is the way to look at it. A time objective may be specific to the type of service, and obviously it’s going to be different from a consumer bank to a broker-dealer. You will have a time objective attached to a service, but is that a critical service that, if disrupted, could further disrupt critical operations that could then disrupt the real economy? That’s come into focus in the last 10 years. It has forced people to think through: If you were a broker-dealer and you couldn’t meet your hedge fund positions, or if you were a consumer bank and you couldn’t get folks their paychecks, does that put people in financial peril?

These involve very different processes and have very different outcomes. But each has a fill-in-the-blank time tolerance. So now it’s more a matter of being accountable for those times. There are two things. The first is the customer expectation that you won’t reach those tolerances and that you will meet the time objective for the customers’ needs.

And the second is that technology has made the domino, or contagion, effect of one service tipping over another one more manageable. So now it’s not just, “Is your service ready to go within its objective of half an hour?” It’s about the knock-on effect to other services as well.

So, it’s become a lot more correlated, and it’s become regional. Something that might be a critical service in one business, might not be in another -- or in one region, might not be in another. So, it’s become more of a multidimensional management problem in terms of categorically specific time objectives against specific geographies, and against the specific regulations that overhang the whole thing.
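The knock-on effect described here can be pictured as a walk over a service dependency map. The Python sketch below is only illustrative: the service names and the `DEPENDENTS` map are invented, and a real firm would derive the map from its own service inventory.

```python
from collections import deque

# Hypothetical map: each service -> the services that depend on it.
DEPENDENTS = {
    "core-ledger": ["payments", "payroll"],
    "payments": ["merchant-settlement"],
    "payroll": [],
    "merchant-settlement": [],
}


def blast_radius(failed, dependents):
    """Return every downstream service affected by `failed`, breadth-first."""
    seen = {failed}
    affected = []
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for downstream in dependents.get(current, []):
            if downstream not in seen:
                seen.add(downstream)
                affected.append(downstream)
                queue.append(downstream)
    return affected
```

Calling `blast_radius("core-ledger", DEPENDENTS)` lists the nearest dependents first, which is one simple way to decide whose time objectives are threatened soonest.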

Gardner: Steve, you mentioned earlier about taking the call at 3 a.m. It seems to me that we have a different way of looking at this now -- not just taking the call but making the call. What’s the difference between taking the call and making the call? How does that help us prepare for better operational resiliency?

Make the call, don’t take the call

Yon: It’s a fun way of looking at a day in the life of your chief resiliency officer or chief risk officer (CRO) and how it could go when something bad happens. So, you could take the call from the CEO or someone from the board as they wonder why something is failing. What are you going to do about it?

You’re caught on your heels trying to figure out what was going on, versus making the call to the CEO or the board member to let them know, “Hey, these were the potential disruptions that the firm was facing today. And this is how we weathered through it without incident and without damaging service operations or suffering service operations that would have been unacceptable.”

We like to think of it as not only trying to prevent the impact to the clients but also the possibility of a systemic problem. It could potentially increase the lifespan of a CRO by showing they can be responsible for the firm’s up-time, versus just answering questions post-disruption. It provides a little bit of levity, but it’s also true that there are consequences not just for the clients but also for the people responsible for that function within the firm.

Gardner: Many leading-edge organizations have been doing digital transformation for some time. We’re certainly in the thick of digital transformation now after the COVID requirements of doing everything digitally rather than in person.

But when it comes to finance and the services that we’re describing -- the interconnections and the interdependencies -- there are cyber resiliency requirements that cut across organizational boundaries. Having a moat around your organization, for example, is no longer enough.

What is it about the way that ServiceNow and EY are coming together that helps make operational resiliency as an ongoing process possible?

Digital transformation opens access

Yon: There are two components. You need to ask yourself, “What needs to be true for the outcome that we’re talking about to be valid?” From a supply-side, what needs to be true is, “Do I have good signal and telemetry across all the components and assets of resources that would pose a threat or a cause for a threat to happen from a down service?”


With the move to digital transformation, more assets and resources that compose any organization are now able to be accessed. That means the state of any particular asset, in terms of its preferential operating model, is going to be known. I need to have that data and that’s what digital transformation provides.

Secondly, I need a platform that has wide integration capabilities and that has workflow at its core. Can I perform business logic and conditional synthesis to interpret the signals that are coming from all these different systems?

That’s what’s great about ServiceNow -- there hasn’t been anything that it hasn’t been able to integrate with. Then it comes down to, “Okay, do I understand the nature of what it is I’m truly looking for as a business service and how it’s constructed?” Once I do that, I’m able to capture that control, if you will, determine its threshold, see that there’s a trigger, and then drive the workflows to get something done.

For a hypothetical example, we’ve had an event so that we’re losing the trading floor in city A, therefore I know that I need to bring city B and its employees online and to make them active so I can get that up and running. ServiceNow can drive that all automatically, within the Now Platform itself, or drive a human to provide the approvals or notifications to drive the workflows as part of your business continuity plan (BCP) going forward. You will know what to do by being able to detect and interpret the signals, and then based on that, act on it.

That’s what ServiceNow brings to make the solution complete. I need to know what that service construction is and what it means within the firm itself. And that’s where EY comes to the table, and I’ll ask Sean to talk about that.

Culbert: ServiceNow brings to the table what we need to scale and integrate in a logical and straightforward way. Without having workflows that are cross-silo and cross-product at scale -- and with solid integration of capabilities -- this just won’t happen.

When we start talking about the signals from everywhere against all the services -- it’s a sprawl. From an implementation perspective, it feels like it’s not implementable.

The regulatory burden requires focus on what’s most important, and why it’s most important to the market, the balance sheet, and the customers. And that’s not for the 300 services, but for the one or two dozen services that are important. Knowing that gives us a big step forward by being able to scope out the ServiceNow implementation.

And from there, we can determine what dimensions associated with that service we should be capturing on a real-time basis. To progress from remediation to detection on to prevention, we must be judicious of what signals we’re tracking. We must be correct.

We have the requirement and obligation to declare and describe what is critical using a scalable and integrable technology, which is ServiceNow. That’s the big step forward.

Yon: The Now Platform also helps us to be fast. If you look under the hood of most firms, you’ll find ServiceNow is already there. You’ll see that there’s already been work done in the risk management area. They already know the concepts and what it means to deal with policies and controls, as well as the triggers and simulations. They have IT and other assets under management, and they know what a configuration management database (CMDB) is.

These are all accelerants that not only provide scale to get something done but provide speed because so many of these assets and service components are already identified. Then it’s just a matter of associating them correctly and calibrating it to what’s really important so you don’t end up with a science fair integration project.

Gardner: What I’m still struggling to thread together is how the EY ServiceNow alliance operational resiliency solution becomes proactive as an early warning system. Explain to me how you’re able to implement this solution in such a way that you’re going to get those signals before the crisis reaches a crescendo.

Tracking and recognizing faults

Yon: Let’s first talk about EY and how it comes with an understanding from the industry of what good looks like with respect to what a critical business service needs to be. We’re able to hone down to talking about payments or trading. This maps the deconstruction of that service, which we also bring as an accelerant.

We know what it looks like -- all the different resources, assets, and procedures that make that critical service active. Then, within ServiceNow, it manages and exposes those assets. We can associate those things in the tool relatively quickly. We can identify the signal that we’re looking to calibrate on.

Then, based on what ServiceNow knows how to do, I can put a control parameter on this service or component within the threshold. It then gives me an indication whether something might be approaching a fault condition. We basically look at all the different governance, risk management, and compliance (GRC) leading indicators and put telemetry around those things when, for example, it looks like my trading volume is starting to drop off.


Long before it drops to zero, is there something going on elsewhere? It delivers up all the signals about the possible dimensions that can indicate something is not operating per its normal expected behavior. That data is then captured, synthesized, and displayed either within ServiceNow or it is automated to start running its own tests to determine what’s valid.

But at the very least, the people responsible are alerted that something looks amiss. It’s not operating within the control thresholds already set up within ServiceNow against those assets. This gives people time to then say, “Okay, am I looking at a potential problem here? Or am I just looking at a blip and it’s nothing to worry about?”
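To make the idea of a control threshold on a leading indicator concrete, here is a small Python sketch. It is an assumption-laden illustration, not ServiceNow functionality: the function name, the 80 percent and 50 percent ratios, and the sample trading volumes are all hypothetical.

```python
from statistics import mean


def check_signal(history, current, warn_ratio=0.8, fault_ratio=0.5):
    """Compare the current reading with the average of recent readings.

    The ratios are hypothetical control thresholds: at or below 80 percent
    of the baseline raises a warning, at or below 50 percent is a fault.
    """
    baseline = mean(history)
    if current <= baseline * fault_ratio:
        return "fault"
    if current <= baseline * warn_ratio:
        return "warning"
    return "normal"


# Hypothetical recent trading volumes (trades per minute).
recent_volume = [1000, 1020, 980, 1000]
```

The point of the warning band is the one made above: the alert fires while volume is merely drifting down, long before it drops to zero, so someone has time to decide whether it is a real problem or just a blip.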

Gardner: It sounds like there’s an ongoing learning process and a data-gathering process. Are we building a constant mode of learning and automation of workflows? Do we get a whole greater than the sum of the parts after a while?

Culbert: The answer is yes and yes. There’s learning and there’s automation. We bring to the table some highly effective regulatory risk models. There’s a five-pillar model that we’ve used where market and regulatory intelligence feeds risk management, surveillance, analysis, and ultimately policy enforcement.

And the way the five pillars work together within ServiceNow mirrors how they work together within the business processes of the organization. That’s where we get that intelligence feeding, risk feeding, surveillance analysis, and enforcement. That workflow is the differentiator, to allow rapid understanding of whether it’s an immediate risk or concentrating risk.

And obviously, no one is going to be 100 percent perfect, but having context and perspective on the origin of the risk helps determine whether it’s a new risk -- something that’s going to create a lot of volatility -- or whether it’s something the institution has faced before.

We rationalize that risk -- and, more importantly, rationalize the lack of a risk -- to know at the onset if it’s a false positive. It’s an essential market and regulatory intelligence mechanism. Are they feeding us only the stuff that’s really important?

Our risk models tell us that. That risk model usually takes on a couple of different flavors. One flavor is similar to a FICO score. So, have you seen the risk? Have you seen it before? It is characterizable by the words coming from it and its management in the past.

And then some models are more akin to a value-at-risk (VaR) calculator. What kind of volatility is this risk going to bring to the bank? Is it somebody that’s recreationally trying to get into the bank, or is it a state actor?

Once the false-positive gets escalated and disposed of -- if it’s, in fact, a false positive -- are we able to plug it into something robust enough to surveil for where that risk is headed? That’s the only way to get out in front of it.

The next phase of the analysis says, “Okay, who should we talk to about this? How do we communicate that this is bigger than a red box, much bigger than a red box, a real crisis-type risk? What form does that communication take? Is it a full-blown crisis management communication? Is it a standing management communication or protocol?”


And then ultimately, this goes to ServiceNow, so we take that affected function and very quickly understand the health or the resiliency of other impacted functions. We use our own proprietary model. It’s a military model used for nuclear power plants, and it helps to shift from primary states to alternative states, as well as to contingency and emergency states.

At the end, the person who oversees policy enforcement must gain the tools to understand whether they should be fixing the primary state issue or moving on from it. They must know to step aside or shift into an emergency state.

From our perspective, it is constant learning. But there are fundamental pillars that these events flow through that deliver the problem to the right person and give that person options for minimizing the risk.

Gardner: Steve, do we have any examples or use cases that illustrate how alerting the right people with the right skills at the right time is an essential part of resuming critical business services or heading off the damage?

Rule out retirement risks

Yon: Without naming names, we have a client within Europe, the Middle East and Africa (EMEA) we can look at. One of the things the pandemic brought to light is the need to know our posture for continuing to operate the way we want. Getting back to integration and integrability, where are we going to get a lot of that information for personnel from? Workday, their human resources (HR) system of record, of course.

Now, they had a critical business service owner who was going to be retiring. That sounds great. That’s wonderful to hear. But one of the valid things for this critical business service to be considered operating in its normal state is to check for an owner. Who will cut through the issues and process and lead going forward?

If there isn’t an owner identified for the service, I would be considered at risk for this service. It may not be capable of maintaining its continuity. So, here’s a simple use case where someone could be looking at a trigger from Workday that asks if this leadership person is still in the role and active.

Is there a control around identifying if they are going to become inactive within x number of months’ time? If so, get on that because the regulators will look at these processes potentially being out of control.
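The ownership control in this use case can be sketched in a few lines of Python. This is only an illustration of the logic: the `owner` record is a hypothetical stand-in for whatever the HR system of record returns, the 180-day notice window is an assumed value for "x number of months," and no real Workday or ServiceNow API is called.

```python
from datetime import date, timedelta


def owner_control(owner, today, notice_days=180):
    """Check that a critical service has an active owner who is staying.

    `owner` is a hypothetical record: None if no owner is assigned,
    otherwise a dict with "active" (bool) and an optional "end_date" (date).
    """
    if owner is None or not owner["active"]:
        return "at risk: no active owner"
    end_date = owner.get("end_date")
    if end_date is not None and end_date - today <= timedelta(days=notice_days):
        return "warning: owner leaving soon"
    return "ok"


# Hypothetical owner record with a planned retirement date.
retiring_owner = {"active": True, "end_date": date(2021, 9, 30)}
```

Run on a schedule, a check like this raises the warning while the owner is still in the role, which leaves time to name a successor before the service is considered out of control.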

There’s a simple use case that has nothing to do with technology but shows the integrability of ServiceNow into another system of record. It turns ServiceNow into a decision-support platform that drives the right actions and orchestrates timely actions -- not only to detect a disruption but anything else considered valid as a future risk. Such alerts give the time to get it taken care of before a fault happens.

Gardner: The EY ServiceNow alliance operational resilience solution is under the covers but it’s powering leaders’ ability to be out in front of problems. How does the solution enable various levels of leadership personas, even though they might not even know it’s this solution they’re reacting to?

Leadership roles evolve

Culbert: That’s a great question. For the last six to seven years, we’ve all heard about the shift from the second to the first line of primary ownership in the private sector. On many occasions I’ve heard first-line business managers say, “You know, if it is my job, first I need to know what the scope of my responsibilities is and have the tools to do my job.” And that persona of the frontline manager needs good data that’s not a false positive. It’s not eating at his or her ability to make money. It’s providing them with options of where to go to minimize the issue.

The personas are clearly evolving. It was difficult for risk managers to move solidly into the first line without these types of tools. And there were interim management levels, too: someone who sat between the first and the second line -- level 1.5 or line 1.5. And it’s clearly pushing into the first line. How do they know their own scope as it relates to the risk to the services?

Now there’s a tool that these personas can use to be not only responsible for risk but responsive as well. And that’s a big thing in terms of the solution design. With ServiceNow over the last several years, if the base data is correctly managed, then being able to reconfigure the data and recalibrate the threshold logic to accommodate a certain persona is not a coding exercise. It’s a no-code step forward to say, “Okay, this is now the new role and scope, and that role and scope will be enabled in this way.” And this power is going to direct the latest signals and options.

But it’s all about the definition of a service. Do we all agree end-to-end what it is, and the definition of the persona? Do we all understand who’s accountable and who’s responsible? Those two things are coming together with a new set of tools that are right and correct.

Yon: Just to go back to the call at 3 a.m., that was a tech call. But typically, what happens is there’s also going to be the business call. So, one of the issues we’re also solving with ServiceNow is in one system we manage the nature of information irrespective of what your persona is. You have a view of risk that can be tailored to what it is that you care about. And all the data is congruent back and forth.

It becomes a lot more efficient and accurate for firms to manage the nature of understanding on what things are when it’s not just the tech community talking. The business community wants to know what’s happening -- and what’s next? And then someone can translate in between. This is a real-time way for all those personas to become aligned around the nature of the issue with respect to their perspective.

Gardner: I really look forward to the next in our series of discussions around operational resilience because we’re going to learn more about the May announcement of this solution.

But as we close out today’s discussion, let’s look to the future. We mentioned earlier that almost any highly regulated industry will be facing similar requirements. Where does this go next?

It seems to me that the more things like machine learning (ML) and artificial intelligence (AI) analyze the many sources of data, they will make it even more powerful. What should we look for in terms of even more powerful implementations?

AI to add power to the equation

Culbert: When you set up the framework correctly, you can apply AI to the thinning out of false positives and for tagging certain events as credible risk events or not credible risk events. AI can also be used to direct these signals to the right decision makers. But instead of taking the human analyst out of the equation, AI is going to help us. You can’t do it without that framework.

Yon: When you enable these different sets of data coming in for AI, you start to say, “Okay, what do I want the picture to look like in my ability to simulate these things?” It all goes up, especially using ServiceNow.

But back to the comment on complexity and the fact that suppliers don’t just supply one client, they connect to many. As this starts to take hold in the regulated industries -- and it becomes more of an expectation for a supplier to be able to operate this way and provide these signals, integration points, telemetry, and transparency that people expect -- anybody else trying to lever into this is going to get the lift and the benefit from suppliers who realize that the nature of playing in this game just went up. Those benefits become available to a much broader landscape of industries and for those suppliers.

Gardner: When we put two and two together, we come up with a greater sum. We’re going to be able to deal rapidly with the known knowns, as well as be better prepared for the unknown unknowns. So that’s an important characteristic for a much brighter future -- even if we hit another unfortunate series of risk-filled years such as we’ve just suffered.

Listen to the podcast. Find it on iTunes. Read a full transcript or download a copy. Sponsor: ServiceNow and EY.

Friday, June 4, 2021

How API security provides a killer use case for ML and AI


While the use of machine learning (ML) and artificial intelligence (AI) for IT security may not be new, the extent to which data-driven analytics can detect and thwart nefarious activities is still in its infancy.

As we’ve recently discussed here on BriefingsDirect, an expanding universe of interdependent application programming interfaces (APIs) forms a new and complex threat vector that strikes at the heart of digital business.


How will ML and AI form the next best security solution for APIs across their dynamic and often uncharted use in myriad apps and services? Stay with us now as we answer that question by exploring how advanced big data analytics forms a powerful and comprehensive means to track, understand, and model safe APIs use.

Listen to the podcast. Find it on iTunes. Read a full transcript or download a copy.

To learn how AI makes APIs secure and more resilient across their life cycles and ecosystems, BriefingsDirect welcomes Ravi Guntur, Head of Machine Learning and Artificial Intelligence at Traceable.ai. The interview is moderated by Dana Gardner, Principal Analyst at Interarbor Solutions.

Here are some excerpts:

Gardner: Why does API security provide such a perfect use case for the strengths of ML and AI? Why do these all come together so well?

Guntur: When you look at the strengths of ML, the biggest strength is to process data at scale. And newer applications have taken a turn in the form of API-driven applications.

Large pieces of applications have been broken down into smaller pieces, and these smaller pieces are being exposed as even smaller applications in themselves. To process the information going between all these applications, to monitor what activity is going on, the scale at which you need to deal with them has gone up many fold. That’s the reason why ML algorithms form the best-suited class of algorithms to deal with the challenges we face with API-driven applications.

Gardner: Given the scale and complexity of the app security problem, what makes the older approaches to security wanting? Why don’t we just scale up what we already do with security?

More than rules needed to secure apps

Guntur: I’ll give an analogy as to why older approaches don’t work very well. Think of the older approaches as a big box with, let’s say, a single door. For attackers to get into that big box, all they must do is crack through that single door. 

Now, with the newer applications, we have broken that big box into multiple small boxes, and we have given a door to each one of those small boxes. If the attackers want to get into the application, they only have to get into one of these smaller boxes. And once they get into one of the smaller boxes, they need to take a key out of it and use that key to open another box.

By creating API-driven applications, we have exposed a much bigger attack surface. That’s number one. Number two, of course, we have made it challenging for the attackers, but the attack surface being so much bigger now needs to be dealt with in a completely different way.

The older class of applications took a rules-based system as the common approach to solve security use cases. Because they just had a single application and the application would not change that much in terms of the interfaces it exposed, you could build in rules to analyze how traffic goes in and out of that application.

Now, when we break the application into multiple pieces, and we bring in other paradigms of software development, such as DevOps and Agile development methodologies, this creates a scenario where the applications are always rapidly changing. There is no way rules can catch up with these rapidly changing applications. We need automation to understand what is happening with these applications, and we need automation to solve these problems, which rules alone cannot do. 

Gardner: We shouldn’t think of AI here as replacing old security or even humans. It’s doing something that just couldn’t be done any other way.

Guntur: Yes, absolutely. There’s no substitute for human intelligence, and there’s no substitute for the thinking capability of humans. If you go deeper into the AI-based algorithms, you realize that these algorithms are very simple in terms of how the AI is powered. They’re all based on optimization algorithms. Optimization algorithms don’t have thinking capability. They don’t have creativity, which humans have. So, there’s no way these algorithms are going to replace human intelligence.

They are going to work alongside humans to make all the mundane activities easier for humans and help humans look at the more creative and the difficult aspects of security, which these algorithms can’t do out of the box.

Gardner: And, of course, we’re also starting to see that the bad guys, the attackers, the hackers, are starting to rely on AI and ML themselves. You have to fight fire with fire. And so that’s another reason, in my thinking, to use the best combination of AI tools that you can.

Guntur: Absolutely.

Gardner: Another significant and growing security threat is bots, and the scale that threat vector takes. It seems like only automation and the best combination of human and machines can ferret out these bots.

Machines, humans combine to combat attacks

Guntur: You are right. Most of the best detection cases we see in security are a combination of humans and machines. The attackers are also starting to use automation to get into systems. We have seen such cases where the same bot comes in from geographically different locations and is trying to do the same thing in some of the customer locations.

The reason they’re coming from so many different locations is to challenge AI-based algorithms. One of the oldest schools of algorithms looks at rate anomaly, to see how quickly somebody is coming from a particular IP address. The moment you spread the IP addresses across the globe, you don’t know whether it’s different attackers or the same attacker coming from different locations. This kind of challenge has been brought by attackers using AI. The only way to challenge that is by building algorithms to counter them.
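The weakness of per-IP rate checks can be shown with a small sketch. The request tuples, the `behavior_signature` field, and the threshold are illustrative assumptions, not Traceable.ai’s implementation: counting by source IP misses a bot that spreads itself across many addresses, while counting by what the requests do still catches it.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Request = Tuple[str, str]  # (source_ip, behavior_signature), both hypothetical


def noisy_keys(requests: Iterable[Request], key_index: int,
               threshold: int) -> List[str]:
    """Return keys whose request count in this window exceeds the threshold."""
    counts = Counter(req[key_index] for req in requests)
    return sorted(key for key, n in counts.items() if n > threshold)


# One bot sends 50 identical probes, each from a different IP address.
window = [(f"10.0.0.{i}", "GET /login probe-A") for i in range(50)]
# An ordinary user makes a few requests from one address.
window += [("192.168.1.5", "GET /products browse")] * 3

by_ip = noisy_keys(window, key_index=0, threshold=10)        # rate per IP
by_behavior = noisy_keys(window, key_index=1, threshold=10)  # rate per behavior
```

Here `by_ip` comes back empty because no single address exceeds the threshold, while `by_behavior` surfaces the repeated probe.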

One thing is for sure, algorithms are not perfect. Algorithms can generate errors. Algorithms can create false positives. That’s where the human analyst comes in, to understand whether what the algorithm discovered is a true positive or a false positive. Going deeper into the output of an algorithm digs back into exactly how the algorithm figured out an attack is being launched. But some of these insights can’t be discovered by algorithms. Only humans, when they correlate different pieces of information, can find that out. So, it requires a team. Algorithms and humans work well as a team.

Gardner: What makes the way in which Traceable.ai is doing ML and AI different? How are you unique in your vision and execution for using AI for API security?

Guntur: When you look at any AI-based implementation, you will see that there are three basic components. The first is about the data itself. It’s not enough if you capture a large amount of data; it’s still not enough if you capture quality data. In most cases, you cannot guarantee data of high quality. There will always be some noise in the data. 

But more than volume and quality of data, what is more important is whether the data that you’re capturing is relevant for the particular use-case you’re trying to solve. We want to use the data that is helpful in solving security use-cases.

Traceable.ai built a platform from the ground up to cater to those security use cases. Right from the foundation, we began looking at the specific type of data required to solve modern API-based application security use cases. That’s the first challenge that we address. It’s very important, and it brings strength to the product.

Seek differences in APIs

Once you address the proper data issue, the next is about how you learn from it. What are the challenges around learning? What kind of algorithms do we use? What is the scenario when we deploy that in a customer location?

We realized that every customer is completely different and has a completely different set of APIs, too, and those APIs behave differently. The data that goes in and out is different. Even if you take two e-commerce customers, they’re doing the same thing. They’re allowing you to look at products, and they’re selling you products. But the way the applications have been built, and the API architecture -- everything is different.

We realized it’s no use to build supervised approaches. We needed to come up with an architecture where, the day we deploy at the customer location, the algorithm then self-learns. The whole concept of being able to learn on its own just by looking at data is core to the way we build security using the AI algorithms we have.

Finally, the last step is to look at how we deliver security use cases. What is the philosophy behind building a security product? We knew that rules-based systems are not going to work. The alternate system is modeled around anomaly detection. Now, anomaly detection is a very old subject, and we have used anomaly detection in various things. We have used it to understand whether machinery is going to go down, we have used it to understand whether the traffic patterns on the road are going to change, and we have used it for anomaly detection in security.

But within anomaly detection, we focused on behavioral anomalies. We realized that APIs and the people who use APIs are the two key entities in the system. We needed to model the behavior of these two groups -- and when we see any deviation from this behavior, that’s when we’re able to capture the notion of an attack.

Behavioral anomalies are important because if you look at the attacks, they’re so subtle. You just can’t easily find the difference between the normal usage of an API and abnormal usage. But very deep inside the data and very deep into how the APIs are interacting, there is a deviation in the behavior. It’s very hard for humans to figure this out. Only algorithms can tease this out and determine that the behavior is different from a known behavior.

We have addressed this at all levels of our stack: The data-capture level, and the choice of how we want to execute our AI, and the choice of how we want to deliver our security use cases. And I think that’s what makes Traceable unique and holistic. We didn’t just bolt things on, we built it from the ground up. That’s why these three pieces gel well and work well together.

Gardner: I’d like to revisit the concept you brought up about the contextual use of the algorithms and the types of algorithms being deployed. This is a moving target, with so many different use cases and company by company.

How do you keep up with that rate of change? How do you remain contextual?

Function over form delivers context

Guntur: That’s a very good question. The notion of context is abstract. But when you dig deeper into what context is and how you build context, it boils down to basically finding all factors influencing the execution of a particular API.

Let’s take an example. We have an API, and we’re looking at how this API functions. It’s just not enough to look at the input and output of the API. We need to look at something around it. We need to see who triggered that input. Where did the user come from? Was it a residential IP address that the user came in from? Was it a hosted IP address? Which geolocation is the user coming from? Did this user have past anomalies within the system?

You need to bring all these factors into the notion of context when you’re dealing with API security. Now, the context is a moving target because data is constantly changing. There comes a moment when you have fixed this context, when you say that you know where the users are coming from, and you know what the users have done in the past. There is some amount of determinism to whatever detection you’re performing on these APIs.

Let’s say an API takes in five inputs, and it gives out 10 outputs. The inputs and outputs are a constant for every user, but the values that go into the inputs vary from user to user. Your bank account is different from my bank account. The account number that goes in there is different for you than it is for me. If you build an algorithm that looks for an anomaly, you will say, “Hey, you know what? For this part of the field, I’m seeing many different bank account numbers. There is some problem with this.”

But that’s not true. It’s meant to have many variations in that account number, and that determination comes from context. Building a context engine is unique in our AI-based system. It helps us tease out false positives and helps us learn the fact that some variations are genuine.


That’s how we keep up with this constantly changing environment, where the environment is changing not just because new APIs are coming in. It’s also because new data is coming into the APIs.
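The bank-account example can be made concrete with a minimal sketch. The `(user_id, account_number)` tuples and both thresholds are assumptions for illustration only. A context-free check flags the field because it carries many distinct values. Adding one piece of context, who supplied each value, shows that the variation is normal for everyone except a single caller.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Call = Tuple[str, str]  # (user_id, account_number) seen on one API input field


def naive_flag(calls: List[Call], max_distinct: int) -> bool:
    """Context-free check: too many distinct values on the field overall."""
    return len({account for _, account in calls}) > max_distinct


def users_over_limit(calls: List[Call], max_per_user: int) -> Set[str]:
    """Context-aware check: users who touch too many distinct accounts."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for user, account in calls:
        seen[user].add(account)
    return {user for user, accounts in seen.items()
            if len(accounts) > max_per_user}


# 100 users each query their own account; one caller walks through 20 accounts.
calls = [(f"user{i}", f"acct{i}") for i in range(100)]
calls += [("mallory", f"acct{i}") for i in range(20)]

field_looks_anomalous = naive_flag(calls, max_distinct=10)
suspicious_users = users_over_limit(calls, max_per_user=3)
```

The naive check fires on perfectly legitimate traffic, which is the false positive described above. The per-user view stays quiet for the 100 ordinary users and isolates only the caller enumerating accounts.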

Gardner: Is there a way for the algorithms to learn more about what makes the context powerful to avoid false positives? Is there certain data and certain ways people use APIs that allow your model to work better?

Guntur: Yes. When we initially started, we thought of APIs as rigidly designed. We thought of an API as a small unit of execution. When developers use these APIs, they’ll all be focused on very precise execution between the APIs.

But we soon realized that developers bundle various additional features within the same API. We started seeing that they just provide a few more input options, and by triggering those extra input options you get completely different functionality from the same API.

We had to come up with algorithms that discover that a particular API can behave in multiple ways -- depending on the inputs being transmitted. It’s difficult for us to figure out whether the API is going to change and has ongoing change. But when we built our algorithms, we assumed that an API is going to have multiple manifestations, and we need to figure out which manifestation is currently being triggered by looking at the data.

We solved it differently by creating multiple personas for the same API. Although it looks like a single API, we have an internal representation of an API with multiple personas.
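One simple way to picture multiple personas for a single API is to group calls by the set of input parameters they supply. This is a hedged sketch of the idea only: the parameter names are invented, and the real system presumably uses richer signals than parameter names alone.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Mapping

Persona = FrozenSet[str]  # the set of parameter names that defines one persona


def split_into_personas(calls: List[Mapping[str, str]]) -> Dict[Persona, int]:
    """Group calls to one endpoint by which parameters were supplied."""
    personas: Dict[Persona, int] = defaultdict(int)
    for params in calls:
        personas[frozenset(params)] += 1  # iterating a mapping yields its keys
    return dict(personas)


# Hypothetical traffic to a single search endpoint.
calls = [
    {"q": "shoes"},
    {"q": "hats"},
    {"q": "shoes", "export": "csv"},  # extra option, different functionality
]
personas = split_into_personas(calls)
```

Each persona can then get its own behavioral baseline, so that an export-style call is compared with other export-style calls rather than with ordinary searches.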

Gardner: Interesting. Another thing that’s fascinating to me about the API security problem is the way hackers try not to abuse the API itself. Instead, they have subtle logic abuse attacks where they’re basically doing what the API is designed to do but using it as a tool for their nefarious activities.

How does your model help fight against these subtle logic abuse attacks?

Logic abuse detection

Guntur: When you look at the way hackers are getting into distributed applications and APIs using these attacks -- it is very subtle. We classify these attacks as business logic abuse. They are using the existing business logic, but they are abusing it. Now, figuring out abuse to business logic is a very difficult task. It involves a lot of combinatorial issues that we need to solve. When I say combinatorial issues, it’s a problem of scale in terms of the number of APIs, the number of parameters that can be passed, and the types of values that can be passed.

When we built the Traceable.ai platform, it was not enough to just look at the front-facing APIs, we call them the external APIs. It’s also important for us to go deeper into the API ecosystem.

We have two classes of APIs. One, the external facing APIs, and the other is the internal APIs. The internal APIs are not called by users sitting outside of the ecosystem. They’re called by other APIs within the system. The only way for us to identify the subtle logic attacks is to be able to follow the paths taken by those internal APIs.

If the internal APIs are reaching a resource like a database, and within the database it reaches a particular row and column, it then returns the value. Only then will you be able to figure out that there was a subtle attack. We’re able to figure this out only because of the capability to trace the data deep into the ecosystem.

If we had done everything at the API gateway, if we had done everything at external facing APIs, we would not have figured out that there was an attack launched that went deep into the system and touched a resource it should never have touched.

It’s all about how well you capture the data and how rich your data representation is to capture this kind of attack. Once you capture this, using tons of data, and especially graph-like data, you have no option but to use algorithms to process it. That’s why we started using graph-based algorithms to discover variations in behavior, discover outliers, and uncover patterns of outliers, and so on.
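A toy version of the graph idea: treat each traced request as a path of hops, learn the set of caller-to-callee edges seen in normal traffic, and flag any trace that uses an edge never seen before. The service and resource names are hypothetical, and this set-difference check stands in for the far richer graph algorithms mentioned here.

```python
from typing import Iterable, List, Set, Tuple

Edge = Tuple[str, str]
Trace = List[str]  # ordered hops: external API -> internal API -> resource


def edges_of(trace: Trace) -> Set[Edge]:
    """Caller-to-callee pairs along one traced request."""
    return set(zip(trace, trace[1:]))


def learn_baseline(traces: Iterable[Trace]) -> Set[Edge]:
    """Union of all edges observed in known-good traffic."""
    baseline: Set[Edge] = set()
    for trace in traces:
        baseline |= edges_of(trace)
    return baseline


def unseen_edges(trace: Trace, baseline: Set[Edge]) -> Set[Edge]:
    """Edges in this trace that never appeared in the baseline."""
    return edges_of(trace) - baseline


normal = [
    ["GET /orders", "orders-svc", "db.orders"],
    ["GET /profile", "user-svc", "db.users"],
]
baseline = learn_baseline(normal)

# Same external entry point as normal traffic, but it ends at another table.
suspect = ["GET /orders", "orders-svc", "db.users"]
alerts = unseen_edges(suspect, baseline)
```

At the gateway, the suspect request looks identical to a normal `GET /orders` call. Only the internal hop from `orders-svc` to `db.users` reveals that it touched a resource it had never touched before.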

Gardner: To fully tackle this problem, you need to know a lot about data integration, a lot about security and the vulnerabilities, as well as a lot about algorithms, AI, and data science. Tell me about your background. How are you able to keep these big, multiple balls in the air at once when it comes to solving this problem? There are so many different disciplines involved.

Multiple skills in data scientist toolbox

Guntur: Yes, it’s been a journey for me. When I initially started in 2005, I had just graduated from university. I used a lot of mathematical techniques to solve key problems in natural language processing (NLP) as part of my thesis. I realized that even security use cases can be modeled as a language. If you take any operating system (OS), we typically have a few system calls, right? About 200 system calls, or maybe 400 system calls. All the programs running in the operating system are using about 400 system calls in different ways to build the different applications.

It’s similar to natural languages. In natural language, you have words, and you compose the words according to a grammar to get a meaningful sentence. Something similar happens in the security world. We realized we could apply techniques from statistical NLP to the security use cases. We discovered, for example, way back then, certain Solaris login buffer overflow vulnerabilities.
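To illustrate the language analogy, system-call traces can be treated as sentences and scored with n-grams, a basic statistical NLP device. The sketch below is an assumption-laden example, not the method from Guntur’s thesis: it learns the trigrams seen in normal traces and scores a new trace by the fraction of its trigrams that were never seen.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(seq: List[str], n: int) -> List[NGram]:
    """All contiguous windows of length n over the sequence."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def train(traces: Iterable[List[str]], n: int = 3) -> Set[NGram]:
    """Collect every n-gram observed in normal system-call traces."""
    known: Set[NGram] = set()
    for trace in traces:
        known.update(ngrams(trace, n))
    return known


def anomaly_score(trace: List[str], known: Set[NGram], n: int = 3) -> float:
    """Fraction of the trace's n-grams that were never seen in training."""
    grams = ngrams(trace, n)
    if not grams:
        return 0.0  # trace shorter than n: nothing to score
    return sum(g not in known for g in grams) / len(grams)


normal_traces = [
    ["open", "read", "write", "close"],
    ["open", "read", "read", "close"],
]
known = train(normal_traces)

familiar = anomaly_score(["open", "read", "write", "close"], known)
unusual = anomaly_score(["open", "mmap", "execve", "close"], known)
```

A score near zero means the program is speaking the same “grammar” as the training traces, and a score near one means it is stringing familiar calls together in unfamiliar ways.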

That’s how the journey began. I then went through multiple jobs and worked on different use cases. I learned if you want to be a good data scientist -- or if you want to use ML effectively -- you should think of yourself as a carpenter, as somebody with a toolbox with lots of tools in it, and who knows how to use those tools very well.

But to best use those tools, you also need the experience from building various things. You need to build a chair, a table, and a house. You need to build various things using the same set of tools, and that took me further along that journey.

While I began with NLP, I soon ventured into image processing and video processing, and I applied that to security, too. It furthered the journey. And through that whole process, I realized that almost all problems can be mapped to canonical forms. You can take any complex problem and break it down into simpler problems. Almost all fields can be broken down into simple mathematical problems. And if you know how to use various mathematical concepts, you can solve a lot of different problems.

We are applying these same principles at Traceable.ai as well. Yes, it’s been a journey, and every time you look at data you come up with different challenges. The only way to overcome that is to dirty your hands and solve it. That’s the only way to learn and the only way we could build this new class of algorithms -- by taking a piece from here, a piece from there, putting it together, and building something different. 

Gardner: To your point that complex things in nature, business, and technology can be brought down to elemental mathematical understandings, once you’ve attained that with APIs, for example, applying this first to security, and rightfully so, it’s the obvious low-lying fruit.

But over time, you also gain mathematical insights and understanding of more about how microservices are used and how they could be optimized. Or even how the relationship between developers and the IT production crews might be optimized.

Is that what you’re setting the stage for here? Will that mathematical foundation be brought to a much greater and potentially productive set of problem-solving?

Something for everybody

Guntur: Yes, you’re right. If you think about it, we have embarked on that journey already. Based on what we have achieved as of today, and looking at the foundations on which we have built it, we see that we have something for everybody.

For example, we have something for the security folks as well as for the developer folks. The Traceable.ai system gives insights to developers as to what happens to their APIs when they’re in production. They need to know that. How is it all behaving? How many users are using the APIs? How are they using them? Mostly, they have no clue.

And on the other side, the security team doesn’t know exactly what the application is. They can see lots of APIs, but how are the APIs glued together to form this big application? Now, the mathematical foundation under which all these implementations are being done is based on relationships, relationships between APIs. You can call them graphs, you can call them sequences, but it’s all about relationships.

One aspect we are looking at is how do you expose these relationships. Today we have this relationship buried deep inside of our implementations, inside our platform. But how do you take it out and make it visual so that you can better understand what’s happening? What is this application? What happens to the APIs?

By looking at these visualizations, you can easily figure out if there are bottlenecks within the application, for example. Is one API constantly being hit on? If I always go through this API, but the same API is also leading me to a search engine or a products catalog page, why does this API need to go through all these various functions? Can I simplify the API? Can I break it down and make it into multiple pieces? These kinds of insights are now being made available to the developer community.

Gardner: For those listening or reading this interview, how should they prepare themselves for being better able to leverage and take advantage of what Traceable.ai is providing? How can developers, security teams, as well as the IT operators get ready?

Rapid insights result in better APIs

Guntur: The moment you deploy Traceable in your environment, the algorithms kick in and start learning about the patterns of traffic in your environment. Within a few hours -- or if your traffic has high volume, within 48 hours -- you will receive insights into the API landscape within your environment. This insight starts with how many APIs there are in your environment. That’s a fundamental problem that a lot of companies are facing today. They just don’t know how many APIs exist in their environment at any given point of time. Once you know how many APIs there are, you can figure out how many services there are. What are the different services, and which APIs belong to which services?

Traceable gives you the entire landscape within a few hours of deployment. Once you understand your landscape, the next interesting thing to see is your interfaces. You can learn how risky your APIs are. Are you exposing sensitive data? How many of the APIs are external facing? How should you best use authentication to grant access to APIs? And why do some APIs not have authentication? How are you exposing APIs without authentication?

All these questions are answered right there in the user interface. After that, you can look at whether your development team is in compliance. Do the APIs comply with the specifications in the requirements? Because usually the development teams are rapidly churning out code, they almost never maintain the API’s spec. They will have a draft spec and they will build against it, but finally, when you deploy it, the spec looks very different. But who knows it’s different? How do you know it’s different?

Traceable’s insights tell you whether your spec is compliant. You get to see that within a few hours of deployment. In addition to knowing what happened to your APIs and whether they are compliant with the spec, you start seeing various behaviors.
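At its simplest, spec drift is a set comparison between what the spec documents and what shows up in live traffic. The endpoint strings below are made up for illustration, and a real compliance check would also compare parameters, types, and authentication requirements.

```python
from typing import Dict, Set


def spec_drift(documented: Set[str], observed: Set[str]) -> Dict[str, Set[str]]:
    """Compare endpoints in the spec with endpoints seen in production traffic."""
    return {
        "undocumented": observed - documented,  # live but missing from the spec
        "unused": documented - observed,        # in the spec but never seen live
    }


# Hypothetical draft spec versus what traffic analysis discovered.
documented = {"GET /products", "POST /orders", "GET /orders/{id}"}
observed = {"GET /products", "POST /orders", "GET /internal/debug"}

drift = spec_drift(documented, observed)
```

The `undocumented` set is usually the more urgent one for a security team, since it lists endpoints that are reachable in production but that nobody wrote down.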

People think that when you have 100 APIs deployed, all users use those APIs the same way. We think all of them are using the apps the same way. But you’d be surprised to learn that users use apps in many different ways. Sometimes the APIs are accessed through computational means, sometimes they are accessed via user interfaces. There is now insight for the development team on how users are actually using the APIs, which in itself is a great insight to help build better APIs, which helps build better applications, and simplifies the application deployments.

All of these insights are available within a few hours of the Traceable.ai deployment. And I think that’s very exciting. You just deploy it and open the screen to look at all the information. It’s just fascinating to see how different companies have built their API ecosystems.

And, of course, you have the security use cases. You start seeing what’s at work. We have seen, for example, what Bingbot from Microsoft looks like. But how active is it? Is it coming from 100 different IP addresses, or is it always coming from one part of a geolocation?


You can see, for example, what search spiders’ activity looks like. What are they doing with our APIs? Why is the search engine starting to look at the APIs, which are internal and have no information? Why are they crawling these APIs? All this information is available to you within a few hours. It’s really fascinating when you just deploy and observe.

Listen to the podcast. Find it on iTunes. Read a full transcript or download a copy. Sponsor: Traceable.ai.

You may also be interested in: