Like many retailers, Recreational Equipment, Inc. (REI) was faced with drastic and rapid change when the COVID-19 pandemic struck. REI’s marketing leaders wanted to make sure that their online e-commerce capabilities would rise to the challenge. They expected a nearly overnight 150 percent jump in REI’s purely digital business.
Fortunately, REI’s IT
leadership had already advanced their systems to heightened automation, which
allowed the Seattle-based merchandiser to turn on a dime and devote much more
of its private cloud to the new e-commerce workload demands.
The next BriefingsDirect
Voice of Innovation interview uncovers how
REI kept its digital customers and business leadership happy, even as the world
around them was suddenly shifting.
To explore what works for
making IT agile and responsive enough to re-factor a private cloud at breakneck
speed, we’re joined by Bryan Sullins,
Senior Cloud Systems Engineer at REI
in Seattle. The discussion is moderated by Dana Gardner, Principal
Analyst at Interarbor Solutions.
Here are some excerpts:
Gardner: When
the pandemic required you to hop-to, how did REI manage to have the IT
infrastructure to actually move at the true pace of business? What put you in a
position to be able to act as you did?
Digital retail demands rise
Sullins: In
addition to the pandemic stay-at-home orders a couple months ago, we also had a
large sale previously scheduled for the middle of May. It’s the largest sale of
the year, our anniversary sale.
And ramping up to that, our marketing
and sales department realized that we would have a huge uptick in online sales.
People really wanted to get outside, because people could go outside without
breaking any of the social distancing rules.
For example, bicycle sales
were up 310 percent compared to the same time last year. So in ramping up for that,
we anticipated our online presence at rei.com was
going to go up by 150 percent, but we wanted to scale up by 200 percent to be
sure. In order to do that, we had to reallocate a bunch of ESXi hosts in VMware vSphere. We
either had to stand up new ones or reallocate from other clusters and put them
into what we call our digital retail presence.
As a result of our fully
automated process, using Hewlett
Packard Enterprise (HPE) OneView, Synergy, and
Image Streamer,
we were able to reallocate 6 out of the 17 total hosts needed. We were able to
do that in 18 minutes, all at once -- and that’s single touch, that’s launching
the automation and then pulling them from one cluster and decommissioning them and
placing them all the way into the digital retail clusters.
We also had to move some from our
legacy platform. They aren’t on HPE Synergy yet, and those took an additional
three days. But those are in transition; we are moving to that fully
automated platform all around.
Gardner: That’s
amazing because just a few years ago that sort of rapid and automated transition
would have been unheard of. Even at a slow pace you weren’t guaranteed to have
the performance and operations you wanted.
If you were not able to do
this using automation – if the pandemic had hit, heaven forbid, five or seven
years ago – what would have been the outcome?
Sullins: There
were actually two outcomes from this. The first is the fairly obvious issue of not
being able to handle the online traffic on our rei.com
retail presence. It could have been that people weren’t able to put stuff into a
shopping cart, or inventory decrement, and so on. It could have been a very
broad range of things. We needed to make sure we had the infrastructure
capacity so that none of that fails under a heavy load. That was the first
part.
Gardner: Right,
and when you have people in the heat of a purchasing moment, if you’re not
there and it’s not working, they have other options. Not only would you lose
that sale, you might lose that customer, and your brand suffers as well.
Sullins: Oh,
without a doubt, without a doubt.
The other issue, of course,
would have been if we did not meet our deadline. We had just under a week to
get this accomplished. And if we had to do this without a fully automated
approach, we would have had to return to our managers and say, “Yeah, so like
we can’t do it that quickly.” But with our approach, we were able to do it all in
the time frame -- and be able to get some sleep in the interim. So it was a
win-win.
Gardner: So
digital transformation pays off after all?
Sullins:
Without a doubt.
Gardner: Before
we learn more about your journey to IT infrastructure automation, tell us about
REI, your investments in advanced automation, and why you consider yourself a
data-driven digital business?
Automation all the way
Sullins: Well,
a lot of that precedes me by quite a bit. Going back to the early 2000s, based
on what my managers tell me, there was a huge push for REI to become an IT organization that just
happens to do retail. The priority is on IT being a driving force behind
everything we do, and that is something that, at the time, REI really needed to
do. There are other competitors, which we won’t name, but you probably know who
they are. REI needed to stay ahead of that curve.
So since then there have been
constant sweeping and cyclical changes for that digital transformation. The
most recent one is the push for automating all things. So that’s the priority
we have. It’s our marching orders.
Gardner: In
addition to your company, culture, and technology, tell us about yourself, Bryan.
What is it about your background and personal development that led you to be in
a position to act so forthrightly and swiftly?
Sullins: I got
my start in IT back in 1999. I was a public school teacher before that, and
then I made the transition to doing IT training. I did IT training from 1999 to
about 2012. During those years, I got a lot of technology
certifications, because in the IT training
world you have to.
I began with what was, at the
time, called the Microsoft Certified Solutions Expert (MCSE) certification. Then
I also did the Linux Professional Institute. I really glommed on to Linux. I wanted
to set myself apart from the rest of the field back then, so I went all-in on
Linux.
And then, 2008-2009-ish, I
jumped on the VMware train and went all-in
on VMware and did the official VMware curriculum. I taught that for about three
years. Then, in 2012, I made the transition from IT training into actually
doing this for real as an engineer working at Dell. At the time, Dell had an infrastructure-as-a-service
(IaaS) healthcare cloud that was fairly large – 1,200-plus ESXi hosts. We were
also responsible for the storage and for the 90-plus storage area network (SAN)
arrays as well.
In an environment that large,
you really have to automate. I cut my teeth on automating through PowerCLI and
Ansible. Since then, about 2015, it’s
been the focus of my career. I’m not saying I’m a guru, by any means, but it’s
been a focus of my career.
Then, in 2018, REI came
calling. I jumped on that opportunity because they are a super-awesome company,
and right off the bat I got free rein: if you want to automate it, then
you automate it. And I have been doing that ever since August of 2018.
Gardner: What
helped you make the transition from training to cloud engineer?
Sullins: I
typically jump right into new technology. I don’t know if that comes from the
training or if that’s just me as a person. But one of the positives I’ve gotten
from the training world is that you learn 100 percent of the feature base
that’s available with said technology. I was able to take what I learned and
knew from VMware and then say, “Okay, well, now I am going to get the real-world
experience to back that up as well.” So it was a good transition.
Gardner: Let’s
look at how other organizations can anticipate the shift to automation. What
are some of the challenges that organizations typically face when it comes to
being agile with their infrastructure?
Manage resistance to cloud
Sullins: The
challenges that I have seen aren’t usually technical. Usually the technologies
that people use to automate things are ready at hand. Many are free. Ansible,
for example, is free. PowerCLI
is free. Jenkins is free.
So, people can start doing that
tomorrow. But the real challenge is in changing people’s mindset about a more
automated approach. I think that it’s tough to overcome. It’s what I call provisioning
by council. More traditional on-premises approaches have application owners
who want to roll out x number of virtual machines (VMs), with all their
particular specs and whatnot. And then a council of people typically looks at that
and kind of scratches their chin and says, “Okay, we approve.” But if you need to
scale up, that council approach becomes a sort of gate-keeping process.
With a more automated approach,
like we have at REI, we use a cloud management platform to automate the processes.
We use that to enable self-service VMs instead of having a roll out by council,
where some of the VMs can take days or weeks to roll out because you have a lot of
human beings touching it along the way. We have a lot of that process pre-approved,
so everybody has already said, “Okay, we are okay with the roll out. We are
okay with the way it’s done.” And then we can roll that out in 7 to 10 minutes
rather than having a ticket-based model where somebody gets to it when they can.
Self-service models are able to do that much better.
But that all takes a pretty
big shift in psychology. A lot of people are used to being the gatekeeper. It
can make them uncomfortable to change. Fortunately for me, a lot of the people
at REI are on-board with this sort of approach. But I think that resistance can
be something a lot of people run into.
Gardner: You
can’t just buy automation in a box off of a shelf. You have to deal with an accumulation
of manual processes and habits. Why is moving beyond the manual processes culture
so important?
Sullins: I
call it a private cloud because that means there is a healthy level of
competition between what’s going in the public cloud and what we do in that
data center.
The public cloud team has the capability
of “selling” their solution side-by-side with ours. When you have application
owners who are technically adept -- and pretty much all of them are at REI -- they
can be tempted to say, “Well, I don’t want to wait a week or two to get a VM. I
want to create one right now out on the public cloud.”
That’s a big challenge for us.
So what we are trying to accomplish -- and we have had success so far through
the transition – is to offer our customers a spectrum of
services. So that’s great.
The stakeholders consuming that
now gain flexibility. They can say, “Okay, yeah, I have this application. I
want to run it in the public cloud, but I can’t based on the needs for that
application. We have to run it on-premises.” And now they can do that in an
automated way. That’s a big win, and that’s what people expect now, quite
honestly.
Gardner: They
want the look and feel of a public cloud but with all the benefits of the
private cloud. It’s up to you to provide that. Let’s find out how you did.
How did you overcome the
challenges that we talked about and what are the investments that you made in
tools, platforms, and an ecosystem of players that accomplished it?
Sullins: As I
mentioned previously, a lot of our utilities are “free,” the Ansibles of the
world, PowerCLI, and whatnot. We also use Morpheus
to do self-service and the implications behind automating things on what I call
the front end, the customer face. The issue you have there is you don’t get
that control of scaling up before you provision the VM. You have to monitor and
then roll it out on the backend. So you have to monitor for usage and then
scale up on the backend, and seamlessly. The end users aren’t supposed to know
that you are scaling up. I don’t want them to know. It’s not their job to know.
I want to remain out of their way.
In order to do that, we’ve used a combination of technologies. HPE actually has a GitHub link for a lot of Ansible playbooks that plug right in. And then the underlying hardware adjacent management ecosystem platform is HPE OneView with HPE Synergy and Image Streamer. With a combination of all of those technologies we were able to accomplish that 18-minute roll-out of our various titles.
Gardner: Even
though you have an integrated platform and solutions approach, it sounds like
you have also made the leap from ushering pets through the process into herding
cattle. If you understand my metaphor, what has allowed you to move from treating
each instance as a pet to being able to herd this stuff through on an
automated basis?
From brittle pets to agile cattle
Sullins: There
is a psychological challenge with that. In the more traditional approach – and the
VMware shop listeners are going to be very well aware of this -- I may need to
have a four-node cluster with a number of CPUs, a certain amount of RAM, and so
on. And that four-node cluster is static. Yes, if I need to add a fifth down
the line I can do that, but for that four-node cluster, that’s its home,
sometimes for the entire lifecycle of that particular host.
With our approach, we treat our
ESXi hosts as cattle. The HPE OneView-Synergy-Image Streamer technology
allows us to do that in conjunction with those tools we mentioned previously, for
the end point in particular.
So rather than have a cluster,
and it’s static and it stays that way -- it might have a naming convention that
indicates what cluster it’s in and where -- in reality we have cattle-based DNS
names for ESXi hosts. At any time, the understanding throughout the
organization, or at least for the people who need to know, is that any host can
be pulled from one cluster automatically and placed into another, particularly
when it comes to resource usage on that cluster. My dream is that the robots
will do this automatically.
So if you had a cluster that
goes into the yellow, with its capacity usage based on a threshold, the robot would
interpret that and say, “Oh, well, I have another cluster over here with a host
that is underutilized. I’m going to pull it into the cluster that’s in the
yellow and then bring it back into the green again.” This would happen all
while we sleep. When we wake up in the morning, we’d say, “Oh, hey, look at
that. The robots moved that over.”
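The rebalancing behavior Sullins describes can be sketched in a few lines of Python. This is a minimal illustration, not REI's implementation: the `Cluster` class, the 75 percent "yellow" threshold, and the `used_capacity` metric are all assumptions made for the example, and a real version would call the provisioning automation instead of editing lists in memory.

```python
from dataclasses import dataclass, field

YELLOW_THRESHOLD = 0.75  # assumed utilization level at which a cluster "goes yellow"


@dataclass
class Cluster:
    name: str
    hosts: list[str] = field(default_factory=list)
    used_capacity: float = 0.0  # demand measured in "hosts' worth" of load (hypothetical metric)

    @property
    def utilization(self) -> float:
        # Fraction of the cluster's total host capacity currently in use.
        if not self.hosts:
            return 0.0 if self.used_capacity == 0 else float("inf")
        return self.used_capacity / len(self.hosts)


def rebalance(clusters, threshold=YELLOW_THRESHOLD):
    """Move hosts into clusters above the threshold; return (host, from, to) moves."""
    moves = []
    # Serve the most heavily loaded clusters first.
    for needy in sorted(clusters, key=lambda c: c.utilization, reverse=True):
        while needy.utilization > threshold:
            # A donor must stay at or below the threshold after giving up one host.
            donors = [
                c for c in clusters
                if c is not needy
                and len(c.hosts) > 1
                and c.used_capacity / (len(c.hosts) - 1) <= threshold
            ]
            if not donors:
                break  # nothing can be moved safely; leave it for a human
            donor = min(donors, key=lambda c: c.utilization)
            host = donor.hosts.pop()
            needy.hosts.append(host)
            moves.append((host, donor.name, needy.name))
    return moves
```

Calling `rebalance` on a schedule with fresh utilization figures would approximate the "while we sleep" behavior. The donor check is the important design choice here: a host is only pulled if the cluster it leaves stays in the green.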
Gardner:
Algorithmic operations. It sounds very exciting.
Automation begets more automation
Sullins: Yes,
we have the push-button automation in place for that. It’s the next level of
what that engine is that’s going to make those decisions and do all of those
things.
Gardner: And
that raises another issue. When you take the plunge into IT automation and
make your way down the Chisholm
Trail with your cattle, all of a sudden it becomes easier along the way. The
automation begets more automation. As you learn and grow, does it become more
automated along the way?
Sullins: Yes.
Just to put an exclamation point on this topic, imagine the situation we opened
the podcast with, which is, “Okay, we have to reallocate a bunch of hosts for rei.com.” If it’s
fully automated, and we have robots making those decisions, the response is
instantaneous. “Oh, hey, we want to scale up by 200 percent on rei.com.” We can say, “Okay, go ahead, roll out
your VM. The system will react accordingly. It will add physical hosts as you
see fit, and we don’t have to do anything, we have already done the work with
the automation.” Right?
But to the automation
begetting automation, which is a great way of putting it, by the way, there
are always opportunities for more automation. And on a career side note, I want
to dispel the myth that you automate your way out of a job. That is a complete
and total myth. I’m not saying it doesn’t happen, where people get laid off as
a result of automation. I’m not saying that doesn’t happen, but that’s
relatively rare because when you automate something, that automation is going
to need to be maintained because things change over time.
The other piece of that is a
lot of times you have different organizations at various states of automation.
Once you get your head above water to where it's, “Okay, we have this process
and now it's become trivial because it's been automated,” you can concentrate
on automating more things -- or on new things that need to be
automated. And whether that’s the process for only VMs, a new feature base,
monitoring, or auto-scaling -- whatever it is -- you have the capability
from day one to further automate these processes.
Gardner: What
was it specifically about the HPE OneView and Synergy that allowed you to move
past the manual processes, firefighting, and culture of gatekeeping into more
herding of cattle and being progressively automated?
Sullins: It
was two things. The Image Streamer was number one. To date, we don’t run PXE boot
infrastructure. Not that we can’t; it’s just not something that we have
traditionally done. We needed a more standard process for doing that, and Image
Streamer fit that and solved that problem.
The second piece is the
provided Ansible playbooks that HPE has to kick off the entire process. If you
are somewhat versed in how HPE does things through OneView, you have a server
profile that you can impose on a blade, and that can be fully automated through
Ansible.
And, by the way, you don’t
have to use Image Streamer to use Ansible automation. This is really more of an
HPE OneView approach, whereby you can actually use it to do automated profiles
and whatnot. But the Image Streamer is really what allows us to say, “Okay, we
build a gold image. We can apply that gold image to any frame in the cluster.”
That’s the first part of it, and the rest is configuring the other side.
Gardner: Bryan,
it sounds like the HPE
Composable Infrastructure approach works well with others. You are able to
have it your way because you like Ansible, and you have a history of certain
products and skills in your organization. Does the HPE Composable
Infrastructure fit well into an ecosystem? Is it flexible enough to integrate
with a variety of different approaches and partners?
Sullins: It
has been so far, yes. We have anticipated leveraging HPE for our bare metal
Linux infrastructure. One of the additional driving forces and big initiatives
right now is Kubernetes. We are going all-in
on Kubernetes in our private cloud, as well as in some of our worker nodes.
We eventually plan on running those as bare metal. And HPE OneView, along with
Image Streamer, is something that we can leverage for that as well. So there is
flexibility, absolutely, yes.
Coordinating containers
Gardner: It’s
interesting, you have seen the transition from having VMware and other
hypervisor sprawl to finding a way to manage and automate all of that. Do you
see the same thing playing out for containers, with the powerful endgame of
being able to automate
containers, too?
Sullins:
Right. We have been utilizing Rancher as
part of our coordination tool for our Kubernetes infrastructure and utilizing
vSphere for that. So we are using that.
As far as the containerization
approach, REI has been doing containers since before containers were a big thing. Our
containerization platform has been around since at least 2015. So REI has been
pretty cutting edge as far as that is concerned.
And now that Kubernetes has
won the orchestration wars, as it were, we are looking to standardize that for
people who want to do things online, which is to say, going back to the digital
transformation journey.
Basically, the industry has
caught up with what our super-awesome developers have done with
containerization. But we are looking to transition the heavy lifting of
maintaining a platform away from the developers. Now that we have a standard
approach with Kubernetes, they don’t have to worry so much about it. They can
just develop what they need to develop. It will be a big win for us.
Gardner: As
you look back at your automation journey, have you developed a philosophy about
automation? How should this best work in the future?
Trust as foundation of automation
Sullins:
Right. Have you read Gene
Kim’s The
Unicorn Project? Well, there is also his The
Phoenix Project. My take from that is the whole idea of trust,
of trusting other people. And I think that is big.
I see that quite a bit in
multiple organizations. For REI, we are going to work as a team and we trust
each other. So we have a pretty good culture. But I would imagine that in some
places that is still a big challenge.
And if you take a look at The
Unicorn Project, a lot of the issues have to do with trusting other human
beings. Something happened, somebody made a mistake, and it caused an outage. So
they lock it up and lock it away and say only certain people can do that. And
then if you multiply that happening multiple times -- and then different individuals
locking that down -- it leads to not being able to automate processes without
somebody approving it, right?
Gardner: I
can't imagine you would have been capable, when you had to transition your
private cloud for more online activity, if you didn’t have that trust built
into your culture.
Sullins: Yes,
and the big challenge that might still come up is the idea of trusting your end
users, too. Once you go into the realm of self-service, you come up on the typical
what-ifs. What if somebody adds a zero and they meant to only roll out 4 VMs but
they roll out 40? That’s possible. How do you create guardrails that are
seamless? If you can, then you can trust your users. You decrease the risk and
can take that leap of faith that bad things won’t happen.
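One way to picture the kind of guardrail Sullins mentions is a pre-approval check that runs before any self-service request is provisioned. The sketch below is an assumption-laden illustration: the `Guardrail` class, the per-request limit of 10, and the per-owner quota of 50 are hypothetical values, not REI policy or a feature of any particular cloud management platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrail:
    max_vms_per_request: int = 10  # assumed limit; larger requests go to a human
    max_vms_per_owner: int = 50    # assumed standing quota per application owner


def check_request(requested: int, already_owned: int,
                  rail: Guardrail = Guardrail()) -> tuple[bool, str]:
    """Return (approved, reason) for a self-service VM request."""
    if requested < 1:
        return False, "request at least one VM"
    if requested > rail.max_vms_per_request:
        # Catches the "meant 4, typed 40" mistake before anything is built.
        return False, (f"{requested} VMs exceeds the per-request limit of "
                       f"{rail.max_vms_per_request}; needs review")
    if already_owned + requested > rail.max_vms_per_owner:
        return False, "owner quota exceeded; needs review"
    return True, "pre-approved"
```

Requests inside the limits proceed with no human in the loop, which keeps the guardrail invisible to most users. Only the outliers fall back to a review step.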
Gardner: Tell
us about your wish list for what comes next. What would you like HPE to be
doing?
Small steps and teamwork rewards
Sullins: My
approach is to first automate one thing and then work out from there. You don’t
have to boil the ocean. Start with something small and work your way up.
As far as next steps, we want
auto-scaling at the physical layer and having the robots do all of that. The robots
will scale up and down our requesters while we sleep.
We will continue to do application
programming interface (API)-capable automation with anything that has a REST API.
If we can connect to that and manipulate it, we can do pretty much whatever
automation we want.
We are also containerizing all
things. So if any application can be containerized properly, containerize it if
you can.
As far as what decision-making
engine we have to do the auto-scaling on the physical layer, we haven’t really
decided upon what that is. We have some ideas but we are still looking for
that.
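The point about REST APIs generalizes well: if a system accepts JSON over HTTP, a small amount of standard-library Python is enough to drive it. The sketch below assumes a hypothetical endpoint and bearer-token authentication. Real products use their own URL layouts, headers, and authentication schemes, so treat every name here as a placeholder.

```python
import json
import urllib.request


def build_json_request(base_url, path, payload, token, method="POST"):
    """Build (but do not send) an authenticated JSON request."""
    url = base_url.rstrip("/") + "/" + path.lstrip("/")
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method=method,
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
            # Bearer tokens are an assumption; check the target API's documentation.
            "Authorization": f"Bearer {token}",
        },
    )


def send(request, timeout=30):
    """Send the request and decode a JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Separating `build_json_request` from `send` keeps the request construction testable without a network connection, which helps when the same automation is reused across several systems.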
Gardner: How
about more predictive analytics using artificial intelligence (AI) with the
data that you have emanating from your data center? Maybe AIOps?
Sullins: Well,
without a doubt. I, for one, haven’t done any sort of deep dive into that, but I
know it’s all the rage right now. I would be open to pretty much anything that
will encompass what I just talked about. If that’s HPE InfoSight, then that’s what
it is. I don’t have a lot of experience quite honestly with InfoSight as of
yet. We do have it installed in a proof of concept (POC) form, although a lot
of the priorities for that have been shifted due to COVID-19. We hope to
revisit that pretty soon, so absolutely.
Gardner: To close out, you were ahead of the curve on digital transformation. That allowed you to be very agile when it came time to react to the COVID-19 pandemic. What did that get you? Do you have any results?
Sullins: Yes,
as a matter of fact, our boss’s boss, his boss -- so three bosses up from me --
he actually sits in on our load testing. It was an all-hands-on-deck situation
during that May online sale. He said that it was the most seamless one that he
had ever seen. There were almost no issues with this one.
What I attribute that to is,
yes, we had done what we needed on the infrastructure side to make sure that we
met dynamic demands. Also, everybody worked as a team. Everybody, all the way up
the stacks, from our infrastructure contribution, to the hypervisor and
hardware layer, all the way on up to the application layer and the containers,
and all of our DevOps stuff. It was very successful. We went past our goals of
what we had thought for the sale, so it was a win-win all the way around.
Gardner: Even
though you were going through this terrible period of adjustment, that’s very
impressive.
Sullins: Yes.
Listen
to the podcast. Find it on iTunes. Read a full transcript or download a copy. Sponsor: Hewlett Packard Enterprise.