Tuesday, August 15, 2017

DreamWorks Animation crafts its next era of dynamic IT infrastructure

The next BriefingsDirect Voice of the Customer thought leader interview examines how DreamWorks Animation is building a multipurpose, all-inclusive, and agile data center capability.

Learn here why a new era of responsive and dynamic IT infrastructure is demanded, and how one high-performance digital manufacturing leader aims to get there sooner rather than later. 

Here to describe how an entertainment industry innovator leads the charge for bleeding-edge IT-as-a-service capabilities is Jeff Wike, CTO of DreamWorks Animation in Glendale, California. The discussion is moderated by Dana Gardner, Principal Analyst at Interarbor Solutions.

Here are some excerpts:

Gardner: Tell us why the older way of doing IT infrastructure and hosting apps and data just doesn't cut it anymore. What has made that run out of gas?

Wike: You have to continue to improve things. We are in a world where technology is advancing at an unbelievable pace. The amount of data, the capability of the hardware, the intelligence of the infrastructure are coming. In order for any business to stay ahead of the curve -- to really drive value into the business -- it has to continue to innovate.

Gardner: IT has become more pervasive in what we do. I have heard you all refer to yourselves as digital manufacturing. Are the demands of your industry also a factor in making it difficult for IT to keep up?

Wike: When I say we are a digital manufacturer, it’s because we are a place that manufactures content, whether it's animated films or TV shows; that content is all made on the computer. An artist sits in front of a workstation or a monitor, and is basically building these digital assets that we put through simulations and rendering so in the end it comes together to produce a movie.

That's all about manufacturing, and we actually have a pipeline, but it's really like an assembly line. I was looking at a slide today about Henry Ford coming up with the first assembly line; it's exactly what we are doing, except instead of adding a car part, we are adding a character, we’re adding a hair to a character, we’re adding clothes, we’re adding an environment, and we’re putting things into that environment.

We are manufacturing that image, that story, in a linear way, but also in an iterative way. We are constantly adding more details as we embark on that process of three to four years to create one animated film.

Gardner: Well, it also seems that we are now taking that analogy of the manufacturing assembly line to a higher plane, because you want to have an assembly line that doesn't just make cars -- it can make cars and trains and submarines and helicopters, but you don't have to change the assembly line, you have to adjust and you have to utilize it properly.

So it seems to me that we are at perhaps a cusp in IT where the agility of the infrastructure and its responsiveness to your workloads and demands is better than ever.

Greater creativity, increased efficiency

Wike: That's true. If you think about this animation process or any digital manufacturing process, one issue that you have to account for is legacy workflows, legacy software, and legacy data formats -- all these things are inhibitors to innovation. There are a lot of tools. We actually write our own software, and we’re very involved in projects related to computer science at the studio.

We’ll ask ourselves, “How do you innovate? How can you change your environment to be able to move forward and innovate and still carry around some of those legacy systems?”

And one of the things we’ve done over the past couple of years is start to re-architect all of our software tools in order to take advantage of massive multi-core processing to try to give artists interactivity into their creative process. It’s about iterations. How many things can I show a director, how quickly can I create the scene to get it approved so that I can hand it off to the next person, because there are two things that you get out of that.

One, you can explore more and you can add more creativity. Two, you can drive efficiency, because it's all about how much time, how many people are working on a particular project and how long does it take, all of which drives up the costs. So you now have these choices where you can add more creativity or -- because of the compute infrastructure -- you can drive efficiency into the operation.

So where does the infrastructure fit into that, because we talk about tools and the ability to make those tools quicker, faster, more real-time? We conducted a project where we tried to create a middleware layer between running applications and the hardware, so that we can start to do data abstraction. We can get more mobile as to where the data is, where the processing is, and what the systems underneath it all are. Until we could separate the applications through that layer, we weren’t really able to do anything down at the core.

Core flexibility, fast

Now that we have done that, we are attacking the core. When we look at our ability to replace that with new compute, and add the new templates with all the security in it -- we want that in our infrastructure. We want to be able to change how we are using that infrastructure -- examine usage patterns, the workflows -- and be able to optimize.

Before, if we wanted to do a new project, we’d say, “Well, we know that this project takes x amount of infrastructure. So if we want to add a project, we need 2x,” and that makes a lot of sense. So we would build to peak. If at some point in the last six months of a show, we are going to need 30,000 cores to be able to finish it in six months, we say, “Well, we better have 30,000 cores available, even though there might be times when we are only using 12,000 cores.” So we were buying to peak, and that’s wasteful.
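To make the cost of building to peak concrete, here is a minimal Python sketch. The 30,000-core and 12,000-core figures come from the interview; the 12-month demand profile and the function name `idle_fraction` are assumptions made purely for illustration.

```python
def idle_fraction(demand_by_month, provisioned_cores):
    """Return the share of provisioned core-months that go unused."""
    if provisioned_cores <= 0 or not demand_by_month:
        raise ValueError("need a positive core count and at least one month")
    total = provisioned_cores * len(demand_by_month)
    # Demand above what is provisioned cannot be served, so cap it.
    used = sum(min(d, provisioned_cores) for d in demand_by_month)
    return (total - used) / total


# Hypothetical profile (an assumption, not studio data): 12,000 cores for
# six months, then a 30,000-core crunch for the final six months of a show.
demand = [12_000] * 6 + [30_000] * 6
peak = max(demand)
print(f"cores bought when building to peak: {peak}")
print(f"idle share of that capacity: {idle_fraction(demand, peak):.0%}")
```

Under this made-up profile, 30 percent of the purchased capacity sits idle. Those are the valleys that a second, differently shaped project could fill.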

What we wanted was to be able to take advantage of those valleys, if you will, as an opportunity -- the opportunity to do other types of projects. But because our infrastructure was so homogeneous, we really didn't have the ability to do a different type of project. We could create another movie if it was very much the same as a previous film from an infrastructure-usage standpoint.

By now having composable, or software-defined infrastructure, and being able to understand what the requirements are for those particular projects, we can recompose our infrastructure -- parts of it or all of it -- and we can vary that. We can horizontally scale and redefine it to get maximum use of our infrastructure -- and do it quickly.

Gardner: It sounds like you have an assembly line that’s very agile, able to do different things without ripping and replacing the whole thing. It also sounds like you gain infrastructure agility to allow your business leaders to make decisions such as bringing in new types of businesses. And in IT, you will be responsive, able to put in the apps, manage those peaks and troughs.

Does having that agility not only give you the ability to make more and better movies with higher utilization, but also give perhaps more wings to your leaders to go and find the right business models for the future?

Wike: That’s absolutely true. We certainly don't want to ever have a reason to turn down some exciting project because our digital infrastructure can’t support it. I would feel really bad if that were the case.

In fact, that was the case at one time, way back when we produced Spirit: Stallion of the Cimarron. Because it was such a big movie from a consumer products standpoint, we were asked to make another movie for direct-to-video. But we couldn't do it; we just didn’t have the capacity, so we had to just say, “No.” We turned away a project because we weren’t capable of doing it. The time it would take us to spin up a project like that would have been six months.

The world is great for us today, because people want content -- they want to consume it on their phone, on their laptop, on the side of buildings and in theaters. People are looking for more content everywhere.

Yet projects for varied content platforms require different amounts of compute and infrastructure, so we want to be able to create content quickly and avoid building to peak, which is too expensive. We want to be able to be flexible with infrastructure in order to take advantage of those opportunities.

Gardner: How is the agility in your infrastructure helping you reach the right creative balance? I suppose it’s similar to what we did 30 years ago with simultaneous engineering, where we would design a physical product for manufacturing, knowing that if it didn't work on the factory floor, then what's the point of the design? Are we doing that with digital manufacturing now?

Artifact analytics improve usage, rendering

Wike: It’s interesting that you mention that. We always look at budgets, and budgets can be money budgets, they can be rendering budgets, they can be storage budgets, and networking -- I mean all of those things are commodities that are required to create a project.

Artists, managers, production managers, directors, and producers are all really good at managing those projects if they understand what the commodity is. Years ago we used to complain about disk space: “You guys are using too much disk space.” And our production department would say, “Well, give me a tool to help me manage my disk space, and then I can clean it up. Don’t just tell me it's too much.”

One of the initiatives that we have incorporated in recent years is in the area of data analytics. We re-architected our software and we decided we would re-instrument everything. So we started collecting artifacts about rendering and usage. Every night we ran every digital asset that had been created through our rendering, and we also collected analytics about it. We now collect 1.2 billion artifacts a night.

And we correlate that information to a specific asset, such as a character, basket, or chair -- whatever it is that I am rendering -- as well as where it’s located, which shot it’s in, which sequence it’s in, and which characters are connected to it. So, when an artist wants to render a particular shot, we know what digital resources are required to be able to do that.

One of the things that’s wasteful of digital resources is either having a job that doesn't fit the allocation that you assign to it, or not knowing when a job is complete. Some of these rendering jobs and simulations will take hours and hours -- it could take 10 hours to run.

At what point is it stuck? At what point do you kill that job and restart it because something got wedged and it was a dependency? And you don't really know, you are just watching it run. Do I pull the plug now? Is it two minutes away from finishing, or is it never going to finish?

Just the facts

Before, an artist would go in every night and conduct a test render. And they would say, “I think this is going to take this much memory, and I think it's going to take this long.” And then we would add a margin of error, because people are not great judges, as opposed to a computer. This is where we talk about going from feeling to facts.

So now we don't have artists do that anymore, because we are collecting all that information every night. We have machine learning that then goes in and determines requirements. Even though a certain shot has never been run before, it is very similar to another previous shot, and so we can predict what it is going to need to run.
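The interview does not say which machine learning technique the studio uses. As a rough illustration of predicting requirements from similar earlier shots, the sketch below swaps in a plain k-nearest-neighbors average. The `ShotRecord` type, the feature vector, and the `margin` parameter are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class ShotRecord:
    # Hypothetical numeric descriptors of a shot (for example asset count
    # or frame count), assumed to be scaled to comparable ranges already.
    features: Tuple[float, ...]
    peak_memory_gb: float
    render_hours: float


def predict_requirements(history: Sequence[ShotRecord],
                         features: Tuple[float, ...],
                         k: int = 3,
                         margin: float = 1.1) -> Tuple[float, float]:
    """Estimate (memory_gb, hours) from the k most similar past shots."""
    if not history or k < 1:
        raise ValueError("need at least one past shot and k >= 1")
    # Rank past shots by Euclidean distance to the new shot's features.
    nearest = sorted(history,
                     key=lambda r: math.dist(r.features, features))[:k]
    memory = sum(r.peak_memory_gb for r in nearest) / len(nearest)
    hours = sum(r.render_hours for r in nearest) / len(nearest)
    # A small safety margin stands in for the artist's old padding.
    return memory * margin, hours * margin


history = [
    ShotRecord((1.0, 2.0), 16.0, 2.0),
    ShotRecord((1.1, 2.1), 20.0, 3.0),
    ShotRecord((9.0, 9.0), 64.0, 10.0),
]
print(predict_requirements(history, (1.0, 2.0), k=2))
```

A production system would presumably draw on far richer features taken from the nightly artifacts. The point here is only that a shot that has never run can borrow its estimate from its closest relatives.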

Now, if a job is stuck, we can kill it with confidence. By doing that machine learning and taking the guesswork out of the allocation of resources, we were able to save 15 percent of our render time, which is huge.
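One way a runtime prediction could turn "Do I pull the plug now?" into a rule is sketched below. It is an assumption for illustration, not the studio's actual policy: a job is treated as stuck once it has run well past its predicted duration. The `RunningJob` type and the `tolerance` factor are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class RunningJob:
    name: str
    elapsed_hours: float
    predicted_hours: float  # assumed to come from a prediction step


def jobs_to_kill(jobs: Iterable[RunningJob],
                 tolerance: float = 1.5) -> List[str]:
    """Return names of jobs that have overrun their prediction."""
    if tolerance < 1.0:
        raise ValueError("tolerance must be at least 1.0")
    # A job counts as stuck when elapsed time exceeds the predicted
    # runtime multiplied by the tolerance factor.
    return [j.name for j in jobs
            if j.elapsed_hours > j.predicted_hours * tolerance]


queue = [
    RunningJob("sim_cloth_0042", elapsed_hours=16.0, predicted_hours=10.0),
    RunningJob("render_sq12_s030", elapsed_hours=9.5, predicted_hours=10.0),
]
print(jobs_to_kill(queue))
```

With a 10-hour prediction and a tolerance of 1.5, the first job is flagged at 16 hours while the second, still inside its predicted window, is left alone.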

I recently listened to a gentleman talk about what a difference a 1 percent improvement would make. So 15 percent is huge; that's 15 percent less money you have to spend. It's 15 percent faster time for a director to be able to see something. It's 15 percent more iterations. So that was really huge for us.

Gardner: It sounds like you are in the digital manufacturing equivalent of working smarter and not harder. With more intelligence, you can free up the art, because you have nailed the science when it comes to creating something.

Creative intelligence at the edge

Wike: It's interesting; we talk about intelligence at the edge and the Internet of Things (IoT), and that sort of thing. In my world, the edge is actually an artist. If we can take intelligence about their work, the computational requirements that they have, and if we can push that data -- that intelligence -- to an artist, then they are actually really, really good at managing their own work.

It's only a problem when they don't have any idea that six months from now it's going to cause a huge increase in memory usage or render time. When they don't know that, it's hard for them to be able to self-manage. But now we have artists who can access Tableau reports every day and see exactly what the memory usage was or the compute usage of any of the assets they’ve created, and they can correct it immediately.

Megamind, a film DreamWorks Animation released several years ago, was made prior to having the data analytics in place, and the studio encountered massive rendering spikes on certain shots. We really didn't understand why.

After the movie was complete, when we could go back and get printouts of logs to analyze, we determined that these peaks in rendering resources were caused by the main character’s watch. Whenever the watch was in a frame, the render times went up. We looked at the models, and well-intended artists had taken a model of a watch and every gear was modeled, and it was just a huge, heavy asset to render.

But it was too late to do anything about it. But now, if an artist were to create that watch today, they would quickly find out that they had really over-modeled that watch. We would then need to go in and reduce that asset down, because it's really not a key element to the story. And they can do that today, which is really great.
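The watch story suggests a simple check that per-asset analytics make possible: compare render times for shots that contain an asset with render times for shots that do not. The sketch below is a hypothetical illustration with made-up numbers and asset names, not a description of the studio's reports.

```python
from statistics import median
from typing import Dict, FrozenSet, Sequence, Tuple

Shot = Tuple[float, FrozenSet[str]]  # (render_minutes, assets in the shot)


def flag_heavy_assets(shots: Sequence[Shot],
                      ratio_threshold: float = 2.0) -> Dict[str, float]:
    """Map each suspicious asset to its with/without render-time ratio."""
    all_assets = set()
    for _, assets in shots:
        all_assets |= assets
    flagged = {}
    for asset in all_assets:
        with_asset = [m for m, a in shots if asset in a]
        without = [m for m, a in shots if asset not in a]
        # An asset present in every shot has no baseline to compare to.
        if not without or median(without) <= 0:
            continue
        ratio = median(with_asset) / median(without)
        if ratio >= ratio_threshold:
            flagged[asset] = ratio
    return flagged


# Made-up render times; "watch" appears only in the slow shots.
shots = [
    (10.0, frozenset({"hero"})),
    (12.0, frozenset({"hero", "chair"})),
    (40.0, frozenset({"hero", "watch"})),
    (44.0, frozenset({"hero", "watch", "chair"})),
]
print(flag_heavy_assets(shots))
```

A comparison like this only shows correlation. An artist would still need to open the model to confirm that it is over-detailed.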

Gardner: I am a big fan of animated films, and I am so happy that my kids take me to see them because I enjoy them as much as they do. When you mention an artist at the edge, it seems to me it’s more like an army at the edge, because I wait through the end of the movie, and I look at the credits scroll -- hundreds and hundreds of people at work putting this together.

So you are dealing with not just one artist making a decision, you have an army of people. It's astounding that you can bring this level of data-driven efficiency to it.

Movie-making’s mobile workforce

Wike: It becomes so much more important, too, as we become a more mobile workforce. 

Now it becomes imperative to be able to obtain the information about what those artists are doing so that they can collaborate. We know what value we are really getting from that, and so much information is available now. If you capture it, you can find so many things that we can really understand better about our creative process to be able to drive efficiency and value into the entire business.

Gardner: Before we close out, maybe a look into the crystal ball. With things like auto-scaling and composable infrastructure, where do we go next with computing infrastructure? As you say, it's now all these great screens in people's hands, handling high-definition, all the networks are able to deliver that, clearly almost an unlimited opportunity to bring entertainment to people. What can you now do with the flexible, efficient, optimized infrastructure? What should we expect?

Wike: There's an explosion in content and an explosion in delivery platforms. We are exploring all kinds of different mediums. I mean, there’s really no limit to where and how one can create great imagery. The ability to do that, the ability to not say “No” to any project that comes along is going to be a great asset.

We always say that we don't know in the future how audiences are going to consume our content. We just know that we want to be able to supply that content and ensure that it’s the highest quality that we can deliver to audiences worldwide.

Gardner: It sounds like you feel confident that the infrastructure you have in place is going to be able to accommodate whatever those demands are. The art and the economics are the variables, but the infrastructure is not.

Wike: Having a software-defined environment is essential. I came from the software side; I started as a programmer, so I am coming back into my element. I really believe that now that you can compose infrastructure, you can change things with software without having to have people go in and rewire or re-stack, but instead change on-demand. And with machine learning, we’re able to learn what those demands are.

I want the computers to actually optimize and compose themselves so that I can rest knowing that my infrastructure is changing, scaling, and flexing in order to meet the demands of whatever we throw at it.