Tuesday, November 21, 2017

Inside story on HPC's role in the Bridges Research Project at Pittsburgh Supercomputing Center

The next BriefingsDirect Voice of the Customer high-performance computing (HPC) success story interview examines how Pittsburgh Supercomputing Center (PSC) has developed a research computing capability, Bridges, and how that's providing new levels of analytics, insights, and efficiencies.

We'll now learn how advances in IT infrastructure and memory-driven architectures are combining to meet the new requirements for artificial intelligence (AI), big data analytics, and deep machine learning.

Listen to the podcast. Find it on iTunes. Get the mobile app. Read a full transcript or download a copy.

Here to describe the inside story on building AI Bridges are Dr. Nick Nystrom, Interim Director of Research, and Paola Buitrago, Director of AI and Big Data, both at Pittsburgh Supercomputing Center. The discussion is moderated by Dana Gardner, Principal Analyst at Interarbor Solutions.

Here are some excerpts:


Gardner: Let's begin with what makes Bridges unique. What is it about Bridges that is possible now that wasn't possible a year or two ago?

Nystrom: Bridges allows people who have never used HPC before to use it for the first time. These are people in business, social sciences, different kinds of biology and other physical sciences, and people who are applying machine learning to traditional fields. They're using the same languages and frameworks that they've been using on their laptops and now that is scaling up to a supercomputer. They are bringing big data and AI together in ways that they just haven't done before.

Gardner: It almost sounds like the democratization of HPC. Is that one way to think about it?

Nystrom: It very much is. We have users who are applying tools like R and Python and scaling them up to very large memory -- up to 12 terabytes of random access memory (RAM) -- and that enables them to gain answers to problems they've never been able to answer before.

Gardner: There is a user experience aspect, but I have to imagine there are also underlying infrastructure improvements that also contribute to user democratization.

Nystrom: Yes, democratization comes from two things. First, we stay closely in touch with the user community and we look at this opportunity from their perspective first. What are the applications that they need to run? What do they need to do? And from there, we began to work with hardware vendors to understand what we had to build, and, what we came up with is a very heterogeneous system.

We have three tiers of nodes having memories ranging from 128 gigabytes to 3 terabytes, to 12 terabytes of RAM. That's all coupled on the same very-high-performance fabric. We were the first installation in the world with the Intel Omni-Path interconnect, and we designed that in a custom topology that we developed at PSC expressly to make big data available as a service to all of the compute nodes with equally high bandwidth, low latency, and to let these new things become possible.
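As a rough illustration of the three memory tiers described above, the following minimal Python sketch picks the smallest tier whose RAM covers a job's stated memory need. The tier names, the function name, and the use of 1 TB = 1,024 GB are assumptions made for illustration only; they are not taken from PSC's scheduler or documentation.

```python
# Illustrative sketch only. Memory sizes are the ones quoted in the interview;
# the tier names and the 1 TB = 1024 GB convention are hypothetical.
TIERS_GB = [
    ("regular-memory", 128),
    ("large-memory", 3 * 1024),
    ("extreme-memory", 12 * 1024),
]


def smallest_tier_for(required_gb):
    """Return the name of the smallest tier that fits the request, or None."""
    for name, capacity_gb in TIERS_GB:  # tiers are listed smallest first
        if required_gb <= capacity_gb:
            return name
    return None  # the request exceeds the largest node


if __name__ == "__main__":
    for need_gb in (64, 500, 8000, 20000):
        print(need_gb, "GB ->", smallest_tier_for(need_gb))
```

A job that fits in 128 GB would land on the smallest tier, while a genome assembly needing several terabytes would be routed to one of the larger-memory tiers.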

Gardner: What other big data analytics benefits have you gained from this platform?

Buitrago: A platform like Bridges enables that which was not available before. There's a use case that was recently described by Tuomas Sandholm, [Professor and Director of the Electronic Marketplaces Lab at Carnegie Mellon University. It involves strategic machine learning using Bridges HPC to play and win at Heads-Up, No-limit Texas Hold'em poker as a capabilities benchmark.]

This is a perfect example of something that could not have been done without a supercomputer. A supercomputer enables massive and complex models that can actually give an accurate answer.

Right now, we are collecting a lot of data. There's a convergence of having great capabilities right in the compute and storage -- and also having the big data to answer really important questions. Having a system like Bridges allows us to, for example, analyze all that there is on the Internet, and put the right pieces together to answer big societal or healthcare-related questions.

Gardner: The Bridges platform has been operating for some months now. Tell us some other examples or use cases that demonstrate its potential.

Dissecting disease through data

Nystrom: Paola mentioned use cases for healthcare. One example is a National Institutes of Health (NIH) Center of Excellence in the Big Data to Knowledge program called the Center for Causal Discovery.

They are using Bridges to combine very large data in genomics, such as lung-imaging data and brain magnetic resonance imaging (MRI) data, to come up with real cause-and-effect relationships among those very large data sets. That was never possible before because the algorithms were not scaled. Such scaling is now possible thanks to very large memory architectures and because the data is available.

At CMU and the University of Pittsburgh, we have those resources now and people are making discoveries that will improve health. There are many others. One of these is on the Common Crawl data set, which is a very large web-scale data set that Paola has been working with.

Buitrago: Common Crawl is a data set that collects all the information on the Internet. The data is currently available on the Amazon Web Services (AWS) cloud in S3. They host these data sets for free. But, if you want to actually analyze the data, to search or create any index, you have to use their computing capabilities, which is a good option. However, given the scale and the size of the data, this is something that requires a huge investment.

So we are working on actually offering the same data set, putting it together with the computing capabilities of Bridges. This would allow the academic community at large to do such things as build natural language processing models, or better analyze the data -- and they can do it fast, and they can do it free of charge. So that's an important example of what we are doing and how we want to support big data as a whole.

Gardner: So far we’ve spoken about technical requirements in HPC, but economics plays a role here. Many times we've seen in the evolution of technology that as things become commercially available off-the-shelf technologies, they can be deployed in new ways that just weren’t economically feasible before. Is there an economics story here to Bridges?

Low-cost access to research

Nystrom: Yes, with Bridges we have designed the system to be extremely cost-effective. That's part of why we designed the interconnect topology the way we did. It was the most cost-effective way to build that for the size of data analytics we had to do on Bridges. That is a win that has been emulated in other places.

So, what we offer is available to research communities at no charge -- and that's for anyone doing open research. It's also available to the industrial sector at essentially a very attractive rate because it’s a cost-recovery rate. So, we do work with the private sector. We are looking to do even more of that in future.

Also, the future systems we are looking at will leverage lots of developing technologies. We're always looking at the best available technology for performance, for price, and then architecting that into a solution that will serve research.

Gardner: We’ve heard a lot recently from Hewlett Packard Enterprise (HPE) about their advances in large-scale memory processing and memory-driven architectures. How does that fit into your plans?

Nystrom: Large, memory-intensive architectures are a cornerstone of Bridges. We're doing a tremendous amount of large-scale genome sequence assembly on Bridges. That's individual genomes, and it’s also metagenomes with important applications such as looking at the gut microbiome of diabetic patients versus normal patients -- and understanding how the different bacteria are affected by and may affect the progression of diabetes. That has tremendous medical implications. We’ve been following memory technology for a very long time, and we’ve also been following various kinds of accelerators for AI and deep learning.

Gardner: Can you tell us about the underlying platforms that support Bridges that are currently commercially available? What might be coming next in terms of HPE Gen10 servers, for example, or with other HPE advances in the efficiency and cost reduction in storage? What are you using now and what do you expect to be using in the future?

Ever-expanding memory, storage

Nystrom: First of all, I think the acquisition of SGI by HPE was very strategic. Prior to Bridges, we had a system called Blacklight, which was the world’s largest shared-memory resource. It taught us how productive shared memory can be for new communities in terms of human productivity. We can’t scale smart humans, and so that’s essential.

In terms of storage, there are tremendous opportunities now for integrating storage-class memory, increasing degrees of flash solid-state drives (SSDs), and other stages. We’ve always architected our own storage systems, but now we are working with HPE to think about what we might do for our next round of this.

Gardner: For those out there listening and reading this information, if they hadn’t thought that HPC and big data analytics had a role in their businesses, why should they think otherwise?

Nystrom: From my perspective, AI is permeating all aspects of computing. The way we see AI as important in an HPC machine is that it is being applied to applications that were traditionally HPC only -- things like weather and protein folding. Those were apps that people used to run on just big iron.

Now, they are integrating AI to help them find rare events, to do longer-term simulations in less time. And they’ll be doing this across other industries as well. These will be enterprise workloads where AI has a key impact. It won’t necessarily turn companies into AI companies, but they will use AI as an empowering tool to make what they already do, better.

Gardner: An example, Nick?

Nystrom: A good example of the way AI is permeating other fields is what people are doing at the Institute for Precision Medicine, [a joint effort between the University of Pittsburgh and the University of Pittsburgh Medical Center], and the Carnegie Mellon University Machine Learning and Computational Biology Departments.

They are working together on a project called Big Data for Better Health. Their objective is to apply state of the art machine learning techniques, including deep learning, to integrated genomic patient medical records, imaging data, and other things, and to really move toward realizing true personalized medicine.

Gardner: We’ve also heard a lot recently about hybrid IT. Traditionally HPC required an on-premises approach. Now, to what degree does HPC-as-a-service make sense in order to take advantage of various cloud models?

Nystrom: That’s a very good question. One of the things that Bridges makes available through the democratizing of HPC is big data-as-a-service and HPC-as-a-service. And it does that in many cases by what we call gateways. These are web portals for specific domains.

At the Center for Causal Discovery, which I mentioned, they have the Causal Web. It’s a portal, it can run in any browser, and it lets people who are not experts with supercomputers access Bridges without even knowing they are doing it. They run applications with a supercomputer as the back-end.

Another example is Galaxy Project and Community Hub, which are primarily for bioinformatic workflows, but also other things. The main Galaxy instance is hosted elsewhere, but people can run very large memory genome assemblies on Bridges transparently -- again without even knowing. They don’t have to log in, they don’t have to understand Linux; they just run it through a web browser, and they can use HPC-as-a-service. It becomes very cloud-like at that point.

Super-cloud supercomputing

Buitrago: Depending on the use case, an environment like the cloud can make sense. HPC can be used for an initial stage, if you want to explore different AI models, for example. You can fine-tune your AI and benefit from having the data close. You can reduce the time to start by having a supercomputer available for only a week or two. You can find the right parameters, you get the model, and then when you are actually generating inferences you can go to the cloud and scale there. It supports high peaks in user demand. So, cloud and traditional HPC are complementary among different use cases, for what’s called for in different environments and across different solutions.

Gardner: Before we sign off, a quick look to the future. Bridges has been here for over a year, let's look to a year out. What do you expect to come next?

Nystrom: Bridges has been a great success. It's very heavily subscribed, fully subscribed, in fact. It seems to work; people like it. So we are looking to build on that. We're looking to extend that to a much more powerful engine where we’ve taken all of the lessons we've learned improving Bridges. We’d like to extend that by orders of magnitude, to deliver a lot more capability -- and that would be across both the research community and industry.

Gardner: And using cloud models, what should we look for in the future when it comes to a richer portfolio of big data-as-a-service offerings?

Buitrago: We are currently working on a project to make data more available to the general public and to researchers. We are trying to democratize data and let people do searches and inquiries and processing that they wouldn’t be able to do without us.

We are integrating big data sets that go from web crawls to genomic data. We want to offer them paired with the tools to properly process them. And we want to provide this to people who haven’t done this in the past, so they can explore their questions and try to answer them. That’s something we are really interested in and we look forward to moving into a production stage.


Sponsor: Hewlett Packard Enterprise.

Monday, November 20, 2017

How UBC gained TCO advantage via flash for its EduCloud cloud storage service

The next BriefingsDirect cloud efficiency case study explores how a storage-as-a-service offering in a university setting gains performance and lower total cost benefits by a move to all-flash storage.

We’ll now learn how the University of British Columbia (UBC) has modernized its EduCloud storage service and attained both efficiency as well as better service levels for its diverse user base.

Here to help us explore new breeds of storage-as-a-service (SaaS) solutions is Brent Dunington, System Architect at UBC in Vancouver. The discussion is moderated by Dana Gardner, Principal Analyst at Interarbor Solutions.

Here are some excerpts:

Gardner: How is satisfying the storage demands at a large and diverse university setting a challenge? Is there something about your users and the diverse nature of their needs that provides you with a complex requirements list? 

Dunington: A university setting isn't much different than any other business. The demands are the same. UBC has about 65,000 students and about 15,000 staff. The students these days are younger kids, they all have iPhones and iPads, and they just want to push buttons and get instant results and instant gratification. And that boils down to the services that we offer.

We have to be able to offer those services, because as most people know, there are choices -- and they can go somewhere else and choose those other products.

Our team is a rather small team. There are 15 members in our team, so we have to be agile, we have to be able to automate things, and we need tools that can work and fulfill those needs. So it's just like any other business, even though it’s a university setting.

Gardner: Can you give us a sense of the scale that describes your storage requirements?

Dunington: We do SaaS, and we also do infrastructure-as-a-service (IaaS). EduCloud is a self-service IaaS product that we deliver to UBC, but we also deliver it to 25 other higher-education institutions in the Province of British Columbia.

We have been doing IaaS for five years, and we have been very, very successful. So more people are looking to us for guidance.

Because we are not just delivering to UBC, we have to be up and running and always able to deliver, because each school has different requirements. At different times of the year -- because there is registration, there are exam times -- these things have to be up. You can’t not be functioning during an exam and have 600 students not able to take the tests that they have been studying for. So it impacts their life and we want to make sure that we are there and can provide the services for what they need.

Gardner: In order to maintain your service levels within those peak times, do you in your IaaS and storage services employ hybrid-cloud capabilities so that you can burst? Or are you doing this all through your own data center and your own private cloud?

On-Campus Cloud

Dunington: We do it all on-campus. British Columbia has a law that says all the data has to stay in Canada. It’s a data-sovereignty law, the data can't leave the borders.

That's why EduCloud has been so successful, in my opinion, because of that option. They can just go and throw things out in the private cloud.

The public cloud providers are providing more services in Canada: Amazon Web Services (AWS) and Microsoft Azure cloud are putting data centers in Canada, which is good and it gives people an option. Our team’s goal is to provide the services, whether it's a hybrid model or all on-campus. We just want to be able to fulfill those needs.

Gardner: It sounds like the best of all worlds. You are able to give that elasticity benefit, a lot of instant service requirements met for your consumers. But you are starting to use cloud pay-as-you-go types of models and get the benefit of the public cloud model -- but with the security, control and manageability of the private clouds.

What decisions have you made about your storage underpinnings, the infrastructure that supports your SaaS cloud?

Dunington: We have a large storage footprint. For our site, it’s about 12 petabytes of storage. We realized that we weren’t meeting the needs with spinning disks. One of the problems was that we had runaway virtual workloads that would cause problems, and they would impact other services. We needed some mechanism to fix that.

We went through the whole request for proposal (RFP) process, and all the IT infrastructure vendors responded, but we did have some guidelines that we wanted to go through. One of the things we did is present our problems and make sure that they understood what the problems were and what they were trying to solve.

And there were some minimum requirements. We do have a backup vendor of choice that they needed to merge with. And quality of service is a big thing. We wanted to make sure that we had the ability to attain quality of service levels and control those runaway virtual machines in our footprint.

Gardner: You gained more than just flash benefits when you got to flash storage, right?

Streamlined, safe, flash storage

Dunington: Yes, for sure. With an entire data center full of spinning disks, it gets to the point where the disks start to manage you; you are no longer managing the disks. With teams out there changing drives and moving volumes around, it becomes unwieldy. I mean, the power, the footprint, and all that starts to grow.

Also, Vancouver is in a seismic zone, we are right up against the Pacific plate and it's a very active seismic area. Heaven forbid anything happens, but one of the requirements we had was to move the data center into the interior of the province. So that was what we did.

When we brought this new data center online, one of the decisions the team made was to move to an all-flash storage environment. We wanted to be sure that it made financial sense because it's publicly funded, and also improved the user experience, across the province.

Gardner: As you were going about your decision-making process, you had choices, what made you choose what you did? What were the deciding factors?

Dunington: There were a lot of deciding factors. There’s the technology, of being able to meet the performance and to manage the performance. One of the things was to lock down runaway virtual machines and to put performance tiers on others.

But it’s not just the technology; it's also the business part, too. The financial part had to make sense. When you are buying any storage platform, you are also buying the support team and the sales team that come with it.

Our team believes that technology is a certain piece of the pie, and the rest of it is relationship. If that relationship part doesn't work, it doesn’t matter how well the technology part works -- the whole thing is going to break down.

Because software is software, hardware is hardware -- it breaks, it has problems, there are limitations. And when you have to call someone, you have to depend on him or her. Even though you bought the best technology and got the best price -- if it doesn't work, it doesn’t work, and you need someone to call.

So those service and support issues were all wrapped up into the decision.

We chose the Hewlett Packard Enterprise (HPE) 3PAR all-flash storage platform. We have been very happy with it. We knew the HPE team well. They came and worked with us on the server blade infrastructure, so we knew the team. The team knew how to support all of it. 

We also use the HPE OneView product for provisioning, and it all integrated into that. It also supported the performance optimization tool (IT Operations Management for HPE OneView) to let us set those values, because one of the things in EduCloud is customers choose their own storage tier, and we mark the price on it. So basically all we would do is present that new tier as new data storage within VMware and then they would just move their workloads across non-disruptively. So it has worked really well.

The 3PAR storage piece also integrates with VMware vRealize Operations Manager. We offer that to all our clients as a portal so they can see how everything is working and they can do their own diagnostics. Because that’s the one goal we have with EduCloud, it has to be self-service. We can let the customers do it, that's what they want.

Gardner: Not that long ago people had the idea that flash was always more expensive and that they would use it for just certain use-cases rather than pervasively. You have been talking in terms of a total cost of ownership reduction. So how does that work? How do the economics of this, over a period of time and taking everything into consideration, benefit you all?

Economic sense at scale

Dunington: Our IT team and our management team are really good with that part. They were able to break it all down, and they found that this model would work at scale. I don’t know the numbers per se, but it made economic sense.

Spinning disks will still have a place in the data center. I don't know a year from now if an all-flash data center will make sense, because there are some records that people will throw in and never touch. But right now with the numbers on how we worked it out, it makes sense, because we are using the standard bronze, the gold, the silver tiers, and with the tiers it makes sense.

The 3PAR solution also has dedupe functionality and the compression that they just released. We are hoping to see how well that trends. Compression has only been around for a short period of time, so I can’t really say, but the dedupe has done really well for us.

Gardner: The technology overcomes some of the other baseline economic costs and issues, for sure.

We have talked about the technology and performance requirements. Have you been able to qualify how, from a user experience, this has been a benefit?

Dunington: The best benchmark is the adoption rate. People are using it, and there are no help desk tickets, so no one is complaining. People are using it, and we can see that everything is ramping up, and we are not getting tickets. No one is complaining about the price, the availability. Our operational team isn't complaining about it being harder to manage or that the backups aren’t working. That makes me happy.

The big picture

Gardner: Brent, maybe a word of advice to other organizations that are thinking about a similar move to private cloud SaaS. Now that you have done this, what might you advise them to do as they prepare for or evaluate a similar activity?

Dunington: Look at the full picture, look at the total cost of ownership. There’s the buying of the hardware, and there's also supporting the hardware, too. Make sure that you understand your requirements and what your customers are looking for first before you go out and buy it. Not everybody needs that speed, not everybody needs that performance, but it is the future and things will move there. We will see in a couple of years how it went.

Look at the big picture, step back. It’s not just the new shiny toy, and you might have to take a stepped approach into buying, but for us it worked. I mean, it’s a solid platform, our team sleeps well at night, and I think our customers are really happy with it.

Gardner: This might be a little bit of a pun in the education field, but do your homework and you will benefit.

Dunington: Yes, for sure.

Sponsor: Hewlett Packard Enterprise.
