Storage Hangout: Hadoop Plug-in Refresh Release and Ambari Project with Erin Boyd

In today’s Storage Hangout, Brian catches up with Erin Boyd, principal software engineer at Red Hat Big Data, to learn about the Hadoop Plug-in Refresh Release. You can see the detailed release information here.

Q: Tell us more about the Hadoop Plug-in Refresh release. What can the big data community expect from this new release?

A: For the latest release of the plug-in, we have joined forces with our partner, Hortonworks, to release version 2.1 of the Hortonworks Data Platform for Hadoop using Ambari.

Q: What is Ambari and why should organizations use it?

A: Ambari is the only open source Hadoop management tool in the ASF and the preferred tool Hortonworks uses to release its Data Platform. Ambari is used to provision, manage, and monitor Hadoop clusters. It provides a wizard to easily deploy clusters across many different hosts, allowing the user to select and configure their Hadoop services and taking much of the guesswork out of the equation. By providing the services in a ‘stack’ configuration, the user can be confident that the services they select will work properly when deployed together.

It also provides a rich REST API that developers can use to integrate Ambari into their own management and monitoring applications.
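As a minimal sketch of what that REST API looks like, the snippet below builds an authenticated request against an Ambari server. The host `ambari.example.com`, port, and `admin`/`admin` credentials are all hypothetical placeholders; substitute your own deployment's values.

```python
# A minimal sketch of calling Ambari's REST API from Python, assuming an
# Ambari server at ambari.example.com:8080 with admin/admin credentials
# (all hypothetical -- substitute your own).
import base64
import urllib.request

AMBARI_BASE = "http://ambari.example.com:8080/api/v1"

def ambari_request(path, user="admin", password="admin"):
    """Build an authenticated GET request for an Ambari API path."""
    req = urllib.request.Request(AMBARI_BASE + path)
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    # Ambari expects this header on API calls as a CSRF guard.
    req.add_header("X-Requested-By", "ambari")
    return req

# e.g. fetch the list of clusters this server manages:
req = ambari_request("/clusters")
# urllib.request.urlopen(req) would return the JSON cluster list.
```

The same pattern works for any resource under `/api/v1`, such as a cluster's services or host components.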

For this release, my team was able to take the base Hortonworks Data Platform stack, or HDP stack, and extend it to use an alternative to the traditional HDFS storage. By offering our customers the ability to run analytics on their data in place on Red Hat Storage, customers no longer have to ingest their data into Hadoop. It’s a huge win.

In addition, the Hadoop ecosystem is in a constant state of flux, with new services introduced often to address different big data use cases. Where Hadoop was once synonymous with MapReduce, that is no longer the only way to analyze big data sets.

Q: You mentioned the Hadoop landscape is changing…how is Ambari embracing these changes?

New services in the HDP 2.1.GlusterFS stack:

  • New services from tech preview
    • Tez
    • HBase
  • Security
    • Identity Manager integration for the application of LDAP and Kerberos
  • User experience
    • Simplified installation experience
    • Improved presence of GlusterFS on UI for better customization

The Ambari community has been a leader in expanding the traditional Hadoop stack to encompass many new technologies. In the 2.1 version we are releasing today with the Hadoop plug-in refresh release, we have services like Tez, Hive, and HBase, to name a few, alongside the traditional MapReduce services, expanding the options for our big data customers. We have expanded the stack by five new services since our tech preview release a few months ago.

We have also been leaders in driving adoption of the Hadoop Compatible File System standard by the Hadoop community. Another Apache project, Bigtop, on which our team has a committer, Jay Vyas, has created a standard for non-HDFS file systems. Bigtop is helping push the envelope on standardizing these services and on their continuous integration and testing. It’s an exciting project and we are glad to be part of it. The next release of Ambari will also be able to deploy a Bigtop stack.

Q: How much does Red Hat participate in the Ambari project?

A: Currently I lead a team of three other developers, all of whom are Ambari committers. Red Hat has the second-most committers on the Ambari project, behind Hortonworks.

Q: What are some new features we can expect to see in Ambari in the coming months?

A: There are some huge changes going on in Ambari that we expect to release in the first quarter of 2015. One feature I think is really going to be a win for Storage customers is the release of Ambari Views. Views add a more user-centric dimension to Ambari that was previously missing: they let us create pluggable interfaces within Ambari focused on what our customers want and need to see to better understand their data and the issues in their cluster, helping our Storage customers get the most out of Hadoop. In addition, the next release will have seven new services to continue expanding analytics capability and to create more secure clusters. It is a true testament to the agility of the Ambari community that it embraces and implements new technologies right on the cutting edge. We are excited to continue partnering with Hortonworks to bring these technologies to our Red Hat customers.


Announcing Hortonworks Data Platform 2.1 on Red Hat Storage 3.0.2


We are pleased to announce the availability of the Hortonworks Data Platform (HDP) 2.1 on Red Hat Storage 3.0.2. This joint solution between Red Hat and Hortonworks allows you to bring the full analytical power of the Hadoop Ecosystem to the data within your Red Hat Storage cluster.

Red Hat Storage (RHS) is a great choice for a central storage repository. Because it is a software-defined, scale-out architecture that runs on commodity hardware, it helps reduce your total cost of ownership, simplifies managing storage capacity, and allows you to grow your cluster at your own pace. For these reasons we think that eventually you’ll end up with a significant amount of data within your RHS cluster, and you’ll likely want to begin analyzing it.

It is common for RHS users to start with Python and R analysis on their RHS data, because these tools perform well and work intuitively with a distributed filesystem that is POSIX compliant and written in C. However, it is also common to want to explore Hadoop for the analytical power it brings with parallel processing, its ability to simplify ETL, and the opportunity to use it as a lower-cost data warehouse. Rather than procuring new software and hardware infrastructure for an additional Hadoop cluster, you can use Hadoop directly on the data within the RHS cluster by installing the Hortonworks Hadoop distribution directly on the RHS cluster. This solution configures RHS as the storage tier for Hadoop in lieu of HDFS. It works much as HDP normally works with HDFS, because the HDP components are collocated on the same servers as the RHS components, which allows Hadoop to maintain data locality when scheduling analytical workloads.
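Swapping HDFS for RHS comes down to pointing Hadoop's filesystem configuration at the GlusterFS plug-in. The fragment below is a sketch of what that looks like in `core-site.xml`; the volume name `HadoopVol` and its mount point are illustrative placeholders, and the installation guide mentioned below remains the authoritative source for the exact property names and values in your release.

```xml
<!-- core-site.xml: pointing Hadoop at a GlusterFS volume instead of HDFS.
     A sketch only; "HadoopVol" and its FUSE mount point are examples. -->
<property>
  <name>fs.glusterfs.impl</name>
  <value>org.apache.hadoop.fs.glusterfs.GlusterFileSystem</value>
</property>
<property>
  <name>fs.defaultFS</name>
  <value>glusterfs:///</value>
</property>
<property>
  <name>fs.glusterfs.volumes</name>
  <value>HadoopVol</value>
</property>
<property>
  <name>fs.glusterfs.volume.fuse.HadoopVol</name>
  <value>/mnt/glusterfs/HadoopVol</value>
</property>
```

With `fs.defaultFS` set to a `glusterfs:///` URI, MapReduce and the other HDP services read and write the RHS volume as if it were their native filesystem.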

Given that we initially shipped HDP 2.0.6 support with the GA release of RHS 3.0 in September, HDP 2.1 support marks the second release of our joint solution with Hortonworks. In this release we’ve added support for Tez and HBase, augmented the security capabilities by validating the solution with RHEL Identity Management (LDAP and Kerberos integration), and improved the user experience by simplifying the installation and configuration process.

If you’re interested in trying this out, simply follow the instructions in Chapter 7 of the RHS Installation Guide.

Questions and Answers, part one: Red Hat Storage Server 3, Ceph, and Gluster

You may recall we recently launched Red Hat Storage Server 3 (learn more about that here). We had a lot of questions arise during our keynote that we weren’t able to answer at the time due to the live nature of the broadcast, so we’ve collected all the questions we received and compiled a series of blog posts around them. We’ll be sharing them over the next few days…starting with this post.


Competitor Comparison

What proof points do you have for the storage TCO compared to either legacy methods or your competitors?

IDC published a white paper that details this information. Check out The Economics of Software-based Storage report.

How are RHS and Ceph on commodity hardware positioned for advantage against big storage vendors like NetApp, EMC, HDS, etc.?

Red Hat and open source give customers freedom of choice in hardware, which helps them drive down costs. The scale-out architectures of Red Hat Storage Server and Red Hat’s Inktank Ceph Enterprise are better suited for massive data growth without requiring up-front investment. In addition, because we run on industry-standard hardware and combine Red Hat Enterprise Linux with GlusterFS as the underlying storage OS, the storage nodes can also be used to run some infrastructure applications, which helps reduce datacenter footprint and costs.

How does this differ from NetApp and EMC? What advantages does it have over these big storage players?

NetApp has no real scale-out storage solution. EMC does have Isilon for scale-out file storage which is a proprietary appliance with similar features, but at significantly higher costs. Red Hat Storage also provides converged (computing and storage) capabilities, whereas NetApp and EMC do not.

Can you provide Red Hat’s definition for Software Defined Storage and how the capacity and security mechanisms in SDS are improved & differentiated compared to traditional storage?

Software-defined storage decouples the storage hardware from the control layer. The common SDS approach is to use standard hardware and develop advanced features in the software layer instead. RHS uses RHEL as the underlying storage OS, which provides military-grade security features. Using a mainstream OS like RHEL also means that more customers overall are using it, and when security issues are discovered they can receive fixes immediately. As a recent example, it took Red Hat a couple of hours to fix a very widespread and dangerous security issue in the Bash shell (Shellshock), whereas quite a few UNIX, Linux, or FreeBSD based deployments and appliances still don’t have a fix for this issue as of today.

How do you measure the advantage your system provides?

We can measure three factors: cost, scalability, and performance. Our costs are on average up to 60% lower than our competition’s, we can now scale to more than 30 PB in a single pool, and we can linearly increase throughput by just adding additional storage nodes.

What are the advantages of RHSSv3 compared to ZFS solutions like Nexenta?

ZFS is a great file system, but it runs on a single node and therefore can’t really scale out; it instead uses the scale-up approach (adding CPU, cache, SSD). This is fine for small to medium environments, but brings the same limitations as proprietary legacy storage appliances: they don’t scale.

How is software-defined storage different from storage hypervisors or storage virtualization?

Red Hat Storage Server and Red Hat’s Inktank Ceph Enterprise use virtualization approaches as well, but they go beyond that capability and provide many more features. The most commonly used storage virtualization technologies are block based and provide just larger virtual block pools, which usually require more expensive and complex fibre-channel based storage. RHS pools and virtualizes filesystems across commodity storage servers; there is no need for a shared storage system or fibre channel, as it uses the disks in each of the storage servers. We use an algorithmic approach to virtualization, which tends to scale better than classic storage virtualization technologies that have to go through a controller appliance or metadata server.


Storage Hangout: Live from Strata + Hadoop World – Barcelona, Spain (11/20)

Tune into the latest episode of the Storage Hangout, broadcasting live from Strata + Hadoop World in Barcelona, Spain. Brian hangs out with Greg Kleiman, Director of Red Hat Big Data, to discuss the latest news and announcements coming out of the Strata + Hadoop World conference.

Greg touches on the long-term vision for big data analytics and the partner alliances with Cloudera and the Hadoop ecosystem, and provides a glimpse into Red Hat’s 2015 Big Data strategy.

BrightTALK Webinar Highlights: The New Shape of Software Defined Storage

Did you have a chance to catch the BrightTALK webinar “The New Shape of Software Defined Storage”? If not, you missed out. It was hosted by Taneja Group Sr. Analyst Mike Matchett, and it explored the rapidly expanding world of Software Defined Storage with a panel of the hottest SDS vendors in the market, including:

  • Gridstore: Founder & CTO: Kelly Murphy
  • HP: Director, Product Marketing & Management – SDS: Rob Strechay
  • IBM: Director, Storage and SDE Strategy: Sam Werner
  • Red Hat: Product Marketing Lead, Big Data: Irshad Raihan

Some of the highlights from the webinar, which Irshad touches on, include:

  • New workloads are driving innovation in software-defined storage. With the acquisition of Inktank, Red Hat is now poised to tackle any type of semi-structured or unstructured data across multiple formats such as file, block, and object.
  • The latest release, Red Hat Storage Server 3.0, combines the best-in-class, enterprise-grade Linux server platform Red Hat Enterprise Linux 6 with GlusterFS 3.6 to create an open, software-defined, massively scalable, high-performance, highly available, and cost-effective storage offering.
  • This release of the storage server introduces volume snapshots, monitoring using Nagios, increased usable capacity per storage server, Hadoop workload support, non-disruptive upgrades from the previous version, and supportability enhancements to address your data protection and storage management challenges for unstructured and big data storage.
  • The storage server is rigorously qualified to meet exacting performance and scale demands for next-generation enterprise and cloud storage deployments and is tightly integrated with Red Hat’s scalable XFS file system.
  • Key features of the latest RHSS3 release:
    • Local snapshots for disk based backup
    • Monitoring using Nagios
    • In place analytics for Hadoop workloads
    • Non disruptive management and upgrades

To watch the webinar on-demand, tune in here.

