Security in Hadoop

by Jason Schlesinger

Data is growing at an increasing rate, and storing and processing it is a problem that present and future generations will have to deal with. Hadoop, Apache's open source implementation of Google's MapReduce, can scale both storage space and processing power almost indefinitely as datasets grow. It achieves this by distributing the data across its nodes and then distributing the work out to those same nodes. The data is processed in manageable chunks by 'mappers', and the results are then aggregated and processed further by 'reducers'.
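
The mapper and reducer roles described above can be illustrated with a small single-process sketch. This is not Hadoop's API: the function names (mapper, shuffle, reducer, run_job) and the word-count task are assumptions chosen only for illustration, and a real cluster would run the mappers on many nodes in parallel rather than in a loop.

```python
from collections import defaultdict


def mapper(chunk):
    """Emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group emitted values by key, standing in for the step between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Aggregate all values emitted for one key into a single result."""
    return key, sum(values)


def run_job(chunks):
    """Run the map phase over every chunk, then reduce the grouped output."""
    emitted = []
    for chunk in chunks:  # on a cluster, each chunk would go to a different node
        emitted.extend(mapper(chunk))
    return dict(reducer(key, values) for key, values in shuffle(emitted).items())


if __name__ == "__main__":
    # Hypothetical input, split into two chunks as if stored on two nodes.
    chunks = ["big data big cluster", "data moves to the cluster"]
    print(run_job(chunks))
```

Because each mapper sees only its own chunk and each reducer sees only the values for its own keys, adding nodes adds capacity without changing the logic of the job.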


Hadoop is becoming a key business tool because of its ability to process large datasets. Companies like Yahoo, IBM, Facebook, the New York Times, and e-Harmony already use Hadoop to varying degrees, and other companies are beginning to see its potential. The trend suggests that Hadoop will become one of the leading platforms for processing large quantities of data.


Unfortunately, as of version 0.19, Hadoop has security flaws that limit how data can be handled and what kind of data can be handled. First, HDFS, the file system that Hadoop runs on, has no read control. Second, Hadoop authenticates a user for access control by using the output of the 'whoami' command, which is not secure, since that output is reported by the client and can be spoofed. Third, HBase, the "database" that Hadoop uses, has no access control at all. Any company employing Hadoop needs to be aware of these issues and adopt security practices that work around them.
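
To see why an identity taken from 'whoami' is weak, consider a toy model in which the server simply trusts whatever user name the client reports. The class and method names below (TrustingServer, create, delete) and the path and user names are hypothetical; this is a sketch of the general weakness, not Hadoop's code.

```python
class TrustingServer:
    """Toy server that authorizes deletes by the user name the client reports."""

    def __init__(self):
        self.owners = {}  # path -> name of the user who created it

    def create(self, path, claimed_user):
        """Record the claimed user as the owner of the new path."""
        self.owners[path] = claimed_user

    def delete(self, path, claimed_user):
        """Delete the path if the claimed name matches the recorded owner."""
        # The only check is a string comparison against an unverified claim.
        if self.owners.get(path) != claimed_user:
            raise PermissionError(f"{claimed_user} does not own {path}")
        del self.owners[path]


if __name__ == "__main__":
    server = TrustingServer()
    server.create("/data/payroll", "alice")

    try:
        server.delete("/data/payroll", "mallory")
    except PermissionError as err:
        print("refused:", err)

    # Mallory controls her own machine, so she reports the name "alice" instead.
    server.delete("/data/payroll", "alice")
    print("remaining files:", list(server.owners))
```

The check looks like access control, but it protects only against honest clients. Anyone who can make their own machine report a different user name passes it.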