Hadoop is one of the most popular big data frameworks in use today; its popularity has made the name almost synonymous with big data. Hadoop was also one of the first frameworks of its kind, which is one of the reasons its adoption is so widespread.
Hadoop was created in 2006 by Doug Cutting and Mike Cafarella, based on Google's published designs for the Google File System and MapReduce. Yahoo, an early backer of the project, started running Hadoop on a 1,000-node cluster the following year. Hadoop then became a top-level open source project at the Apache Software Foundation in 2008.
With Hadoop, it’s possible to store big data in a distributed environment so that it can be processed in parallel. Hadoop has two core components. The first is the Hadoop Distributed File System (HDFS), which stores data of various formats in blocks spread across a cluster.
YARN (Yet Another Resource Negotiator) is the second component, and it’s tasked with resource management. It handles all of the processing activity by allocating resources and scheduling tasks, which enables parallel processing over the data.
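As a rough mental model of that division of labour, here is a toy Python sketch. It is not Hadoop's actual API; the node names and functions are invented for illustration. HDFS decides where the blocks of a file live, and YARN later schedules one task per block, preferring the node that already stores that block (data locality):

```python
# Toy model of the HDFS/YARN split. Real Hadoop is far more involved;
# this only illustrates the two responsibilities described above.

BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size: 128 MB

def hdfs_place_blocks(file_size, nodes):
    """Storage side (HDFS): split a file into blocks and spread them
    round-robin across data nodes."""
    n_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
    return {b: nodes[b % len(nodes)] for b in range(n_blocks)}

def yarn_schedule(block_locations):
    """Processing side (YARN): create one sub-task per block and run it
    on the node that already stores that block (data locality)."""
    return [(f"task-{b}", node) for b, node in block_locations.items()]

nodes = ["node1", "node2", "node3"]
placement = hdfs_place_blocks(file_size=400 * 1024 * 1024, nodes=nodes)
tasks = yarn_schedule(placement)
print(placement)  # a 400 MB file becomes 4 blocks spread over 3 nodes
print(tasks)      # one task per block, scheduled where the block lives
```

In real clusters the NameNode tracks block placement and the YARN ResourceManager negotiates with per-node NodeManagers, but the shape of the interaction is the same.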
The ease with which Hadoop can be scaled up is one of its biggest advantages. The framework is built on the principle of horizontal scalability: big data is stored and distributed across many servers that operate in parallel. Nodes can be added to a Hadoop cluster on the fly, so the cluster can grow quickly as storage and processing needs grow.
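The essence of horizontal scaling can be shown with a small back-of-the-envelope sketch (hypothetical numbers, not real cluster-management code): for a fixed workload, every node that joins the cluster reduces the share of work each node must do.

```python
# Hypothetical sketch of horizontal scaling: the same workload spreads
# thinner as nodes are added, so each node processes fewer blocks.

def work_per_node(total_blocks, nodes):
    """Blocks each node must process, assuming an even spread."""
    return -(-total_blocks // len(nodes))  # ceiling division

blocks = 1200
print(work_per_node(blocks, ["n1", "n2", "n3"]))        # 400 blocks each
print(work_per_node(blocks, ["n1", "n2", "n3", "n4"]))  # 300 blocks each
```

Because processing happens where the data lives, adding a node increases storage and compute capacity at the same time.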
Hadoop is open source, which means its source code is freely available. Anyone can take the code and modify it to suit their specific requirements. This is one of the reasons why Hadoop remains such a widely used big data framework.
Hadoop can process very large amounts of data quickly thanks to its distributed storage and processing architecture. Input files are divided into blocks that are stored across several nodes, and the jobs that users submit are likewise divided into sub-tasks that worker nodes run in parallel.
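That split-and-parallelize pattern can be sketched in miniature with Python's standard library. This mimics only the shape of MapReduce; real Hadoop distributes the work across machines, and the names below are illustrative:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Miniature MapReduce-style word count: the input is split into "blocks",
# each block is mapped in parallel, and the partial results are reduced.
# Real Hadoop runs each map task on a node that stores the block.

def map_task(block):
    """Count words within a single block (the 'map' phase)."""
    return Counter(block.split())

def reduce_task(partials):
    """Merge per-block counts into one result (the 'reduce' phase)."""
    total = Counter()
    for p in partials:
        total.update(p)
    return total

blocks = ["big data big", "cluster data data"]  # stand-in for HDFS blocks

with ThreadPoolExecutor() as pool:
    partials = list(pool.map(map_task, blocks))

print(reduce_task(partials))  # Counter({'data': 3, 'big': 2, 'cluster': 1})
```

The map tasks never need to see each other's blocks, which is exactly what makes this model scale across many machines.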
Perturbed by small data
Hadoop is strictly a big data framework, and unlike some other frameworks it does not work well with small data. Hadoop has no trouble handling a small number of very large files, but it can’t deal efficiently with a large number of small files.
Any file that’s smaller than Hadoop’s block size, typically 128 MB or 256 MB, still occupies its own metadata entry. Enough such files can overload the NameNode, which keeps all file system metadata in memory, and disrupt the framework’s operation.
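The pressure small files put on the NameNode can be estimated with a back-of-the-envelope calculation. A commonly cited rule of thumb is roughly 150 bytes of NameNode heap per file system object (file, directory, or block); the exact figure varies by version and configuration, so treat this sketch as illustrative:

```python
# Rough NameNode heap estimate: one metadata object per file plus one
# per block, at an assumed ~150 bytes each (rule of thumb, not exact).

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB HDFS block size
BYTES_PER_OBJECT = 150           # assumed heap cost per metadata object

def namenode_heap_bytes(n_files, file_size):
    blocks_per_file = max(1, -(-file_size // BLOCK_SIZE))
    return n_files * (1 + blocks_per_file) * BYTES_PER_OBJECT

TB, GB, MB = 1024 ** 4, 1024 ** 3, 1024 ** 2

# The same 1 TB of data, stored as 1 GB files vs. as 1 MB files:
large = namenode_heap_bytes(n_files=TB // GB, file_size=GB)
small = namenode_heap_bytes(n_files=TB // MB, file_size=MB)
print(large // MB, "MB of heap")  # 1 MB of heap
print(small // MB, "MB of heap")  # 300 MB of heap
```

Same data volume, roughly two hundred times the metadata: this is why a flood of small files can exhaust the NameNode long before the DataNodes run out of disk.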
Security concerns
Hadoop has remained one of the most widely used big data frameworks despite some security concerns.
Those concerns largely stem from the fact that Hadoop is written in Java, a very widely used programming language whose vulnerabilities are relatively well known to, and easily exploited by, cyber criminals.
Higher processing overhead
Hadoop is a batch processing engine at its core. All data is read from disk and written back to it, which makes read and write operations quite expensive, particularly when the framework deals with petabytes of data.
This processing overhead arises because Hadoop cannot keep data in memory between processing steps; intermediate results are written to disk rather than cached.
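The cost of that disk-bound chaining can be illustrated with a toy two-stage pipeline (a hedged sketch: a temp file stands in for HDFS, and the function names are invented). The disk version persists its intermediate result and reads it back, as chained MapReduce jobs do, while the in-memory version simply passes the data along:

```python
import json
import os
import tempfile

# Toy two-stage pipeline. disk_pipeline() mimics chained MapReduce jobs:
# the intermediate result is serialized to disk and read back.
# memory_pipeline() mimics an in-memory engine that skips the round trip.

def stage1(data):
    return [x * 2 for x in data]

def stage2(data):
    return [x + 1 for x in data]

def disk_pipeline(data):
    path = os.path.join(tempfile.mkdtemp(), "intermediate.json")
    with open(path, "w") as f:
        json.dump(stage1(data), f)   # write intermediate result to disk
    with open(path) as f:
        intermediate = json.load(f)  # ...then read it back for stage 2
    return stage2(intermediate)

def memory_pipeline(data):
    return stage2(stage1(data))      # intermediate stays in memory

print(disk_pipeline([1, 2, 3]))    # [3, 5, 7]
print(memory_pipeline([1, 2, 3]))  # [3, 5, 7]
```

Both pipelines compute the same answer; the difference is the serialize-write-read cycle between stages, which is exactly the overhead that in-memory engines were designed to eliminate.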