Wednesday, September 16, 2015

Spark



What is Apache Spark?

Apache Spark is an open-source, distributed, general-purpose cluster-computing framework with an in-memory data-processing engine. It can perform ETL, analytics, machine learning and graph processing on large volumes of data at rest (batch processing) or in motion (stream processing), and it exposes rich, concise, high-level APIs in Scala, Python, Java, R, and SQL.
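
As a rough illustration of these high-level APIs, here is a minimal word count in Scala. It is only a sketch: the HDFS input and output paths and the object/application names are assumptions made for this example, not anything prescribed by Spark.

import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch of Spark's high-level API: word count over a text file.
// The HDFS paths below are illustrative assumptions.
object SparkWordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SparkWordCount"))

    val counts = sc.textFile("hdfs:///user/UserName/hadoop/sample.txt") // read lines from HDFS
      .flatMap(_.split("\\s+"))   // split each line into words
      .map(word => (word, 1))     // pair each word with a count of 1
      .reduceByKey(_ + _)         // sum the counts per word

    counts.saveAsTextFile("hdfs:///user/UserName/hadoop/wordcounts")
    sc.stop()
  }
}

Packaged into a jar, this could be launched with spark-submit; the same lines (without the object/main boilerplate) can also be typed directly into spark-shell.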




The history..

Spark is an open source project that has been built and is maintained by a thriving and diverse community of developers. Spark started in 2009 as a research project in the UC Berkeley RAD Lab, later to become the AMPLab. It was observed that MapReduce was inefficient for some iterative and interactive computing jobs, and Spark was designed in response. Spark’s aim is to be fast for interactive queries and iterative algorithms, bringing support for in-memory storage and efficient fault recovery. Iterative algorithms have always been hard for MapReduce, requiring multiple passes over the same data.
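
To make the in-memory point concrete, here is a hedged sketch of an iterative computation that caches an RDD once and then reuses it on every pass instead of re-reading the data from disk, which is exactly the pattern that is expensive with plain MapReduce. The input path, step size and iteration count are assumptions for illustration.

import org.apache.spark.{SparkConf, SparkContext}

// Sketch of an iterative computation that reuses cached data on every pass.
// The input path, step size and iteration count are illustrative assumptions.
object IterativeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("IterativeSketch"))

    // One numeric value per line; cache() keeps the parsed data in memory
    // after the first pass so later iterations do not re-read HDFS.
    val data = sc.textFile("hdfs:///user/UserName/hadoop/numbers.txt")
      .map(_.trim.toDouble)
      .cache()

    var w = 0.0
    for (_ <- 1 to 10) {
      // Each pass scans the cached RDD in memory; with MapReduce each pass
      // would typically be a separate job reading the input from disk again.
      val grad = data.map(x => x - w).mean()
      w += 0.5 * grad
    }
    println(s"estimate after 10 iterations: $w")
    sc.stop()
  }
}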




Friday, September 4, 2015

Hadoop Version 1 Commands


Print the Hadoop version

hadoop version


List the contents of the root directory in HDFS

hadoop fs -ls /


List all the hadoop file system shell commands

hadoop fs


'help' command in hadoop

hadoop fs -help


To display the amount of space used and available on HDFS (either of these commands works)

hadoop dfsadmin -report
hadoop fs -df -h


Count the number of directories, files and bytes under the paths that match the specified file pattern

hadoop fs -count hdfs:/


Run DFS filesystem checking utility

hadoop fsck /


Run cluster balancing utility

hadoop balancer


Create a new directory named “hadoop” in your home directory

hadoop fs -mkdir /user/UserName/hadoop


Add a sample text file from the local directory named “data” to the new directory you created in HDFS (in above step)

hadoop fs -put data/sample.txt /user/UserName/hadoop


List the contents of this new directory in HDFS

hadoop fs -ls /user/UserName/hadoop


Add an entire local directory to the "/user/UserName/hadoop" directory in HDFS

hadoop fs -put data/someDir /user/UserName/hadoop


Any path that is not absolute is interpreted as relative to your home directory in HDFS. To list all the files in your home directory

hadoop fs -ls


Check how much space "localDir" directory occupies in HDFS

hadoop fs -du -s -h hadoop/localDir


Delete a file ‘someFile’ from HDFS

hadoop fs -rm hadoop/localDir/someFile


Delete all files in a directory using a wildcard

hadoop fs -rm hadoop/localDir/*


To empty the trash

hadoop fs -expunge


Remove the entire "localDir" directory and all of its contents in HDFS

hadoop fs -rm -r hadoop/localDir


Copy a local file "someFile.txt" to a directory you created in HDFS
(a relative path is used in the example below)

hadoop fs -copyFromLocal /home/UserName/someFile.txt hadoop/


To view the contents of your text file "someFile.txt" which is present in the hadoop directory (on HDFS)

hadoop fs -cat hadoop/someFile.txt


Copy the "someFile.txt" from HDFS to a “data” directory in the local directory

hadoop fs -copyToLocal hadoop/someFile.txt /home/UserName/data


'cp' is used to copy files between directories present in HDFS
(all .txt files are copied below)

hadoop fs -cp /user/UserName/*.txt /user/UserName/hadoop


The ‘get’ command may be used as an alternative to ‘copyToLocal’

hadoop fs -get hadoop/sample.txt /home/UserName/


To display the last kilobyte of the file “someFile.txt”

hadoop fs -tail hadoop/someFile.txt 


‘chmod’ command may be used to change the permissions of a file
(new files are created with mode 666 masked by the cluster umask, i.e. 644 with the default umask of 022)

hadoop fs -chmod 600 hadoop/someFile.txt


‘chown’ may be used to change the owner and group of the file

hadoop fs -chown root:root hadoop/someFile.txt


‘chgrp’ command to change group name

hadoop fs -chgrp training hadoop/someFile.txt


'mv' command to move a directory from one location to another

hadoop fs -mv old_loc new_loc


'setrep' may be used to set the replication factor of a file
(The default replication factor of a file is 3; in the example below we set it to 2)

hadoop fs -setrep -w 2 hadoop/someFile.txt


To copy a directory from one cluster (or NameNode) to another, use the ‘distcp’ command
(use the -overwrite option to overwrite existing files and the -update option to synchronize the two directories)

hadoop distcp hdfs://namenodeA/hadoop hdfs://namenodeB/hadoop


Command to make the name node leave safe mode

hdfs dfsadmin -safemode leave

Sunday, August 23, 2015

Hadoop



What is Apache Hadoop?

Apache Hadoop is an open-source software framework for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.

The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part called MapReduce. Hadoop splits files into large blocks and distributes them across nodes in a cluster. To process data, Hadoop transfers packaged code to the nodes, which process it in parallel based on the data each node holds (i.e. the code goes to the data).
Apache Hadoop's MapReduce and HDFS components were inspired by Google papers on their MapReduce and Google File System.

What does core Apache Hadoop framework consist of?

The base Apache Hadoop framework is composed of the following modules:
  • Hadoop Common – contains libraries and utilities needed by other Hadoop modules
  • Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster
  • Hadoop YARN (Yet Another Resource Negotiator; Hadoop 2.0+) – a resource-management platform responsible for managing computing resources in clusters and using them to schedule users' applications
  • Hadoop MapReduce – a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
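
As a rough sketch of the MapReduce programming model just listed, here is the classic word count written against the org.apache.hadoop.mapreduce API. It is written in Scala purely for illustration (the same job is more commonly written in Java); the class names and the input/output arguments are assumptions, not part of Hadoop.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
import scala.collection.JavaConverters._

// Map step: emit (word, 1) for every word in the input split.
class TokenMapper extends Mapper[Object, Text, Text, IntWritable] {
  private val one = new IntWritable(1)
  private val word = new Text()
  override def map(key: Object, value: Text,
                   context: Mapper[Object, Text, Text, IntWritable]#Context): Unit = {
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { w =>
      word.set(w)
      context.write(word, one)
    }
  }
}

// Reduce step: sum the counts emitted for each word.
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    val sum = values.asScala.map(_.get).sum
    context.write(key, new IntWritable(sum))
  }
}

// Driver: configures the job and points it at the input and output paths.
object WordCountJob {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "word count")
    job.setJarByClass(this.getClass)
    job.setMapperClass(classOf[TokenMapper])
    job.setReducerClass(classOf[SumReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))
    FileOutputFormat.setOutputPath(job, new Path(args(1)))
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}

Packaged into a jar (with the Scala library bundled), it would be run with something like: hadoop jar wordcount.jar WordCountJob <input dir> <output dir>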
However, Hadoop today refers to a whole ecosystem that includes additional software packages (or plugins) built around Apache Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Phoenix, Apache Spark, Apache ZooKeeper, Cloudera Impala, Apache Flume, Apache Sqoop, Apache Oozie, Apache Storm, etc., which make ingestion, processing, storage and management of data easier.


Apache Hadoop Architecture

Hadoop consists of the Hadoop Common package, which provides filesystem and OS level abstractions, a MapReduce engine (either MapReduce/MR1 or YARN/MR2) and the Hadoop Distributed File System (HDFS). 
The Hadoop packages/jars must be present on all nodes in the Hadoop cluster, and password-less SSH connections must be established among them. Hadoop requires Java Runtime Environment (JRE) 1.6 or higher.


Generally, a Hadoop cluster is managed by a master node, which is the NameNode (the server hosting the file-system index) and which also contains the JobTracker (the server that manages job scheduling across nodes). All other (slave/worker) nodes in the cluster act as DataNodes (the data storage) and also contain TaskTrackers (which execute the MapReduce jobs on the data). Additionally, a Secondary NameNode is usually set up; it periodically takes snapshots (checkpoints) of the NameNode's namespace metadata, which limits metadata loss if the NameNode fails (it is a checkpointing helper, not a hot standby).

HDFS stores large files across multiple machines. It achieves reliability by replicating the data across multiple hosts. The default replication factor(value) is 3, so data is stored on three nodes: two on the same rack, and one on a different rack. Data nodes communicate with each other to re-balance data and to keep the replication of data high.

HDFS was designed for mostly immutable files (i.e. files that are written once and not updated) and may not be suitable for systems requiring concurrent write operations. This is because, when a seek-and-update is performed, the location information must first be obtained from the NameNode, the relevant file-system block located, and a buffer of blocks (usually much more than is strictly required) read in before the operation can be performed. So when a specific position in a block is accessed on HDFS, many more blocks may be read off the disk than one might expect.

The MapReduce engine consists of one JobTracker, to which client applications submit MapReduce jobs. The JobTracker pushes work out to available TaskTracker nodes in the cluster, striving to keep the work as close to the data as possible. With a rack-aware file system, the JobTracker knows which node contains the data and which other machines are nearby. If the work cannot be hosted on the actual node where the data resides (or the processing takes too long), priority is given to nodes in the same rack. This reduces traffic on the main backbone network.

The HDFS file system is not restricted to MapReduce jobs; it may be used with many other packages depending on the requirement. It works best for batch processing, but it can complement a real-time system such as Apache Storm, Spark or Flink in a Lambda or Kappa architecture.


Hadoop Distributions

Apache Hadoop - The standard open-source Hadoop distribution consists of the Hadoop Common package, the MapReduce engine and HDFS.
Vendor distributions are designed to overcome issues with the open source edition and provide additional value to customers, with focus on things such as:

  • Reliability - Vendors promptly deliver fixes and patches to bugs.
  • Support - They provide technical assistance, which makes it possible to adopt the platforms for enterprise-grade tasks.
  • Completeness - Hadoop distributions are very often supplemented with other tools that address specific tasks, and vendors provide compatible plugins that may be necessary for various workloads.
Most popular Hadoop distributions (as of 2015):

There are many other distributions available; a few of them even offer cloud implementations of Hadoop. A detailed list may be found here: Distributions and Commercial Support





Commercial Applications of Hadoop

A few of the commercial applications of Hadoop are (source):

  1. Log and/or clickstream analysis of various kinds
  2. Marketing analytics
  3. Machine learning and/or sophisticated data mining
  4. Image processing
  5. Processing of XML messages
  6. Web crawling and/or text processing
  7. General archiving, including of relational/tabular data.

    Hadoop commands reference:

              Wednesday, August 12, 2015

              Big Data & Analytics!



              What is "Big Data"?


              Google says big data is "extremely large data sets that may be analysed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions".

              Wikipedia says big data is "a term for data sets that are so large or complex that traditional data processing applications are inadequate".

              In simple words, it's just lots of data. The data may be structured, semi-structured or unstructured.


              What's the "Analytics" in Big Data Analytics?


              SAS says, "Big data analytics examines large amounts of data to uncover hidden patterns, correlations and other insights".

              In simple words, it means deriving insights (useful information) from large data sets. Big data analytics helps organizations harness their data and use it to identify new opportunities. That, in turn, leads to smarter business moves, more efficient operations, higher profits and happier customers.

              Big Data Word Cloud


              How BIG is "Big Data"?


              Today, big data may be a few petabytes (2^50 bytes) of data, but in a few years it may be measured in zettabytes (2^70 bytes) and perhaps later in yottabytes (2^80 bytes).
              So, we cannot strictly quantify "Big Data".


              What is the Big Data problem?


              In 2001, Gartner analyst Doug Laney described three dimensions of data management challenges. This characterization, which addresses volume, velocity, and variety, is frequently cited in the scientific literature. Commonly known as the 3 V's, they can be understood as follows:

              1. Volume refers to the size of the data. The massive volumes of data that we are currently dealing with have required scientists to rethink storage and processing paradigms in order to develop the tools needed to properly analyze it.
              2. Velocity refers to the speed at which data can be received and analyzed. Here, we consider two types of processing: batch processing, which processes historical data, and real-time processing, which deals with processing streams of data in real time (actually near real time).
              3. Variety refers to the issue of disparate and incompatible data formats. Data can come in from many different sources and take on many different forms, and just preparing it for analysis takes a significant amount of time and effort.
              Additionally, a fourth V is also said to be a factor of the Big Data problem. The fourth V is Veracity, which refers to the uncertainty of the data. An info-graphic representing this is shown below:




              Why do I hear "Big Data" so often these days?


              Big data is not something that started a few days ago. It's just that we have started realizing its importance and how it can be used to generate useful insights.
              Big Data & Analytics can help companies in many ways like:
              • Reduce Expenses
              • Better decision making
              • Help in launching new features and products



              What are the tools or frameworks associated with Big Data?


              Various tools and frameworks related to Big Data are:
              • Hadoop
              • Spark
              • Storm
              • NoSQL Databases
              • and many more..

              A brief history:

              << Posts related to various topics coming up soon >>

              Monday, January 12, 2015

              Yarn Commands


              Yarn commands are invoked by the bin/yarn script. Running the yarn script without any arguments prints the description for all commands.


               Usage: yarn [--config confdir] COMMAND

              jar

              Runs a jar file. Users can bundle their Yarn code in a jar file and execute it using this command.

               Usage: yarn jar <jar> [mainClass] args...
              
              
              
              

              application

              Prints application report(s) or kills an application
                Usage: yarn application <options>
              
              
              
              
              <options>                  Description
              -list                      Lists applications from the RM. Supports optional use of -appTypes to filter applications based on application type, and -appStates to filter applications based on application state.
              -appStates <States>        Works with -list to filter applications based on an input comma-separated list of application states. The valid application states are: ALL, NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED
              -appTypes <Types>          Works with -list to filter applications based on an input comma-separated list of application types.
              -status <ApplicationId>    Prints the status of the application with the given <ApplicationId>
              -kill <ApplicationId>      Kills the application with the given <ApplicationId>

              node

              Prints node report(s)


                Usage: yarn node <options>
              
              
              <options>          Description
              -list              Lists all running nodes. Supports optional use of -states to filter nodes based on node state, and -all to list all nodes.
              -states <States>   Works with -list to filter nodes based on an input comma-separated list of node states.
              -all               Works with -list to list all nodes.
              -status <NodeId>   Prints the status report of the node.

              logs

              Dump the container logs


                Usage: yarn logs -applicationId <application ID> <options>
              
              
              
              
              <options>                        Description
              -applicationId <ApplicationId>   Specifies an application id
              -appOwner <AppOwner>             Specifies the application owner (assumed to be the current user if not specified)
              -containerId <ContainerId>       Specifies a container id (must be specified if a node address is specified)
              -nodeAddress <NodeAddress>       Node address in the format nodename:port (must be specified if a container id is specified)

              classpath

              Prints the class path needed to get the Hadoop jar and the required libraries

                Usage: yarn classpath

              version

              Prints the version.

                Usage: yarn version