Wednesday, September 16, 2015

Spark



What is Apache Spark?

Apache Spark is an open-source, general-purpose distributed cluster-computing framework with an in-memory data-processing engine. It can perform ETL, analytics, machine learning, and graph processing on large volumes of data at rest (batch processing) or in motion (stream processing), and it offers rich, concise, high-level APIs in Scala, Python, Java, R, and SQL.




The history

Spark is an open-source project built and maintained by a thriving and diverse community of developers. It started in 2009 as a research project in the UC Berkeley RAD Lab, which later became the AMPLab. The researchers observed that MapReduce was inefficient for some iterative and interactive computing jobs, and Spark was designed in response. Spark aims to be fast for interactive queries and iterative algorithms, with support for in-memory storage and efficient fault recovery. Iterative algorithms have always been hard for MapReduce, which requires multiple passes over the same data.




Friday, September 4, 2015

Hadoop Version 1 Commands


Print the Hadoop version

hadoop version


List the contents of the root directory in HDFS

hadoop fs -ls /


List all the hadoop file system shell commands

hadoop fs


Display help for the Hadoop file system shell commands

hadoop fs -help


To display the amount of space used and available on the HDFS

hadoop dfsadmin -report
(or)
hadoop fs -df -h


Count the number of directories, files and bytes under the paths that match the specified file pattern

hadoop fs -count hdfs:/


Run DFS filesystem checking utility

hadoop fsck /


Run cluster balancing utility

hadoop balancer


Create a new directory named “hadoop” in your home directory

hadoop fs -mkdir /user/UserName/hadoop


Add a sample text file from the local directory named “data” to the new directory you created in HDFS (in above step)

hadoop fs -put data/sample.txt /user/UserName/hadoop


List the contents of this new directory in HDFS

hadoop fs -ls /user/UserName/hadoop


Add an entire local directory to the "/user/UserName" directory in HDFS

hadoop fs -put data/someDir /user/UserName/hadoop


Any path that is not absolute is interpreted relative to your home directory. To list all the files in your home directory

hadoop fs -ls


Check how much space "localDir" directory occupies in HDFS

hadoop fs -du -s -h hadoop/localDir


Delete a file ‘someFile’ from HDFS

hadoop fs -rm hadoop/localDir/someFile


Delete all files in a directory using a wildcard

hadoop fs -rm hadoop/localDir/*


To empty the trash

hadoop fs -expunge


Remove the entire "localDir" directory and all of its contents in HDFS

hadoop fs -rm -r hadoop/localDir


Copy a local file "someFile.txt" to a directory you created in HDFS
(relative path is used in below example)

hadoop fs -copyFromLocal /home/UserName/someFile.txt hadoop/


To view the contents of your text file "someFile.txt" which is present in the hadoop directory (on HDFS)

hadoop fs -cat hadoop/someFile.txt


Copy the "someFile.txt" from HDFS to a “data” directory in the local directory

hadoop fs -copyToLocal hadoop/someFile.txt /home/UserName/data


'cp' is used to copy files between directories present in HDFS
(all .txt files are copied below)

hadoop fs -cp /user/UserName/*.txt /user/UserName/hadoop


The ‘get’ command may be used as an alternative to ‘-copyToLocal’

hadoop fs -get hadoop/sample.txt /home/UserName/


To display the last kilobyte of the file “someFile.txt”

hadoop fs -tail hadoop/someFile.txt 


‘chmod’ command may be used to change permissions of a file
(new files are created with mode 666 masked by the umask, typically resulting in 644)

hadoop fs -chmod 600 hadoop/someFile.txt


‘chown’ may be used to change the owner and group of the file

hadoop fs -chown root:root hadoop/someFile.txt


‘chgrp’ command to change group name

hadoop fs -chgrp training hadoop/someFile.txt


'mv' command to move a directory from one location to another

hadoop fs -mv old_loc new_loc


'setrep' may be used to set the replication factor of a file
(The default replication factor of a file is 3; in the example below we set it to 2)

hadoop fs -setrep -w 2 hadoop/someFile.txt


To copy a directory from one cluster to another, use the ‘distcp’ command
(use the -overwrite option to overwrite existing files and the -update option to synchronize both directories)

hadoop distcp hdfs://namenodeA/hadoop hdfs://namenodeB/hadoop


Command to make the name node leave safe mode

hadoop dfsadmin -safemode leave