Wednesday, September 16, 2015

Spark



What is Apache Spark?

Apache Spark is an open-source, general-purpose distributed cluster-computing framework with an in-memory data-processing engine. It can perform ETL, analytics, machine learning, and graph processing on large volumes of data at rest (batch processing) or in motion (stream processing), and it offers rich, concise, high-level APIs in Scala, Python, Java, R, and SQL.




The history

Spark is an open-source project built and maintained by a thriving and diverse community of developers. It started in 2009 as a research project in the UC Berkeley RAD Lab, which later became the AMPLab. The researchers observed that MapReduce was inefficient for some iterative and interactive computing jobs, and Spark was designed in response. Spark aims to be fast for interactive queries and iterative algorithms, with support for in-memory storage and efficient fault recovery. Iterative algorithms have always been hard for MapReduce, which requires multiple passes over the same data.




Friday, September 4, 2015

Hadoop Version 1 Commands


Print the Hadoop version

hadoop version


List the contents of the root directory in HDFS

hadoop fs -ls /


List all the hadoop file system shell commands

hadoop fs


Display help for the Hadoop file system shell commands

hadoop fs -help


To display the amount of space used and available on the HDFS

hadoop dfsadmin -report
(or)
hadoop fs -df -h


Count the number of directories, files and bytes under the paths that match the specified file pattern

hadoop fs -count hdfs:/


Run DFS filesystem checking utility

hadoop fsck /


Run cluster balancing utility

hadoop balancer


Create a new directory named “hadoop” in your home directory

hadoop fs -mkdir /user/UserName/hadoop


Add a sample text file from the local directory named “data” to the new directory you created in HDFS (in above step)

hadoop fs -put data/sample.txt /user/UserName/hadoop


List the contents of this new directory in HDFS

hadoop fs -ls /user/UserName/hadoop


Add an entire local directory to the "/user/UserName" directory in HDFS

hadoop fs -put data/someDir /user/UserName/hadoop


Any path that is not absolute is interpreted relative to your home directory. To list all the files in your home directory

hadoop fs -ls


Check how much space "localDir" directory occupies in HDFS

hadoop fs -du -s -h hadoop/localDir


Delete a file ‘someFile’ from HDFS

hadoop fs -rm hadoop/localDir/someFile


Delete all files in a directory using a wildcard

hadoop fs -rm hadoop/localDir/*


To empty the trash

hadoop fs -expunge


Remove the entire "localDir" directory and all of its contents in HDFS

hadoop fs -rm -r hadoop/localDir


Copy a local file "someFile.txt" to a directory you created in HDFS
(relative path is used in below example)

hadoop fs -copyFromLocal /home/UserName/someFile.txt hadoop/


To view the contents of your text file "someFile.txt" which is present in the hadoop directory (on HDFS)

hadoop fs -cat hadoop/someFile.txt


Copy the "someFile.txt" from HDFS to a “data” directory in the local directory

hadoop fs -copyToLocal hadoop/someFile.txt /home/UserName/data


'cp' is used to copy files between directories present in HDFS
(all .txt files are copied below)

hadoop fs -cp /user/UserName/*.txt /user/UserName/hadoop


The ‘get’ command may be used as an alternative to ‘-copyToLocal’

hadoop fs -get hadoop/sample.txt /home/UserName/


To display the last kilobyte of the file “someFile.txt”

hadoop fs -tail hadoop/someFile.txt 


‘chmod’ command may be used to change permissions of a file
(new files are created with mode 666 masked by the umask, typically resulting in 644)

hadoop fs -chmod 600 hadoop/someFile.txt


‘chown’ may be used to change the owner and group of the file

hadoop fs -chown root:root hadoop/someFile.txt


‘chgrp’ command to change group name

hadoop fs -chgrp training hadoop/someFile.txt


'mv' command to move a directory from one location to another

hadoop fs -mv old_loc new_loc


'setrep' may be used to set the replication factor of a file
(The default replication factor of a file is 3; in the example below we set it to 2)

hadoop fs -setrep -w 2 hadoop/someFile.txt


To copy a directory from one cluster to another, use the ‘distcp’ command
(use the -overwrite option to overwrite existing files and the -update option to synchronize both directories)

hadoop distcp hdfs://namenodeA/hadoop hdfs://namenodeB/hadoop


Command to make the name node leave safe mode

hadoop dfsadmin -safemode leave