Hadoop Interview Questions | 405 Selected Interview Questions

405 Frequently Asked Hadoop Interview Questions and Answers

1. What is Hadoop?
Answer: Hadoop is a distributed computing platform written in Java. Its core features are a distributed file system (HDFS) and MapReduce processing.

2. What platform and Java version is required for running Hadoop?
Answer: Java 1.6.x or a higher version is required for Hadoop. BSD, Mac OS/X, Solaris, Linux and Windows are the supported operating systems for Hadoop.

3. What are the most common input formats defined in Hadoop?
Answer: The following are the most common input formats defined in Hadoop:
1. TextInputFormat
2. KeyValueInputFormat
3. SequenceFileInputFormat
TextInputFormat is the default input format.

4. What Hardware is best for Hadoop?
Answer: Hadoop runs on dual-processor/dual-core machines with 4-8 GB of RAM using ECC memory. The exact hardware depends on the workflow's needs.

5. Is it possible to provide multiple inputs to Hadoop? If yes, explain.
Answer: Yes, it is possible to provide multiple inputs to Hadoop.
The input format class provides methods for adding multiple directories as input to a Hadoop job.

6. What is the relation between job and task in Hadoop?
Answer: In Hadoop, a job is divided into multiple small parts called tasks.

7. What is distributed cache in Hadoop?
Answer: The MapReduce framework provides a distributed cache facility to cache files (text, archives, etc.) at the time of execution of the job. The framework copies the necessary files to the slave node before it executes any task on that node.
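For illustration only, a minimal sketch with the new MapReduce API (the file path, class name and record format are assumptions, not part of the original answer): the driver adds a lookup file already on HDFS with job.addCacheFile(new URI("/user/hadoop/lookup/countries.txt")), and the mapper reads it back in setup().

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> lookup = new HashMap<String, String>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Cached files are localized on the task node; they are normally symlinked
        // into the task's working directory under their base name.
        URI[] cacheFiles = context.getCacheFiles();
        String localName = new Path(cacheFiles[0].getPath()).getName();
        BufferedReader reader = new BufferedReader(new FileReader(localName));
        String line;
        while ((line = reader.readLine()) != null) {       // assumed format: "IN<TAB>India"
            String[] parts = line.split("\t");
            lookup.put(parts[0], parts[1]);
        }
        reader.close();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String code = value.toString().trim();
        String country = lookup.containsKey(code) ? lookup.get(code) : "UNKNOWN";
        context.write(new Text(code), new Text(country));
    }
}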

8. What commands are used to see all jobs running in the Hadoop cluster and kill a job in LINUX?
Answer: The following commands are used to see all jobs running in the Hadoop cluster and to kill a job:
hadoop job -list
hadoop job -kill jobID

9. How many instances of a JobTracker run on a Hadoop cluster?
Answer: The JobTracker is a service used for submitting and tracking MapReduce jobs in Hadoop.
Only one JobTracker process can run on any Hadoop cluster, and it runs within its own JVM process.

10. How does the JobTracker assign tasks to the TaskTracker?
Answer: The TaskTracker periodically sends heartbeat messages to the JobTracker to assure it that it is alive. These messages also notify the JobTracker of the number of available slots, which keeps the JobTracker informed of whether it can schedule a task on that TaskTracker.

11. Is it necessary to write jobs for Hadoop in Java language?
Answer: No, it is not necessary to write Hadoop jobs in the Java language; there are many other ways to deal with non-Java code. Hadoop Streaming allows any shell command or executable to be used as the map or reduce function.

12. What is Apache Hive?
Answer: Apache Hive is data warehouse software that facilitates managing and querying large data sets stored in distributed storage. Hive also permits custom mappers and reducers written as MapReduce programs when it is inefficient to run the logic in HiveQL.

13. How does Facebook use Hadoop, Hive and HBase?
Answer: Facebook data is stored on HDFS; numerous photos are uploaded daily to the Facebook servers, and Facebook Messages, Likes and status updates run on top of HBase. Hive generates reports for third-party developers and advertisers who need to measure the success of their campaigns or applications.

14. What is Input Block in Hadoop? Explain.
Answer: When a Hadoop job runs, it blocks (splits) the input files into chunks and assigns each split to a mapper for processing.

15. How many input blocks are made by the Hadoop framework?
Answer: With the default block size of 64 MB, the Hadoop framework makes the following blocks:
One block for a 64 KB file
Two blocks for a 65 MB file
Two blocks for a 127 MB file
64 MB is the default block size, and the block size is configurable.

16. What is the use of RecordReader in Hadoop?
Answer: The RecordReader class loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper. The RecordReader instance is defined by the InputFormat.

17. What is JobTracker in Hadoop?
Answer: The JobTracker is a service within Hadoop that monitors MapReduce jobs and assigns map and reduce tasks to the corresponding TaskTrackers on the data nodes.

18. What are the functionalities of JobTracker?
Answer: The following are the main tasks of the JobTracker:
Accept jobs from clients.
Communicate with the NameNode to determine the location of the data.
Locate TaskTracker nodes with available slots.
Submit the work to the selected TaskTracker node and monitor the progress of each task.

19. Define TaskTracker.
Answer: A TaskTracker is a node in the cluster that accepts tasks such as map, reduce and shuffle operations from a JobTracker.

20. What is Map/Reduce job in Hadoop?
Answer: MapReduce is a programming paradigm that allows massive scalability across thousands of servers. In the first step, the map job takes a set of data and converts it into another set of data (key/value pairs). In the second step, the reduce job takes the output from the map as its input and compresses those data tuples into a smaller set of tuples.
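As an illustration only (not part of the original answer), a minimal word-count sketch of these two steps using the new MapReduce API; the class names are illustrative:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map step: convert each input line into (word, 1) pairs.
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reduce step: take the map output and compress the tuples for each word into one (word, total) pair.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}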

21. What is Hadoop Streaming?
Answer: Hadoop Streaming is a utility that allows us to create and run map/reduce jobs. It is a generic API that allows programs written in virtually any language to be used as the Hadoop mapper or reducer.

22. What is a combiner in Hadoop?
Answer: A Combiner is a mini-reduce process that operates on the data generated by a Mapper. The Mapper emits the data, the Combiner receives it as input, and the Combiner sends its output on to the Reducer.
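A small hedged driver fragment, reusing the illustrative TokenizerMapper and IntSumReducer classes from the word-count sketch under question 20: because summing is associative, the same reducer class can also be registered as the combiner.

Job job = Job.getInstance(new Configuration(), "word count");
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);   // runs on each mapper's local output before the shuffle
job.setReducerClass(IntSumReducer.class);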

23. Is it necessary to know Java to learn Hadoop?
Answer: It is not strictly necessary, but it is very helpful to have a background in a programming language such as C, C++, PHP, Python or Java, along with basic knowledge of SQL; knowing Java makes working with Hadoop considerably easier.

24. How to debug Hadoop code?
Answer: The most popular ways of debugging Hadoop code are the following:
By using counters.
By using the web interface provided by the Hadoop framework.

25. What is YARN?
Answer: YARN is the next-generation MapReduce, introduced in the Hadoop 0.23 release. It overcomes the scalability issue by splitting the JobTracker's functionality in the MapReduce framework into a separate Resource Manager, which is not available in the classic MapReduce framework.

26. What is data serialization?
Answer: Serialization is the process of converting object data into a byte stream for transmission over the network across the different nodes of a cluster, or for persistent data storage.

27. What is deserialization of data?
Answer: Deserialization is the process of converting byte-stream data back into object data, for example to read data from HDFS. It is the inverse of serialization. Apache Hadoop provides Writable for serialization and deserialization purposes.

28. What are the key/value pairs in the MapReduce framework?
Answer: The MapReduce framework implements a data model in which data is represented as key/value pairs. Both the input and the output data of the MapReduce framework must be in key/value pairs only.

29. What are the constraints on the key and value classes in MapReduce?
Answer: To enable the field to be serialized and deserialized, any data type used for a key or value field in a mapper or reducer must implement org.apache.hadoop.io.Writable. Key fields additionally have to be comparable with each other, so they should implement Hadoop's org.apache.hadoop.io.WritableComparable interface, which in turn extends Hadoop's Writable interface.

30. What are the main components of MapReduce Job?
Answer: The main components of a MapReduce job are the following:
1. The main driver class
2. The Mapper class
3. The Reducer class

31. What are the key configuration parameters that a user is required to specify to run a MapReduce job?
Answer: The user of the MapReduce framework needs to specify the following to run a MapReduce job (a driver sketch follows this list):
1. The job's output location in the distributed file system.
2. The job's input location(s) in the distributed file system.
3. The input format.
4. The output format.
5. The class containing the map function.
6. The class containing the reduce function (this one is optional).
7. The JAR file containing the mapper, reducer and driver classes.
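A hedged driver sketch that sets each of the items above, reusing the illustrative TokenizerMapper and IntSumReducer classes from the word-count sketch under question 20:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);               // JAR containing the job's classes

        job.setMapperClass(TokenizerMapper.class);              // class containing the map function
        job.setReducerClass(IntSumReducer.class);               // class containing the reduce function (optional)

        job.setInputFormatClass(TextInputFormat.class);         // input format
        job.setOutputFormatClass(TextOutputFormat.class);       // output format
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // input location(s) in the distributed file system
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output location in the distributed file system

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}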

32. What is the difference between HBase and Hive?

Answer: Apache Hive is a data warehouse infrastructure built on top of Hadoop. Hive allows data stored on HDFS to be queried and analysed via HQL, an SQL-like language that is converted into MapReduce jobs. Hive does not provide interactive querying; it executes batch processes on Apache Hadoop.
Apache HBase is a NoSQL key/value store that runs on top of the Hadoop File System. HBase operations run in real time on its database rather than as MapReduce jobs. HBase is divided into tables, and the tables are again divided into column families, which are declared in the schema.

33. What is Hive Metastore?
Answer: The Hive Metastore is a database that stores the metadata of Hive tables, including the table name, data types, column names, table location, number of buckets, etc.

34. Which Hadoop versions does the new Hive version support?
Answer: The latest version of Hive referenced here, Hive 2.0, is supported on Hadoop 2.x.

35. Which are some of the big companies that use Hive?
Answer: Facebook and Netflix are among the big companies that use Hive extensively.

36. Whenever I run a Hive query from a different directory, it creates a new metastore_db. Please explain the reason for this.
Answer: Whenever we run Apache Hive in embedded mode, it creates a local metastore in the current directory. While creating the metastore, Hive checks whether a metastore has already been created or not. This behaviour is controlled in the configuration file hive-site.xml by the property "javax.jdo.option.ConnectionURL", whose default value is "jdbc:derby:;databaseName=metastore_db;create=true".

37. Is it possible for multiple users to use the same metastore in the case of embedded Hive?
Answer: No, the embedded metastore cannot be used in shared mode. For sharing, it is always necessary to use a standalone "real" database such as MySQL or PostgreSQL.

38. What is the usage of the Query Processor in Apache Hive?
Answer: The query processor implements the processing framework that translates SQL into a graph of MapReduce jobs.

39. Is multi line comment supported in HIVE Script?
Answer: No, multi-line comments are not supported in Hive scripts.

40. What is a Hive Metastore?
Answer: The Hive Metastore is a central repository that stores metadata in an external database.

41. Explain the SMB join in Hive.
Answer: In a Sort Merge Bucket (SMB) join in Hive, each mapper reads a bucket from the first table and the corresponding bucket from the second table, and then a merge-sort join is performed. SMB join is used mainly because there is no limit on file, partition or table size; it is very useful for very large tables. In an SMB join the tables must be bucketed and sorted on the join columns.

42. Explain the different types of joins in Hive.
Answer: HiveQL has the following types of joins:
JOIN – the same as an inner join in SQL.
FULL OUTER JOIN – combines the records of both the left and the right tables; rows are matched where the join condition is fulfilled, and unmatched rows from either side are also returned.
RIGHT OUTER JOIN – all the rows from the right table are returned even if there is no match in the left table.
LEFT OUTER JOIN – all the rows from the left table are returned even if there is no match in the right table.

43. What is the use of ObjectInspector?
Answer: ObjectInspector is used to analyse the internal structure of row objects and the structure of individual columns. ObjectInspector in Hive provides access to complex objects that can be stored in multiple formats.

44. Is it possible to change the default location of Managed Tables in Hive, if so how?
Answer: Yes, we can change the default location of managed tables by using the LOCATION keyword while creating the managed table. The user specifies the desired path of the managed table as the value of the LOCATION keyword.

45. How can you connect an application, if you run Hive as a server?
Answer: When Hive runs as a server, an application can connect in one of the following ways (a JDBC sketch follows the list):
1. ODBC driver – supports the ODBC protocol.
2. JDBC driver – supports the JDBC protocol.
3. Thrift client – used to make calls to all Hive commands from programming languages such as PHP, Python, Java, C++ and Ruby.
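A hedged JDBC sketch of option 2 (the host, port, database, table and user name are assumptions for illustration; HiveServer2 normally listens on port 10000):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");        // HiveServer2 JDBC driver
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hiveuser", "");
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM sample_table");
        while (rs.next()) {
            System.out.println("row count = " + rs.getLong(1));
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}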

46. Which classes are used by Hive to read and write HDFS files?

Answer: Hive uses the following classes to perform read and write operations:
TextInputFormat/HiveIgnoreKeyTextOutputFormat: these classes read and write data in the plain text file format.
SequenceFileInputFormat/SequenceFileOutputFormat: these classes read and write data in the Hadoop SequenceFile format.

47. What is IdentityMapper in Apache Hadoop?
Answer: IdentityMapper is the default Mapper class provided by Apache Hadoop. It does not perform any processing on the input data; it simply writes the input key/value pairs straight to the output. The IdentityMapper class name is org.apache.hadoop.mapred.lib.IdentityMapper.

48. What is IdentityReducer in Apache Hadoop?
Answer: IdentityReducer passes the input key/value pairs straight to the output directory. The IdentityReducer class name is org.apache.hadoop.mapred.lib.IdentityReducer. If no reducer class is specified in a MapReduce job, this class is taken up automatically by the job.

49. What is ChainMapper?
Answer: ChainMapper is an implementation of the Mapper class through which a number of mapper classes can be executed in a chain fashion within a single map task. The name of the ChainMapper class is org.apache.hadoop.mapreduce.lib.chain.ChainMapper.

50. What is ChainReducer?
Answer: ChainReducer is similar to ChainMapper: through it, a reducer followed by a number of mappers can be executed within a single reduce task. The name of the ChainReducer class is org.apache.hadoop.mapreduce.lib.chain.ChainReducer.

51. How do you specify multiple mapper classes and the reducer class when using the ChainMapper and ChainReducer classes?
Answer: In ChainMapper, the ChainMapper.addMapper() method is used to add mapper classes to the chain (see the driver fragment below).
In ChainReducer,
• the ChainReducer.setReducer() method is used to specify the single reducer class, and
• the ChainReducer.addMapper() method is used to add mapper classes that run after the reducer.
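A hedged driver fragment using the new-API chain classes (FirstMapper, SecondMapper, SumReducer and PostReduceMapper are illustrative class names, not part of the original answer):

import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;

// Pipeline: FirstMapper -> SecondMapper in the map task, then SumReducer -> PostReduceMapper in the reduce task.
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "chain example");

ChainMapper.addMapper(job, FirstMapper.class,
        LongWritable.class, Text.class, Text.class, Text.class, new Configuration(false));
ChainMapper.addMapper(job, SecondMapper.class,
        Text.class, Text.class, Text.class, Text.class, new Configuration(false));

ChainReducer.setReducer(job, SumReducer.class,
        Text.class, Text.class, Text.class, Text.class, new Configuration(false));
ChainReducer.addMapper(job, PostReduceMapper.class,
        Text.class, Text.class, Text.class, Text.class, new Configuration(false));
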
52. What is side data distribution in the MapReduce framework?
Answer: Side data is the extra read-only data required by a MapReduce job to perform its task on the main data set. In Hadoop there are two ways of making side data available to all the map or reduce tasks:
• Distributed cache
• Job configuration

53. How can side data be distributed using the job configuration?
Answer: Side data can be distributed by setting arbitrary key/value pairs in the job configuration, using the various setter methods on the Configuration object. Within a task, the data is read back from the Configuration returned by the getConfiguration() method of the context, as sketched below.
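A hedged sketch of this mechanism (the property name "myjob.currency" is an illustrative assumption):

// In the driver: stash a small piece of metadata in the job configuration.
Configuration conf = new Configuration();
conf.set("myjob.currency", "USD");
Job job = Job.getInstance(conf, "side data example");

// In a mapper or reducer: read it back via the context.
@Override
protected void setup(Context context) {
    String currency = context.getConfiguration().get("myjob.currency", "USD");
    // ... use the value while processing records ...
}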

54. When should side data distribution via the job configuration be used, and when should it not be used?
Answer: Side data distribution through the job configuration is useful only when the programmer needs to pass a small piece of metadata to the map or reduce tasks. This mechanism should not be used to move more than a few KB of data, because it puts pressure on memory usage, especially in a system running hundreds of jobs.

55. What is the distributed cache in MapReduce?
Answer: The distributed cache is another method of side data distribution: it copies files and archives to the task nodes in time for the tasks to use them when they run. To save network bandwidth, files are usually copied to any particular node only once per job.

56. How do you provide files or archives to a MapReduce job through the distributed cache mechanism?
Answer: Files that need to be distributed are specified as a comma-separated list of URIs as the argument to the -files option of the Apache Hadoop job command. The files can be on HDFS.
Archive files (tar files, ZIP files and gzipped tar files) are copied to the task nodes by the distributed cache through the -archives option.

57. Explain how the distributed cache works in the MapReduce framework.
Answer: When an Apache MapReduce job is submitted with distributed cache options, the node managers copy the files specified by the -archives, -files and -libjars options from the distributed cache to a local storage disk. The local.cache.size property is used to configure the cache size on the node managers' local storage disks. The data is localized under the ${hadoop.tmp.dir}/mapred/local directory.

58. What will Apache Hadoop do when a task fails in a list of, say, 50 spawned tasks?
Answer: Apache Hadoop restarts the map or reduce task on another node manager; only if the task fails more than four times is the task killed. The default limit on the maximum attempts for map and reduce tasks is determined by the following properties in the mapred-site.xml file:
• mapreduce.map.maxattempts
• mapreduce.reduce.maxattempts

59. Suppose that in a MapReduce system the HDFS block size is 256 MB and we have three files of sizes 248 KB, 268 MB and 512 MB. How many input splits will the Hadoop framework create?
Answer: Hadoop creates five splits, as follows:
1. 1 split for the 248 KB file
2. 2 splits for the 268 MB file (one of 256 MB and another of 12 MB)
3. 2 splits for the 512 MB file (two splits of 256 MB)

60. Why can't we just keep the file in HDFS and have the application read it, instead of using the distributed cache?
Answer: The distributed cache copies the file to all node managers at the beginning of the job. If a node manager runs 10 or 50 map or reduce tasks, they can all use the same local copy of the file.
If, instead, a file has to be read from HDFS in the job, then every map or reduce task accesses it from HDFS, so a node manager running 50 map tasks reads the file 50 times from HDFS. Accessing the same data from the node manager's local file system is a lot faster than reading it from the HDFS data nodes.

61. What is an uber task in YARN?
Answer: For a small job, the application master may choose to run the tasks in the same JVM as itself, because it judges that the overhead of allocating new containers and running the tasks in them outweighs the gain of running them in parallel, compared with running them sequentially on one node. Such a job is called an uber task.

62. How do you configure uber tasks?
Answer: A job is considered small if, by default, it has fewer than ten mappers and only one reducer, and the input size is less than the size of one HDFS block.
These values can be changed for a job by setting mapreduce.job.ubertask.maxmaps, mapreduce.job.ubertask.maxreduces and
mapreduce.job.ubertask.maxbytes.
Uber tasks can be disabled entirely by setting mapreduce.job.ubertask.enable to false.
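A hedged snippet showing how these properties could be set programmatically on a job's Configuration (the chosen values are only examples):

Configuration conf = new Configuration();
conf.setBoolean("mapreduce.job.ubertask.enable", true);       // allow small jobs to run "uberised"
conf.setInt("mapreduce.job.ubertask.maxmaps", 9);             // at most 9 mappers
conf.setInt("mapreduce.job.ubertask.maxreduces", 1);          // at most 1 reducer
conf.setLong("mapreduce.job.ubertask.maxbytes", 134217728L);  // input no larger than one 128 MB block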

63. What are the three ways to debug a failed MapReduce job?
Answer: The three ways to debug a failed MapReduce job are:
• By using the MapReduce job counters
• Through the YARN web UI
• By checking the task logs (syslogs) for the actual status or error messages

70. What is the significance of heartbeats in the HDFS/MapReduce framework?
Answer: A heartbeat in a master/slave architecture is a signal indicating that a node is alive. A DataNode sends heartbeats to the NameNode, and node managers send their heartbeats to the Resource Manager, to notify the master node that they are still active.

71. Can we rename the output file?
Answer: Yes, we can rename the output file.

72. What are the default input and output file formats in MapReduce jobs?
Answer: If they are not set, the default input and output file formats are text files.

73. Is it possible to create multiple tables in Hive for the same data?
Answer: Yes, it is possible to create multiple tables in Hive for the same data.

74. What kind of Data Warehouse application is suitable for Hive?
Answer: Apache Hive is not a full database; the design limitations of Hadoop and HDFS impose limits on Hive's performance. Hive is built for data warehouse applications, where:
• relatively static data is analyzed,
• fast response times are not required, and
• the data is not changing rapidly.
Hive does not provide the crucial properties required for OLTP (Online Transaction Processing). Hive is good for data warehouse applications, where a large data set is processed for insights, reports, etc.

75. What is the maximum size of string data type supported by Hive?
Answer: 2 GB is the maximum size of string data type supported by Hive.

76. What is MapReduce in Hadoop?
Answer: MapReduce is a framework for processing huge raw data sets using a large number of computers. It processes the raw data in a Map phase and a Reduce phase. The MapReduce programming model makes it easy to process data on a large scale, and it is integrated with HDFS so that processing is distributed across the data nodes of the cluster.

77. What are the key components of Job flow in YARN architecture?
Answer: The following are the key components of the job flow in the YARN architecture:
• A client node, which submits the MapReduce job.
• The YARN node managers, which launch and monitor the tasks of jobs.
• The MapReduce application master, which coordinates the tasks running in the MapReduce job.
• The YARN resource manager, which allocates the cluster resources to jobs.
• The HDFS file system, which is used to share job files between the above entities.

78. What is the importance of Application Master in YARN architecture?
Answer: The Application Master helps negotiate resources from the resource manager and works with the Node Manager(s) to run and monitor the tasks. The Application Master makes requests for containers for all map and reduce tasks; once containers are assigned to tasks, it starts the containers by notifying their Node Managers. It collects progress information from all the tasks, and these values are propagated to the user or client node.

79. After restarting the NameNode, MapReduce jobs that were working fine before the restart started to fail. What may be the reason for this failure?
Answer: The Hadoop cluster may be in safe mode after the restart of the NameNode. The administrator should wait for the NameNode to exit safe mode before restarting the jobs again. This is a mistake very commonly made by Hadoop administrators.

80. What are the things that you need to mention for a MapReduce job?
A. Classes for reducer, mapper, and combiner.
B. Classes for the reducer, partitioner, mapper, and combiner
C. None
D. Classes for mapper and reducer.
Answer: D. classes for the mapper and reducer.

81. How many times combiner will execute?
A. 0, 1, or many times.
B. Can’t be configured
C. At least once.
D. 0 or 1 time.
Answer: A. 0, 1, or many times.

82. What are the types of tables in Apache Hive?
Answer: There are the following two types of tables in Apache Hive:
1. Managed tables
2. External tables

83. What are the components of Hadoop?
Answer: The following are the components of Hadoop:
1. Storage unit–HDFS (NameNode, DataNode)
2. Processing framework– YARN (ResourceManager, NodeManager)

84. What is HDFS?
Answer: The Hadoop Distributed File System (HDFS) is the storage unit of Hadoop, responsible for storing different kinds of data as blocks in a distributed environment. It has a master/slave topology.

85. What are the components of HDFS?
Answer: NameNode: it is the master node in the distributed environment and maintains the metadata information for the blocks of data stored in HDFS, such as block locations, replication factors, etc.

DataNodes are the slave nodes responsible for storing data in HDFS. The NameNode manages all the DataNodes.

86. What are the components of YARN?
Answer: ResourceManager: it receives the processing requests and then passes the parts of the requests to the corresponding Node Managers, where the actual processing takes place. It allocates resources to applications based on their needs.
NodeManager: it is installed on every DataNode and is responsible for the execution of tasks on that DataNode.

• ApplicationMaster – the ApplicationMaster is a per-application component which does not perform any application-specific work itself, because those functions are delegated to the containers. Instead, it is responsible for negotiating resource requirements with the resource manager and working with the NodeManagers to execute and monitor the tasks. The ApplicationMaster is responsible for the specific fault-tolerance behavior of the application: it receives status messages from the ResourceManager when its containers fail, and it can decide to take action based on these events (by asking the ResourceManager to create a new container) or to ignore them.
• Container – a container is an application-specific process created by a NodeManager on behalf of an ApplicationMaster, with a constrained set of resources (memory, CPU, etc.).
• YarnChild – after the application is submitted, the application master dynamically launches YarnChild processes to run the MapReduce tasks.

87. What are the modes in which Hadoop can run?
Answer: Apache Hadoop runs in the following modes:
1. Local (Standalone) Mode – by default Hadoop runs in a single-node, non-distributed mode, as a single Java process. Local mode uses the local file system for input and output operations. It is also used for debugging purposes, and it does not support the use of HDFS. No custom configuration is required in the configuration files in this mode.
2. Pseudo-Distributed Mode – Hadoop also runs on a single node, but each daemon runs in a separate Java process. In pseudo-distributed mode we need to configure all four configuration files. All daemons run on one node, so both the master and the slave node are the same.
3. Fully-Distributed Mode – in this mode, all daemons execute on separate nodes, which form a multi-node cluster. Thus, it allows separate nodes for the master and the slaves.

88. What are the features of Standalone (local) mode?
Answer: In standalone (local) mode Hadoop runs in a single-node, non-distributed fashion, as a single Java process. Local mode uses the local file system for input and output operations; one can also use it for debugging. It does not support the use of HDFS. Standalone mode is suitable only for running programs during development and testing. No custom configuration is required in the configuration files, which are:
core-site.xml
hdfs-site.xml
mapred-site.xml
yarn-site.xml

89. Explain the major difference between an HDFS block and an InputSplit.
Answer: A block is the physical representation of the data, while a split is the logical representation of the data present in the block. A split acts as an intermediary between a block and the mapper.
Suppose a record, say the word "intellipaat", spans two blocks:
Block 1: intell
Block 2: ipaat
A map reading the first block stops at "intell" and does not know how to process the rest of the record, which sits in the second block. The split solves this: it forms a logical grouping of Block 1 and Block 2, so the record is processed as a whole.
The InputFormat and RecordReader form key-value pairs from the InputSplit and send them to the mapper for further processing. If we have limited resources, we can increase the split size to limit the number of maps: for example, if a 640 MB file is stored as 10 blocks of 64 MB each and resources are limited, we can set the split size to 128 MB, which forms logical groups of 128 MB, so only 5 maps execute at a time.
If splitting is not possible for the file (for example, the input format's isSplitable() method returns false), the whole file forms one InputSplit and is processed by a single map, which consumes much more time when the file is big.

90. What are edge or gateway nodes?
Answer: Edge nodes are the interface between the Hadoop cluster and the outside network; they are also referred to as gateway nodes. Edge nodes are used to run client applications and cluster administration tools. They are kept separate from the nodes that run Hadoop services like HDFS, MapReduce, etc., mainly to keep computing resources separate. Running client tools on edge nodes within the cluster also allows centralized management of all the Hadoop configuration entries on the cluster nodes, which helps reduce the amount of administration needed to update the config files.
There is limited security within Hadoop itself. Even if the Hadoop cluster operates in a local- or wide-area network behind an enterprise firewall, we may want to consider a cluster-specific firewall to more fully protect non-public data that may reside in the cluster. In this deployment model, think of the Hadoop cluster as an island within the IT infrastructure – for every bridge to that island we should consider an edge node for security.

91. What is distributed cache?
Answer: The distributed cache in Hadoop is a service provided by the MapReduce framework to cache files when they are required. Once a file is cached for a specific job, Hadoop makes it available on each data node, both on the file system and in memory, where the map and reduce tasks execute. We can then easily access and read the cache file and populate any collection (like an array or hashmap) in our code.

92. What are the Benefits of using distributed cache?
Answer: It distributes simple, read-only text/data files and/or complex types such as jars, archives and others; the archives are un-archived at the slave nodes.
The distributed cache tracks the modification timestamps of the cache files, which ensures that the cached files are not modified while a job is executing.

93. What are the features of Pseudo mode?
Answer: In pseudo mode Hadoop also runs on a single node, but each Hadoop daemon runs in a separate Java process. In pseudo-distributed mode we need to configure all four configuration files. All daemons run on one node and, therefore, both the master and the slave node are the same.
Pseudo mode is suitable both for development and for the testing environment. In pseudo mode, all the daemons run on the same machine.

94. What are the features of Fully-Distributed mode?
Answer: In fully-distributed mode all daemons execute on separate nodes, forming a multi-node cluster, with separate nodes for the master and the slaves.
We use this mode in the production environment, where 'n' machines form a cluster. The Hadoop daemons run on a cluster of machines: there is one host on which the NameNode runs and other hosts on which the DataNodes run. A NodeManager is therefore installed on every DataNode, and it is responsible for the execution of tasks on that DataNode.
The ResourceManager manages all these NodeManagers; it receives the processing requests and passes the parts of each request to the corresponding NodeManagers.

95. What are configuration files in Hadoop?
Answer: core-site.xml – contains configuration settings for the Hadoop core, such as I/O settings that are common to HDFS and MapReduce. It specifies the hostname and port; the most commonly used port is 9000.
hdfs-site.xml – this file contains the configuration settings for the HDFS daemons. hdfs-site.xml also specifies the default block replication and permission checking on HDFS.
mapred-site.xml – in this file a framework name for MapReduce is specified, by setting mapreduce.framework.name.
yarn-site.xml – this file provides configuration settings for the NodeManager and the ResourceManager.

96. What are the limitations of Hadoop?
Answer: Various limitations of Hadoop are:
Problem with small files – Hadoop is not suited to small files, which are a major problem for HDFS. A small file is one significantly smaller than the HDFS default block size of 128 MB. If we store a large number of small files, HDFS cannot handle them well: HDFS is designed to work with a small number of large files rather than a large number of small files, and a huge number of small files overloads the NameNode, since the NameNode stores the namespace of HDFS.
HAR files, sequence files and HBase help overcome the small-files issue.
Processing speed – MapReduce processes large data sets with a parallel and distributed algorithm, performing the Map and Reduce tasks. MapReduce needs a lot of time to perform these tasks, thereby increasing latency; because the data is distributed and processed over the cluster, this increases the time and reduces processing speed.
Support only for batch processing – Hadoop supports only batch processing. It does not process streamed data, and therefore overall performance is slower. The MapReduce framework does not leverage the memory of the cluster to the maximum.
Iterative processing – Hadoop is not efficient for iterative processing, as it does not support cyclic data flow (a chain of stages in which the input to the next stage is the output of the previous stage).
Vulnerable by nature – Hadoop is written in Java, a language that is widely used and therefore heavily exploited by cyber-criminals, which has implicated Hadoop in numerous security breaches.
Security – Hadoop is challenging to manage for complex applications. Hadoop lacks encryption at the storage and network levels, which is a major point of concern. Hadoop supports Kerberos authentication, which is hard to manage.

97. Explain the difference between NameNode, Checkpoint NameNode and BackupNode.
Answer: The NameNode is the core of HDFS. It manages the metadata – the information about which block locations a file maps to and which DataNode each block is stored on; in other words, the data about the data being stored. The NameNode supports a directory-tree-like structure consisting of all the files present in HDFS on a Hadoop cluster. It uses the following files for the namespace:
fsimage file – it keeps track of the latest checkpoint of the namespace.
edits file – it is a log of the changes that have been made to the namespace since the last checkpoint.
The Checkpoint NameNode has a directory structure similar to the NameNode's. It creates checkpoints of the namespace at regular intervals by downloading the fsimage and edits files and merging them in a local directory. The new image produced by the merge is then uploaded back to the NameNode.
There is a similar node, called the Secondary NameNode, but it does not support the 'upload to NameNode' functionality.
The Backup Node provides similar functionality to the Checkpoint node and stays synchronized with the NameNode. It maintains an up-to-date in-memory copy of the file system namespace and does not need to fetch the changes at regular intervals. To create a new checkpoint, the Backup Node only needs to save its current in-memory state to an image file.

98. What are the most common input formats in Hadoop?
Answer: The most common input formats in Hadoop are the following:
1. TextInputFormat: it is the default input format in Hadoop.
2. KeyValueInputFormat: it is used for plain text files where the files are broken into lines.
3. SequenceFileInputFormat: it is used to read sequence files.

99. Define DataNode. How does the NameNode tackle DataNode failures?
Answer: A DataNode is a node where the actual data resides in the file system; it stores the data in HDFS. Each DataNode sends a heartbeat message to notify the NameNode that it is alive. If the NameNode does not receive a message from a DataNode for 10 minutes, it considers the DataNode to be dead or out of place and starts replicating the blocks that were hosted on that DataNode onto other DataNodes. A BlockReport contains the list of all blocks on a DataNode. The system thus starts to replicate the information that was stored on the dead DataNode.
The NameNode manages the replication of data blocks from one DataNode to another. The replicated data is transferred directly between DataNodes; the data never passes through the NameNode.

100. What are the core methods of a Reducer?
Answer: The three core methods of a Reducer are listed below (a full skeleton follows the list):
1. setup(): this method is used to configure various parameters such as the input data size and the distributed cache.
protected void setup(Context context)
2. reduce(): the heart of the reducer; it is called once per key with the associated list of values.
protected void reduce(Key key, Iterable<Value> values, Context context)
3. cleanup(): this method is called only once, at the end of the task, to clean up temporary files.
protected void cleanup(Context context)
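A hedged skeleton showing the three methods in a Reducer written with the new MapReduce API (the class name and the max-value logic are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxValueReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // read configuration parameters, open distributed-cache files, etc.
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;                     // called once per key
        for (IntWritable value : values) {
            max = Math.max(max, value.get());
        }
        context.write(key, new IntWritable(max));
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // release resources, delete temporary files, etc.
    }
}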

101. What is SequenceFile in Hadoop?
Answer: A SequenceFile is a flat file that contains binary key/value pairs, and it is extensively used as a MapReduce I/O format. The map outputs are stored internally as SequenceFiles. SequenceFile provides Reader, Writer and Sorter classes.
The three SequenceFile formats are:
1. Uncompressed key/value records.
2. Record compressed key/value records – values are compressed but keys are not.
3. Block compressed key/value records – both keys and values are collected in ‘blocks’ separately and compressed. The size of the block is configurable.

102. What is Speculative Execution in Hadoop?
Answer:
A limitation of Hadoop is that, by distributing the tasks over several nodes, there is a chance that a few slow nodes limit the rest of the program. There are various reasons why tasks can be slow, and they are not always easy to identify. Instead of identifying and fixing slow-running tasks, Hadoop tries to detect when a task is running slower than expected and then launches an equivalent task as a backup. This backup mechanism in Hadoop is called Speculative Execution.

103. What will happen if we try to run a Hadoop job with an output directory that is already present?
Answer: If we try to run a Hadoop job with an output directory that is already present, it will throw an exception stating that the output directory already exists.

104. What are active and passive "NameNodes"?
Answer: In a High Availability architecture, there are two NameNodes – the Active NameNode and the Passive NameNode.
The Active NameNode is the "NameNode" that works and runs in the cluster.
The Passive NameNode is a standby "NameNode" that has the same data as the active "NameNode".
When the active "NameNode" fails, the passive "NameNode" replaces the active "NameNode" in the cluster. Hence, the cluster is never without a "NameNode" and so it never fails.

105. Why does one remove or add nodes in a Hadoop cluster frequently?
Answer: The utilisation of commodity hardware is one of the attractive features of the Hadoop framework, but it leads to frequent DataNode crashes in a Hadoop cluster. Ease of scaling in accordance with the rapid growth in data volume is another striking feature of the Hadoop framework. Because of these two features, one of the most common tasks of a Hadoop administrator is to add and remove data nodes in a Hadoop cluster.

106. What happens when two clients try to access the same file in HDFS?
Answer: HDFS supports exclusive writes only.
When the first client contacts the NameNode to open the file for writing, the NameNode grants that client a lease to create the file. When the second client tries to open the same file for writing, the NameNode notices that the lease for the file has already been granted to another client and rejects the open request of the second client.

107. What is a checkpoint?
Answer: A checkpoint is a process that takes an FsImage and an edit log and compacts them into a new FsImage. Thus, instead of replaying the edit log, the NameNode can load the final in-memory state directly from the FsImage. This is a far more efficient operation and reduces NameNode startup time. Checkpointing is performed by the Secondary NameNode.

108. How is HDFS fault tolerant?
Answer: When data is stored in HDFS, the NameNode replicates the data to several DataNodes. The default replication factor is 3, and we can change it in the configuration as required. If a DataNode goes down, the NameNode automatically copies the data to another node from the replicas and makes the data available again. This provides fault tolerance in HDFS.

109. Can NameNode and DataNode be a commodity hardware?
Answer: DataNodes are commodity hardware, such as personal computers and laptops, as they store data and are needed in large numbers. The NameNode is the master node, and it stores metadata about all the blocks stored in HDFS. It needs a lot of memory (RAM), so the NameNode needs to be a high-end machine with good memory space.

110. Why do we use HDFS for applications having large data sets and not when there are a lot of small files?
Answer: HDFS is more suitable for large amounts of data in a single file than for small amounts of data spread across multiple files. The NameNode stores the metadata information about the file system in RAM, so the amount of memory puts a limit on the number of files in an HDFS file system: too many files lead to the generation of too much metadata, and storing that metadata in RAM becomes a challenge. As a rule of thumb, the metadata for a file, block or directory takes 150 bytes.

111. How do you define a "block" in HDFS? What are the default block sizes in Hadoop 1 and Hadoop 2, and can the block size be changed?
Answer: Blocks are the smallest continuous locations on the hard drive where data is stored. HDFS stores each file as blocks and distributes them across the Hadoop cluster. Files in HDFS are broken down into block-sized chunks, which are stored as independent units.
Hadoop 1 default block size: 64 MB
Hadoop 2 default block size: 128 MB
Yes, the block size is configurable. The dfs.block.size parameter is used in the hdfs-site.xml file to set the size of a block in a Hadoop environment.

112. What does ‘jps’ command do?
Answer: The 'jps' command helps us check whether the Hadoop daemons are running or not. It shows all the Hadoop daemons, for example the NameNode, DataNode, ResourceManager, NodeManager, etc., that are running on the machine.

113. How do you define “Rack Awareness” in Hadoop?
Answer: Rack Awareness is the algorithm by which the NameNode decides how blocks and their replicas are placed, based on rack definitions, so as to minimize network traffic between racks. With the default replication factor of 3, the policy states that "for every block of data, two copies will exist in one rack and the third copy in a different rack". This rule is known as the "Replica Placement Policy".

114. What is "speculative execution" in Hadoop?
Answer: If a node appears to be executing a task slowly, the master node redundantly executes another instance of the same task on another node. The task that finishes first is accepted and the other one is killed. This process is known as "speculative execution".

115. How can I restart the "NameNode" or all the daemons in Hadoop?
Answer: We can restart the NameNode by the following methods:
1. Stop the NameNode individually using the ./sbin/hadoop-daemon.sh stop namenode command and then start it again using the ./sbin/hadoop-daemon.sh start namenode command.
2. To stop and start all the daemons, use ./sbin/stop-all.sh followed by ./sbin/start-all.sh, which stops all the daemons first and then starts all of them.

116. What is the difference between an "HDFS Block" and an "Input Split"?
Answer: The HDFS block is the physical division of the data, and the input split is the logical division of the data. HDFS divides the data into blocks for storing the blocks together, whereas for processing, MapReduce divides the data into input splits and assigns each split to a mapper function.

117. State the reason why we can’t perform “aggregation” (addition) in mapper? Why do we need the “reducer” for this?
Answer: We cannot perform "aggregation" (addition) in the mapper because sorting does not occur in the "mapper" function. Sorting occurs only on the reducer side, and without sorting aggregation cannot be done.
During "aggregation" we need the output of all the mapper functions, which may not be possible to collect in the map phase, because the mappers may be running on different machines where the data blocks are stored.
If we try to aggregate data at the mapper, it requires communication between all the mapper functions, which may be running on different machines. This consumes high network bandwidth and can cause network bottlenecking.

118. What is the purpose of the "RecordReader" in Hadoop?
Answer: The "InputSplit" defines a slice of work but does not describe how to access it. The "RecordReader" class loads the data from its source and converts it into (key, value) pairs suitable for reading by the "Mapper" task. The RecordReader instance is defined by the InputFormat.

119. How do “reducers” communicate with each other?
Answer: The "MapReduce" programming model does not allow "reducers" to communicate with each other; "reducers" run in isolation.

120. What is a “Combiner”?
Answer: A Combiner is a mini "reducer" that performs the local reduce task. It receives the input from the "mapper" on a particular "node" and sends its output to the "reducer". Combiners help enhance the efficiency of "MapReduce" by reducing the quantum of data that needs to be sent to the "reducers".

121. What do you know about “SequenceFileInputFormat”?
Answer: SequenceFileInputFormat is an input format for reading from sequence files. Sequence files can be generated as the output of other MapReduce tasks, and they are an efficient intermediate representation for data that is passed from one MapReduce job to another.

122. What is the problem with small files in Hadoop?
Answer: Hadoop is not suited to small data. Hadoop HDFS lacks the ability to support the random reading of small files. A small file in HDFS is one smaller than 128 MB, which is the default block size in HDFS. If we store a huge number of small files, HDFS cannot handle them well: HDFS works with a small number of large files for storing large data sets, rather than a larger number of small files. A large number of small files overloads the NameNode, since it stores the namespace of HDFS.

123. Why YARN?
Answer: With older versions of Hadoop we were limited to executing MapReduce jobs only, which was restrictive for graph processing, iterative computing, or any other type of work.
In Hadoop 2 the scheduling pieces of MapReduce were separated out and added to a new component called YARN. YARN does not care about the type of applications that are running, and it does not care about keeping any historical information about execution on the cluster. Thus, YARN can scale beyond the levels of MapReduce.

124. What are the YARN responsibilities?
Answer: YARN is responsible for the following activities:
• Responding to a client's request to create a container. A container is, in essence, a process, with a contract governing the physical resources it is permitted to use.
• Monitoring running containers and terminating them if required.
Containers are terminated if the YARN scheduler wants to free up resources so that containers from other applications can run, or if a container is using more than its allocated resources.

125. What are the benefits YARN brings in to Hadoop?
Answer: The following are the benefits that YARN brings to Hadoop:
1. YARN utilizes resources efficiently; there are no more fixed map and reduce slots. YARN provides a central resource manager, so with YARN we can run multiple applications in Hadoop, all sharing a common pool of resources.
2. Existing MapReduce jobs run on Hadoop 2.0 without any change, i.e. it is backward compatible.
3. YARN has fixed the old MapReduce scalability issue and runs on larger clusters than MapReduce 1.
4. It has opened up Hadoop to other types of distributed applications.

126. Explain how the scalability issue is fixed in YARN.
Answer: In YARN, in contrast to the jobtracker in MapReduce 1, each instance of an application – say, a MapReduce job – has a dedicated application master, which runs for the duration of the application. This model is closer to the original Google MapReduce paper, which describes how a master process is started to coordinate the map and reduce tasks running on a set of workers.
MapReduce 1 hits scalability bottlenecks in the region of 4,000 nodes and 40,000 tasks, because the jobtracker has to manage both jobs and tasks. YARN overcomes these limitations with its split resource manager/application master architecture, which is designed to scale up to 10,000 nodes and 100,000 tasks.

127. What is ResourceManager in YARN?
Answer: The ResourceManager in YARN is the master process, and it arbitrates resources on a Hadoop cluster. It responds to client requests to create containers, and a scheduler determines when and where a container is created.
The ResourceManager has the following components – the Scheduler and the ApplicationsManager:
• Scheduler – the scheduler is responsible for allocating resources.
• ApplicationsManager – the ApplicationsManager is responsible for accepting job submissions, negotiating the first container for executing the application-specific ApplicationMaster, and providing the service to restart the ApplicationMaster container on failure.

128. What is ApplicationMaster in YARN?
Answer: The ApplicationMaster is a per-application component that does not perform any application-specific work itself, as those functions are delegated to the containers. It is responsible for negotiating resource requirements with the resource manager and works with the NodeManagers to execute and monitor the tasks.
The ApplicationMaster is responsible for the specific fault-tolerance behavior of the application. It receives status messages from the ResourceManager when its containers fail, and it decides what action to take based on these events, either by asking the ResourceManager to create a new container or by ignoring the events.

129. What are the scheduling policies available in YARN?
Answer: The YARN scheduler is responsible for scheduling resources to user applications based on a defined scheduling policy. YARN provides three scheduling options:
• FIFO Scheduler – the FIFO scheduler puts application requests in a queue and runs them in the order of submission (first in, first out). Requests for the first application in the queue are allocated first; once its requests have been completed, the next application in the queue is served, and so on.
• Capacity Scheduler – the capacity scheduler keeps a separate dedicated queue for smaller jobs and starts them as soon as they are submitted.
• Fair Scheduler – the fair scheduler dynamically balances and allocates resources between all the running jobs. When the first (large) job starts, it is the only job running, so it gets all the resources in the cluster. When a second (small) job starts, it is allocated half of the cluster resources, so that each job uses its fair share of resources.

130. How do you setup ResourceManager to use CapacityScheduler?
Answer: We can configure the ResourceManager to use the CapacityScheduler by setting the value of the property yarn.resourcemanager.scheduler.class to org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler in the file conf/yarn-site.xml.

131. How do you setup ResourceManager to use FairScheduler?
Answer: We can configure the ResourceManager to use the FairScheduler by setting the value of the property yarn.resourcemanager.scheduler.class to org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler in the file conf/yarn-site.xml.

132. What is Speculative execution?
Answer: A job that runs on a Hadoop cluster is divided into many tasks. In a big cluster some of these tasks can run slowly for various reasons, such as hardware degradation or software misconfiguration. When Hadoop sees a task that has been running for some time and has failed to make as much progress, on average, as the other tasks of the job, it launches a replica of that task. This replica or duplicate execution of a task is referred to as Speculative Execution.
When a task completes successfully, all the duplicate tasks that are still running are killed. So, if the original task completes before the speculative task, the speculative task is killed; if the speculative task finishes first, the original is killed.

133. How do you debug a performance issue or a long running job?
Answer: To debug a performance issue or a long-running job, we follow the steps below:
1. Understand the symptom
2. Analyze the situation
3. Identify the problem areas
4. Propose solution

Scenario 1 – A job with 100 mappers and 1 reducer takes a long time for the reducer to start after all the mappers have completed, and the reducer spends a lot of time copying the map outputs. In this case we can try a couple of things:
1. Add a combiner to reduce the amount of output from the mapper that is sent to the reducer.
2. Enable map output compression, which further reduces the size of the outputs transferred to the reducer.

Scenario 2 – A particular task uses a lot of memory, which causes slowness or failure; we look for ways to reduce the memory usage.
1. Make sure the joins are made in an optimal way with memory usage in mind. For example, in Pig joins, the LEFT-hand-side tables are sent to the reducer first and held in memory, and the RIGHT-most table is streamed to the reducer. Therefore make sure the RIGHT-most table is the largest of the datasets in the join.
2. We can also increase the memory available to the map and reduce tasks by setting mapreduce.map.memory.mb and mapreduce.reduce.memory.mb.

Scenario 3 – Understanding the data helps a lot in optimizing the way we use the datasets in Pig and Hive scripts.
1. If we have smaller tables in a join, they can be sent to the distributed cache and loaded into memory on the map side, so the entire join is done on the map side; this can tremendously improve performance. Look up USING REPLICATED in Pig and hive.auto.convert.join or MAPJOIN in Hive.
2. If the data is already sorted, we can use USING MERGE, which does a map-only join.
3. If the data is bucketed in Hive, we may use hive.optimize.bucketmapjoin or
hive.optimize.bucketmapjoin.sortedmerge, depending on the characteristics of the data.

Scenario 4 – The shuffle process is the heart of a MapReduce program and can be tweaked for performance improvements.
1. If we see lots of records being spilled to disk (check for Spilled Records in the counters in the MapReduce output), we can increase the memory available to the map side of the shuffle by increasing the value of io.sort.mb. This reduces the amount of map output written to disk, so the sorting of the keys is performed in memory.
2. On the reduce side, the merging of the outputs from several mappers can be kept in memory by setting mapred.inmem.merge.threshold to 0.

134. Assume you have Research, Marketing and Finance teams funding 60%, 30% and 10% respectively of your Hadoop Cluster. How will you assign only 60% of cluster resources to Research, 30% to Marketing and 10% to Finance during peak load?
Answer: The capacity scheduler in Hadoop is designed to support this use case. The capacity scheduler supports hierarchical queues, with a capacity defined for each queue.
For this use case, we have to define three queues under the root queue and give each queue an appropriate capacity in percent.
Illustration
The following properties are defined in capacity-scheduler.xml:
<property>
<name>yarn.scheduler.capacity.root.queues</name>
<value>research,marketing,finance</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.research.capacity</name>
<value>60</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.marketing.capacity</name>
<value>30</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.finance.capacity</name>
<value>10</value>
</property>

137. Explain the reliability of Flume-NG data?
Answer: Apache Flume provides a reliable and distributed system for collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.
At the time this answer was written, this work was in progress and informally referred to as Flume NG; it had gone through two internal milestones – NG Alpha 1 and NG Alpha 2 – and a formal incubator release of Flume NG was in the works.
The core concepts of Flume NG are the Event, Flow, Client, Agent, Source, Channel and Sink. These core concepts allow the architecture of Flume NG to achieve this reliability objective.

138. How do you benchmark your Hadoop cluster with tools that come with Hadoop?
Answer: TestDFSIO provides an understanding of the I/O performance of the cluster. It is a read and write test for HDFS and is helpful for identifying performance bottlenecks in the network, the hardware and the setup of the NameNode and DataNodes.
NNBench
NNBench simulates requests to create, read, rename and delete files on HDFS and is useful for load testing the NameNode hardware configuration.
MRBench
MRBench is a test of the MapReduce layer. It loops a small MapReduce job a specific number of times and checks the responsiveness and efficiency of the cluster.
Illustration
TestDFSIO write test with 100 files of 100 MB each.
$ hadoop jar /dirlocation/hadoop-test.jar TestDFSIO -write -nrFiles 100 -fileSize 100
TestDFSIO read test with 100 files of 100 MB each.
$ hadoop jar /dirlocation/hadoop-test.jar TestDFSIO -read -nrFiles 100 -fileSize 100
MRBench test that runs a small test job 50 times.
$ hadoop jar /dirlocation/hadoop-test.jar mrbench -numRuns 50
NNBench test that creates 1000 files using 12 maps and 6 reducers.
$ hadoop jar /dirlocation/hadoop-test.jar nnbench -operation create_write \
-maps 12 -reduces 6 -blockSize 1 -bytesToWrite 0 -numberOfFiles 1000 \
-replicationFactorPerFile 3

139. Assume you are doing a join and you notice that all but one reducer is running for a long time how do you address the problem in Pig?
Answer: Pig collects all of the records for a given key together on a single reducer. In many data sets, a few keys have three or more orders of magnitude more records than the other keys. This results in one or two reducers taking much longer than the rest. To deal with this, Pig provides the skew join.
1. In the first MapReduce job, Pig scans the second input and identifies the keys that have many records.
2. In the second MapReduce job, it does the actual join.
3. For all records except those with the keys identified by the first job, Pig does a standard join.
4. For the records with the identified keys, based on how many records were seen for a given key, those records are split across an appropriate number of reducers.
5. The other input to the join is not split; only the keys in question are split and are replicated to each reducer that contains one of those keys.

140. What is the difference between SORT BY and ORDER BY in Hive?
Answer: ORDER BY performs a total ordering of the query result set. This means that all the data is passed through a single reducer, which can take an unacceptably long time to execute for large data sets.
SORT BY orders the data within each reducer, thereby performing a local ordering, where each reducer's output is sorted. We do not achieve a total ordering on the dataset; total ordering is traded for better performance.
Assume we have a sales table in a company with sales entries from salesmen around the globe, and we want to rank each salesperson by country based on the sales volume in Hive:
Hive supports several analytic functions, and one of them is RANK(), which is designed to perform this operation.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
Illustration
Hive>SELECT
rep_name, rep_country, sales_volume,
rank() over (PARTITION BY rep_country ORDER BY sales_volume DESC) as rank
FROM
salesrep;

141. What is the benefit of using counters in Hadoop?
Answer: Counters are a useful way to gather statistics about a job. Assume we have a 100-node cluster and a job with 100 mappers running on 100 different nodes, and suppose we would like to know each time we see an invalid record in the Map phase. We could add a log message in the Mapper so that each time we see an invalid line we make an entry in the log, but consolidating all the log messages from 100 different nodes is time consuming. Instead, we can use a counter and increment its value every time we see an invalid record. The nice thing about counters is that they provide a consolidated value for the whole job rather than 100 separate outputs.
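A minimal sketch of incrementing such a counter from a mapper, assuming the new (org.apache.hadoop.mapreduce) API; the enum name RecordQuality and the "invalid record" rule are hypothetical placeholders:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ValidatingMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    // Hypothetical counter used only for this illustration.
    public enum RecordQuality { INVALID_RECORDS }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        if (line.trim().isEmpty()) {   // placeholder rule for an "invalid" record
            context.getCounter(RecordQuality.INVALID_RECORDS).increment(1);
            return;
        }
        context.write(new Text(line), new LongWritable(1));
    }
}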

142. What is the difference between an InputSplit and a Block?
Answer: Block is a physical division of data and will not take in to account the logical boundary of records. Meaning we could have a record which started in one block and ends in another block. Where as InputSplit will consider the logical boundaries of records as well.

143. Can you change the number of mappers to be created for a job in Hadoop?
Answer: No. We cannot change the number of mappers to be created for a job in Hadoop. The number of mappers can be determined by the no of input splits.

144. How do you do a file system checks in HDFS?
Answer: FSCK command can be used to do a file system check in HDFS. It is a very useful command for checking the health of the file, block names and block locations.
Illustration
hdfs fsck /dir/hadoop-test -files -blocks -locations

145. What are the parameters of mappers and reducers function?
Answer: The Map and Reduce method signatures tell a lot about the type of input and output the job deals with. Assuming we are using TextInputFormat, the Map function's parameters look as follows (see the sketch after the list):
LongWritable (Input Key)
Text (Input Value)
Text (Intermediate Output Key)
IntWritable (Intermediate Output Value)
The parameters of the Reduce function are:
Text (Intermediate Input Key)
IntWritable (Intermediate Input Value)
Text (Final Output Key)
IntWritable (Final Output Value)
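A minimal word-count style sketch, assuming the new MapReduce API, showing class declarations whose type parameters line up with the parameters listed above; the class names and bodies are illustrative only:
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map:    (LongWritable, Text)  ->  (Text, IntWritable)
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) {
        // emit (word, 1) pairs here
    }
}

// Reduce: (Text, IntWritable)   ->  (Text, IntWritable)
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) {
        // sum the counts for each key here
    }
}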

146. How do you overwrite replication factor?
Answer: The following illustration shows how to overwrite the replication factor.
Illustration
hadoop fs -setrep -w 5 -R hadoop-test
hadoop fs -Ddfs.replication=5 -cp hadoop-test/test.csv hadoop-test/test_with_rep5.csv

147. What are the functions of InputFormat?
Answer: Validate that the input data is present and check the input configuration.
Create InputSplits from blocks.
Create a RecordReader implementation to create key/value pairs from the raw InputSplit. These pairs are sent one by one to the mapper.

148. What is a Record Reader?
Answer: A RecordReader will use the data within the boundaries which are created by the input split to generate key/value pairs. Each of the generated Key/value pair can be sent one by one to their mapper.

149. What is a sequence file in Hadoop?
Answer: A sequence file is used for storing binary key/value pairs. Sequence files support splitting even when the data inside the file is compressed, which is not possible with a regular compressed file. We can choose record-level compression, in which the value of the key/value pair is compressed, or block-level compression, where multiple records are compressed together (see the sketch below).
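A minimal sketch of writing a sequence file with block-level compression, assuming the org.apache.hadoop.io.SequenceFile API and a hypothetical output path:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;

public class SeqFileWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path out = new Path("/tmp/example.seq");          // hypothetical output path
        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(IntWritable.class),
                SequenceFile.Writer.compression(CompressionType.BLOCK)); // block-level compression
        try {
            writer.append(new Text("sample-key"), new IntWritable(1));   // binary key/value pair
        } finally {
            writer.close();
        }
    }
}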

150. What daemons are required to run a Hadoop cluster?
Answer: DataNode, NameNode, JobTracker and TaskTracker are required to run a Hadoop cluster.

151. How would you restart a NameNode?
Answer: The easiest way is to run stop-all.sh to stop all the running daemons and then run start-all.sh, which restarts the NameNode along with the other daemons.

152. What are different schedulers available in Hadoop?
Answer:
1. COSHH: It considers the workload, the cluster and user heterogeneity when making scheduling decisions.
2. FIFO Scheduler: It does not consider heterogeneity; it orders jobs on the basis of their arrival time in the queue.
3. Fair Sharing: It defines a pool for each user. Users use their own pools to execute their jobs.

153. What Hadoop shell commands can be used to perform copy operation?
Answer: The following Hadoop shell commands can be used to perform a copy operation:
hadoop fs -copyToLocal
hadoop fs -put
hadoop fs -copyFromLocal

154. What’s the purpose of jps command?
Answer: It can be used for confirming whether the daemons running Hadoop cluster will be working or not. The output of jps command will reveal the status of DataNode, NameNode, Secondary NameNode, JobTracker and TaskTracker.

155. How many NameNodes can be run on single Hadoop cluster?
Answer: Only one NameNode can run on a single Hadoop cluster.

156. What will happen when the NameNode on the Hadoop cluster is down?
Answer: The file system goes offline, if the NameNode on the Hadoop cluster is down.

157. Detail crucial hardware considerations when deploying Hadoop in a production environment.
Answer:
1. Operating System: 64-bit operating system
2. Capacity: Larger form factors such as 3.5” disks allow more storage and cost less.
3. Network: Two top-of-rack (TOR) switches are required per rack for better redundancy.
4. Storage: To achieve high performance and scalability, it is required to design a Hadoop platform by moving the compute activity to data.
5. Memory: System’s memory requirements may vary based on the application.
6. Computational Capacity: It will be determined by the total count of MapReduce slots existing across nodes within a Hadoop cluster.

158. Which command will you use to determine if the HDFS (Hadoop Distributed File System) is corrupt?
Answer: The Hadoop fsck (File System Check) command is used to determine whether the HDFS is corrupt.

159. How can a Hadoop job be killed?
Answer: A Hadoop job can be killed using the command: hadoop job -kill jobID.

160. Can files be copied across multiple clusters? If yes, how?
Answer: Yes, it is possible using distributed copy. The DistCp command is used for intra- or inter-cluster copying.

161. Recommend the best Operating System to run Hadoop.
Answer: Ubuntu or Linux is the best operating system to run Hadoop. Windows can also be used, but it leads to several problems.

162. How often the NameNode should be reformatted?
Answer: Never. If we reformat the NameNode, it leads to complete data loss. It is formatted only once, at the beginning.

163. What are Hadoop configuration files and where are they located?
Answer: Hadoop has three different configuration files –
1. mapred-site.xml,
2. hdfs-site.xml,
3. core-site.xml – which are located in “conf” sub directory.

164. What are the most common Input Formats defined in Hadoop?
Answer: The following are the most common Input Formats defined in Hadoop:
1. KeyValueInputFormat
2. TextInputFormat
3. SequenceFileInputFormat

165. Can we change the file cached by Distributed Cache in Hadoop?
Answer: No, we cannot change a file cached by the Distributed Cache in Hadoop, because the DistributedCache tracks the cache using timestamps; a cached file should not be changed during job execution.

166. What do you mean by Distributed Cache in mapreduce framework?
Answer: The distributed cache is a very effective feature can provide by the map reduce framework. The Distributed cache can cache archive, text, jars which would be used by application for increasing performance. Application will provide complete information’s of jobconf object to cache.

167. How many modes are supported by Hadoop?
Answer: Hadoop can be used in the following three modes:
1. Fully distributed mode
2. Standalone mode
3. Pseudo-distributed mode

168. How many daemon processes run on a Hadoop system?
Answer: Every daemon runs in its own Java Virtual Machine (JVM).
The following three daemons run on the master nodes:
1. NameNode
2. Secondary NameNode
3. JobTracker
(The DataNode and TaskTracker daemons run on the slave nodes.)

169. What is the difference between a NAS and HDFS?
Answer: In NAS data can be stored on dedicated hardware. In HDFS data blocks can be distributed across local drives of all machines in a cluster.
A NAS is not suitable for MapReduce because data can be stored separately from the computations. HDFS is designed to work with MapReduce system, because computation can be moved to the data.

170. For a key and value class what is the Hadoop MapReduce APIs contract?
Answer: The Hadoop MapReduce API contract for a key and value class is the following:
1. The value class must implement the org.apache.hadoop.io.Writable interface.
2. The key class must implement the org.apache.hadoop.io.WritableComparable interface.

171. How the client interacts with Hadoop Distributed File System?
Answer: The client interacts with the Hadoop Distributed File System using the HDFS API. Client applications talk to the NameNode whenever they need to locate a file, or when they need to add, copy, move or delete a file on HDFS.

172. Is it necessary to write jobs for Hadoop in the Java language?
Answer: No, there are various ways to deal with non-Java code. Hadoop Streaming allows any shell command to be used as a map or reduce function.

173. How to debug Hadoop code?
Answer: There are various methods for debugging Hadoop codes but the most popular methods are:
1. By using web interface given by Hadoop framework.
2. By using Counters.

174. What do you mean by combiner in Hadoop?
Answer: The combiner is a mini-reduce step that operates only on data generated by a Mapper. When the Mapper emits its data, the combiner receives it as input and sends its output to the reducer (see the sketch below).
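A minimal sketch of wiring a combiner into a job driver, assuming the new API and reusing the hypothetical TokenMapper and SumReducer classes sketched under question 145; reusing the reducer as the combiner only works when the reduce logic is commutative and associative, as in a word count:
import org.apache.hadoop.mapreduce.Job;

public class CombinerWiring {
    public static void configure(Job job) {
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);  // runs a local reduce over each mapper's output
        job.setReducerClass(SumReducer.class);
    }
}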

175. Which type of hardware is best for Hadoop?
Answer: Hadoop runs on dual-processor or dual-core machines with 4-8 GB of RAM using ECC memory. It depends on the workflow requirements.

176. Which platform and Java version is required to run Hadoop?
Answer: The Java 1.6.x or higher version is best for Hadoop technology, preferably from the Sun.

177. Who invented Hadoop?
Answer: Hadoop was invented by Doug Cutting and Mike Cafarella.

178. How well does Hadoop scale?
Answer: Hadoop has been demonstrated on clusters of up to 4000 nodes. Sort performance on 900 nodes is good and was achieved using these non-default configuration values:
dfs.block.size = 839810502
dfs.namenode.handler.count = 50
mapred.reduce.parallel.copies = 40
mapred.child.java.opts = -Xmx512m
fs.inmemory.size.mb = 300
io.sort.factor = 200
io.sort.mb = 300
io.file.buffer.size = 170441603
mapred.job.tracker.handler.count = 70
Sort performance on 1400 nodes and 2000 nodes is good too: sorting 14 TB of data on a 1400-node cluster takes 3.2 hours; sorting 20 TB on a 2000-node cluster takes 3.5 hours. The additions to the above configuration are:
mapred.reduce.parallel.copies = 60
tasktracker.http.threads = 60
mapred.child.java.opts = -Xmx1024m

179. Suppose you have Research, Finance and Marketing teams funding 70%, 40% and 20% respectively of your Hadoop cluster. How will you assign 70% of cluster resources to Research, 40% to Finance and 20% to Marketing during peak load?
Answer: The following properties would be assigned in capacity-scheduler.xml:
<property>
<name>yarn.scheduler.capacity.root.queues</name>
<value>research,finance,marketing</value>
</property>

<property>
<name>yarn.scheduler.capacity.root.research.capacity</name>
<value>70</value>
</property>

<property>
<name>yarn.scheduler.capacity.root.finance.capacity</name>
<value>40</value>
</property>

<property>
<name>yarn.scheduler.capacity.root.marketing.capacity</name>
<value>20</value>
</property>

180. What is the main benefit of using counters in Hadoop?
Answer: The main benefit of using counters in Hadoop is collecting statistics about the job.

181. What is the main difference between an InputSplit and a Block?
Answer: The block is a physical division of data and will not take into account the logical boundary of records. InputSplit will consider the logical boundaries of records.

182. Can you change the number of mappers to be defined for a job in Hadoop?
Answer: No, we cannot change the number of mappers to be defined for a job in Hadoop. The number of mappers is considered by the number of input splits.

183. How do you do a file system check in HDFS?
Answer: The fsck command is used to do a file system check in HDFS.

184. How do you overwrite replication factor in Hadoop?
Answer: There are a few ways to overwrite the replication factor in Hadoop:
hadoop fs -setrep -w 5 -R hadoop-test
hadoop fs -D dfs.replication=5 -cp hadoop-meraj/meraj.csv hadoop-meraj/meraj_with_rep5.csv

185. Explain Monad class?
Answer: A Monad class is a class that wraps objects, for example Identity with Unit and Bind with Map. It provides the two operations below:
identity (return in Haskell, unit in Scala)
bind (>>= in Haskell, flatMap in Scala)
Scala does not have a built-in monad type, so we need to model the monad ourselves. However, other Scala libraries such as Scalaz have monads built in and also come with the related theory, such as applicatives, functors, monoids and so on.
A sample program that models a monad with a generic trait in Scala, providing methods like unit() and flatMap(), is below.
trait M[A] {
  def flatMap[B](f: A => M[B]): M[B]
}
def unit[A](x: A): M[A]

186. What is WebDAV in Hadoop?
Answer: WebDAV is a set of extensions to HTTP to support editing and updating files. On most operating systems, WebDAV shares can be mounted as filesystems, so it is possible to access HDFS as a standard filesystem by exposing HDFS over WebDAV.

187. What is the sqoop in Hadoop?
Answer: Sqoop is a tool to transfer data between relational database management systems and Hadoop HDFS. Using Sqoop, data can be imported from an RDBMS such as MySQL or Oracle into HDFS, and data can be exported from HDFS files back to a relational database.

188. What do you mean by Sequencefileinputformat?
Answer: SequenceFileInputFormat is used to read files in sequence. It is a specific compressed binary file format that is optimized for passing data between the output of one MapReduce job and the input of another MapReduce job.

189. What does the conf.setMapperClass do?
Answer: conf.setMapperClass sets the mapper class and everything related to the map job, such as reading the data and generating a key-value pair out of the mapper.

190. What happens in TextInputFormat?
Answer: In TextInputFormat, the value is the content of the line and the key is the byte offset of the line. Every line in the text file is a record.
Example: Key: LongWritable, Value: Text

191. What is the use of Context object?
Answer: The Context object allows the mapper to interact with the rest of the Hadoop system. It includes configuration data for the job as well as interfaces that allow it to emit output.

192. What happens if numbers of reducers are zero?
Answer: If the number of reducers is zero, the output of the map tasks goes directly to the FileSystem, into the output path set by setOutputPath(Path). The Hadoop framework does not sort the map outputs before writing them out to the FileSystem.

193. How many instances of JobTracker can run on a Hadoop cluster?
Answer: Only one instance of JobTracker runs on a Hadoop cluster.

194. What is TaskInstance?
Answer: TaskInstances are the actual MapReduce tasks that run on each slave node. The TaskTracker starts a separate Java Virtual Machine (JVM) process to do the actual work; this is known as a TaskInstance. Each TaskInstance runs in its own JVM process.

195. Can Reducer talk with each other?
Answer: No, Reducer cannot talk with each other because it work in isolation.

196. What command creates a directory in HDFS via the FS shell?
Answer: The command for creating a directory in HDFS via the FS shell is:
bin/hadoop fs -mkdir /<directory_name>

197. What are the advanced features of HDFS?
Answer: Following are the advance features of HDFS
1. Files are stored as blocks
2. Provides reliability through replication
3. Block is stored on slave nodes
4. Operates on top of an existing file system
5. Single NameNode daemon stores metadata and co-ordinates access

198. Explain Big Data and what are five V’s of Big Data?
Answer: Big data is a collection of large and complex data sets that is difficult for processing using relational database management tools or traditional data processing applications.
Following are the five V’s of Big Data
1. Volume: The volume will represent the amount of data that is growing at an exponential rate for example in Petabytes and Exabytes.
2. Velocity: Velocity refers to the rate at which data is growing, which is very fast. Today, yesterday's data is considered old data. Social media is a major contributor to growing data.
3. Variety: Variety will refer to the heterogeneity of data types. The data that are gathered has a variety of formats such as videos, audios, csv, etc. Therefore, these various formats will represent the variety of data.
4. Veracity: Veracity will refer to the data in doubt or uncertainty of data available due to data inconsistency and incompleteness. Data available are sometimes messy and are difficult to trust. Quality and accuracy are difficult to control. The volume is the reason behind for the lack of quality and accuracy in the data.
5. Value: It is all well and good to have access to big data, but unless we can turn it into value it is useless.

199. What are Hadoop’s components?
Answer: Following are Hadoop’s components:
Storage unit– HDFS (NameNode, DataNode)
Processing framework– YARN (ResourceManager, NodeManager)

200. Tell me about the various Hadoop daemons and their roles in a Hadoop cluster.
Answer: Following are the various Hadoop daemons and their roles in a Hadoop cluster:
1. NameNode: It is the master node which is responsible to store the metadata of all the files and directories. It has information regarding blocks, which will make a file, and where those blocks can be located in the cluster.
2. Datanode: It is the slave node which will contain the actual data.
3. Secondary NameNode: It will periodically merge the changes edit log with the FsImage Filesystem Image, which is present in the NameNode. It can store the modified FsImage into persistent storage, which can be used in case of failure of NameNode.
4. ResourceManager: It is the central authority which will manage resources and will schedule applications running on top of YARN.
5. NodeManager: It will run on slave machines. It is responsible to launch the application’s containers where applications will execute their part, monitor their resource, memory, disk, network and report these to the ResourceManager.
6. JobHistoryServer: It will maintain information regarding MapReduce jobs after the Application Master terminates.

201. Compare HDFS with Network Attached Storage (NAS).
Answer: Compare HDFS with Network Attached Storage (NAS) features as follows:
• Network-attached storage (NAS) is a file-level computer data storage server which will connect to a computer network. It can provide data access to a heterogeneous group of clients. NAS can either be a hardware or software that will provide services to store and access files. Whereas Hadoop Distributed File System (HDFS) is a distributed filesystem to store data which uses commodity hardware.
• In HDFS Data Blocks are distributed across all the machines as a cluster. In NAS data will be stored on a dedicated hardware.
• HDFS are designed to work with MapReduce paradigm, where computation can moved to the data. NAS will not be suitable for MapReduce because data can be stored separately from the computations.
• HDFS can use commodity hardware which is cost-effective, and a NAS is a high-end storage devices that includes high cost.

202. List the difference between Hadoop 1 and Hadoop 2.
Answer: To answer this question it is required to mainly focus on two points i.e. Passive NameNode and YARN architecture.
1. In Hadoop 1.x, NameNode is the single point of failure. In Hadoop 2.x, it has Active and Passive NameNodes. If the active NameNode will fail, the passive NameNode will take charge. Therefore availability can be achieved in Hadoop 2.x.
2. In Hadoop 2.x, YARN will provide a central resource manager. With YARN, we will run multiple applications in Hadoop, all sharing a common resource. MRV2 is a particular type of distributed application which will run the MapReduce framework on top of YARN. Other tools will also perform data processing via YARN, that was a problem in Hadoop 1.x.

203. What are active and passive “NameNodes”?
Answer: In the High Availability architecture, there are two NameNodes – an Active NameNode and a Passive NameNode.
• The Active NameNode is the NameNode that works and runs in the cluster.
• The Passive NameNode is a standby NameNode that has data similar to the active NameNode.
When the active NameNode fails, the passive NameNode replaces the active NameNode in the cluster. Thus, the cluster is never without a NameNode and so it never fails.

204. What happens when two clients try to access the same file in the HDFS?
Answer: HDFS will support exclusive write only.
When the first client will contact the NameNode for opening the file for writing, the NameNode will grant a lease to the client for creation of this file. If the second client will try to open the same file to write, the NameNode will notice that the lease for the file is granted to another client, and rejects the open request for the second client.

205. How does NameNode tackle DataNode failures?
Answer: NameNode periodically can receive a Heartbeat signal from each of the DataNode in the cluster, that imply DataNode can function properly.
A block report will contain a list of all the blocks on a DataNode. If a DataNode will fail to send a heartbeat message, after a specific period of time it will be marked as dead.
The NameNode will replicate the blocks of dead node to another DataNode using the replicas will create earlier.

206. What will you do when NameNode is down?
Answer: The NameNode recovery process involves the following steps to make the Hadoop cluster up and running:
A. Use the file system metadata replica (FsImage) to start a new NameNode.
B. Configure the DataNodes and clients so that they acknowledge this new NameNode once it is started.
C. The new NameNode starts serving clients after it has completed loading the last checkpoint FsImage for metadata information and has received enough block reports from the DataNodes.
This recovery process is very time consuming for large Hadoop clusters.
It becomes an even greater challenge in the case of routine maintenance.

207. How is HDFS fault tolerant?
Answer: When data can be stored over HDFS, NameNode will replicate the data to several DataNode. The default replication factor is three. We can change the configuration factor as per requirement. If a DataNode will go down, the NameNode can automatically copy the data to another node from the replicas and will make the data available. This gives fault tolerance in HDFS.

208. Can NameNode and DataNode be commodity hardware?
Answer: DataNodes are commodity hardware such as personal computers and laptops as it will stores data and required in a large number. NameNode is the master node and it will store metadata about all the blocks stored in HDFS. It needs high memory (RAM) space, therefore NameNode required to be a high-end machine with good memory space.

209. How do you define “block” in HDFS? Mention the default block size in Hadoop 1 and in Hadoop 2? Can it be changed?
Answer: Blocks are the smallest continuous location on hard drive where data is stored. HDFS can store each as blocks, and it is distributed across the Hadoop cluster. Files in HDFS will be broken down into block-sized chunks, that can be stored as independent units.
1. Hadoop 1 default block size: 64 MB
2. Hadoop 2 default block size: 128 MB
Yes, the block size can be configured. The dfs.block.size parameter is used in the hdfs-site.xml file to set the size of a block in a Hadoop environment (see the example below).
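Illustration (an assumed hdfs-site.xml entry; in Hadoop 2.x the preferred property name is dfs.blocksize, with the older dfs.block.size name still accepted, and the value is in bytes – 134217728 bytes is 128 MB):
<property>
<name>dfs.blocksize</name>
<value>134217728</value>
</property>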

210. What does ‘jps’ command do?
Answer: The ‘jps’ command can help us for checking, if the Hadoop daemons are running. It will show all the Hadoop daemons example for namenode, datanode, resourcemanager, nodemanager etc. which are running on the machine.

211. Define Rack Awareness in Hadoop?
Answer: Rack Awareness is the algorithm by which the NameNode decides how to place blocks and their replicas, based on rack definitions, to minimize network traffic between DataNodes within the same rack.
Assuming the default replication factor of three, the policy is that “for every block of data, two copies will exist in one rack and the third copy in a different rack”. This rule is known as the “Replica Placement Policy”.

212. How can I restart “NameNode” or all the daemons in Hadoop?
Answer: We can restart NameNode by following methods:
1. We can stop the NameNode individually using the ./sbin/hadoop-daemon.sh stop namenode command and start it again using the ./sbin/hadoop-daemon.sh start namenode command.
2. To stop and start all the daemons, use ./sbin/stop-all.sh and then ./sbin/start-all.sh; this stops all the daemons first and then starts all of them.
These script files reside in the sbin directory inside the Hadoop directory.

213. What is the difference between an “HDFS Block” and an “Input Split”?
Answer: The physical division of the data is called HDFS Block and the logical division of the data is called Input Split. HDFS will divide data in blocks to store the blocks together, whereas To process, MapReduce will divides the data into the input split and can assign it to mapper function.

214. Name the three modes in which Hadoop can run.
Answer: The following modes in which Hadoop can run:
1. Standalone (local) mode: This is the default mode. In this mode, all the components of Hadoop, such as NameNode, DataNode, ResourceManager, and NodeManager, run as a single Java process. This mode uses the local filesystem.
2. Pseudo-distributed mode: A single-node Hadoop deployment is considered as running Hadoop in pseudo-distributed mode. All the Hadoop services, both the master and the slave services, are executed on a single compute node.
3. Fully distributed mode: A Hadoop deployment in which the Hadoop master and slave services run on separate nodes is referred to as fully distributed mode.

215. What is “MapReduce”? Mention the syntax to run a “MapReduce” program?
Answer: It is a framework/programming model that is used to process large data sets over a cluster of computers using parallel programming. The syntax for running a MapReduce program is: hadoop jar hadoop_jar_file.jar /input_path /output_path.

216. What are the main configuration parameters in a “MapReduce” program?
Answer: The main configuration parameters that users need to specify in the MapReduce framework are the following (see the driver sketch after this list):
1. Job’s input locations in the distributed file system
2. Job’s output location in the distributed file system
3. Input format of data
4. Output format of data
5. Class containing the map function
6. Class containing the reduce function
7. JAR file containing the mapper, reducer and driver classes
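A minimal driver sketch, assuming the new API, that sets each of the parameters listed above; TokenMapper and SumReducer refer to the hypothetical classes sketched under question 145:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);                 // 7. JAR containing mapper, reducer and driver

        FileInputFormat.addInputPath(job, new Path(args[0]));     // 1. input location in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // 2. output location in HDFS
        job.setInputFormatClass(TextInputFormat.class);           // 3. input format of data
        job.setOutputFormatClass(TextOutputFormat.class);         // 4. output format of data
        job.setMapperClass(TokenMapper.class);                    // 5. class containing the map function
        job.setReducerClass(SumReducer.class);                    // 6. class containing the reduce function
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}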

217. How do reducers communicate with each other?
Answer: The MapReduce programming model will not allow reducers to communicate with each other. Reducers will run in isolation.

218. What does a “MapReduce Partitioner” do?
Answer: A MapReduce Partitioner will make sure that all the values of a single key go to the same reducer, and allow even distribution of the map output over the reducers. It will redirect the mapper
output to the reducer and determine which reducer is responsible for the particular key.

219. State the reason why we can’t perform “aggregation” (addition) in mapper? Why do we need the “reducer” for this?
Answer: Due to following reason, we can’t perform “aggregation” (addition) in mapper.
1. Because sorting will not occur in the mapper function. Sorting will occur only on the reducer side and without sorting aggregation cannot be done.
2. During aggregation, we required the output of all the mapper functions that will not be possible to collect in the map phase as mappers may be running on the different machine where the data blocks will be stored.
3. We will try to aggregate data at mapper, it will require communication between all mapper functions which may be running on different machines. Therefore, it will consume high network bandwidth and it leads to network bottlenecking.

220. How will you write a custom partitioner?
Answer: A custom partitioner for a Hadoop job can be written easily by following these steps (see the sketch after this list):
1. Create a new class that extends the Partitioner class.
2. Override the getPartition method in the wrapper that runs in MapReduce.
3. Add the custom partitioner to the job with the setPartitioner method, or add the custom partitioner to the job as a config file.
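A minimal sketch of those steps, assuming the new API and a hypothetical key whose country prefix decides the partition:
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Step 1: extend Partitioner; Step 2: override getPartition.
public class CountryPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Hypothetical rule: route keys by the hash of their country prefix ("IN_xxx", "US_xxx", ...).
        String country = key.toString().split("_", 2)[0];
        return (country.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
// Step 3 (in the driver): job.setPartitionerClass(CountryPartitioner.class);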

221. What is a “Combiner”?
Answer: A Combiner is a mini reducer which will perform the local reduce task. It will receive the input from the mapper on a particular node and will send the output to the reducer. Combiners
can help in enhancing the efficiency of MapReduce by reducing the quantum of data which is needed to be sent to the reducers.

222. Explain in detail about Kafka Producer in context to Hadoop?
Answer: Kafka is an open source API cluster to process stream data.
Kafka will Include these Core API’s –
1. Producer API,
2. Consumer API,
3. Streams API,
4. Connect API
The use cases of the Kafka APIs are Website Activity Tracking, Messaging, Metrics, Log Aggregation, Event Sourcing, Stream Processing and Commit Log.
These APIs are mainly used to publish and consume messages using a Java client.
The Apache Kafka Producer API has a class called KafkaProducer that connects to the Kafka broker through its constructor and provides the following methods: send, flush and metrics.
Send Method-
Example: producer.send(new ProducerRecord<byte[],byte[]>(topic, partition, key, value), userCallback);
In the above example code:
ProducerRecord – the record to be sent to the broker; the producer manages a buffer of records waiting to be sent. It takes topic, partition, key and value as parameters.
userCallback – a user callback function executed when the record has been acknowledged by the server. If it is null, there is no callback.
Flush method – this method is used to make sure that all previously sent messages have actually completed.
Example: public void flush()
Metrics – metrics() returns the producer's internal metrics, and partitionsFor() gets the partition metadata for a given topic at runtime, which can also be used for custom partitioning.
Example: public Map metrics()
After executing all the methods, we need to call the close method once the sent requests have completed.
Example: public void close()
Overview of Kafka Producer API’s:
There are two types of producers: Synchronous (Sync) and Asynchronous (Async).
Sync – this producer sends messages directly to the broker while other messages execute in the background.
Example: kafka.producer.SyncProducer
Async – Kafka provides an asynchronous send method for sending a record to a topic. The big difference between Sync and Async is that with Async we have to provide a callback, for example as a lambda expression, that is invoked when the send completes.
Example: kafka.producer.async.AsyncProducer
Example Program-
class Producer
{
/* the data which is partitioned by key to the topic is sent using either the synchronous or the asynchronous producer */
public void send(kafka.javaapi.producer.ProducerData<K,V> producerData);
public void send(java.util.List<kafka.javaapi.producer.ProducerData<K,V>> producerData);
/* the producer to clean up */
public void close();
}

223. What are the benefits of Apache Pig over MapReduce?
Answer: Apache Pig is a platform that is used for analyzing large data sets which represent them as data flows developed by Yahoo. It is designed for giving an abstraction over MapReduce, reducing the complexities of writing a MapReduce program.
• Pig Latin is a high-level data flow language and MapReduce is a low-level data processing paradigm.
• Without writing complex Java implementations in MapReduce, programmers will achieve the same implementations very easily using Pig Latin.
• Apache Pig can reduce the length of the code by approx 20 times according to Yahoo. This will reduce the development period by almost sixteen times.
• Pig will provide many built-in operators for supporting data operations such as joins, filters, ordering, sorting etc. for performing the same function in MapReduce is a humongous task.
• A Join operation in Apache Pig is simple. It is difficult in MapReduce for performing a Join operation between the data sets, because it needs multiple MapReduce tasks are executed sequentially to fulfill the job.
• Pig will provide nested data types such as tuples, bags, and maps which are missing from MapReduce.

224. What are the Hadoop Pig data types?
Answer: Hadoop Pig will run both atomic data types and complex data types.
Atomic data types: These are the basic data types that will be used in all the languages such as int, string, float, long, etc.
Complex Data Types: These will be Bag, Map, and Tuple.

225. List the various relational operators used in “Pig Latin”?
Answer: Following are the various relational operators used in pig Latin:
1. SPLIT
2. LIMIT
3. CROSS
4. COGROUP
5. GROUP
6. STORE
7. DISTINCT
8. ORDER BY
9. JOIN
10. FILTER
11. FOREACH
12. LOAD

226. What is a UDF?
Answer: If some functions will be unavailable in built-in operators, we programmatically create User Defined Functions (UDF) to bring those functionalities using other languages such as Java, Python, Ruby, etc. and embed it in Script file.

227. What is “SerDe” in “Hive”?
Answer: Apache Hive is a data warehouse system, developed by Facebook, that is built on top of Hadoop and is used for analyzing structured and semi-structured data. Hive abstracts the complexity of Hadoop MapReduce.
The “SerDe” interface allows us to instruct Hive how a record should be processed. A SerDe is a combination of a Serializer and a Deserializer. Hive uses SerDe and FileFormat to read and write table rows.

228. Can we use the default “Hive Metastore” by multiple processes at the same time?
Answer: Derby database is the default Hive Metastore. Multiple processes will not access it at the simultaneously. It will be used to perform unit tests.

229. What is the default location where “Hive” stores table data?
Answer: Inside HDFS in /user/hive/warehouse location Hive stores table data by default.

230. What is Apache HBase?
Answer: HBase is an open source, multidimensional, distributed, and scalable and a NoSQL database which is written in Java. HBase will run on top of Hadoop Distributed File System and gives BigTable (Google) such as capabilities to Hadoop. It is designed for providing a fault-tolerant way to store the large collection of sparse data sets. HBase will achieve high throughput and low latency by giving faster Read/Write access on huge datasets.

231. What are the components of Apache HBase?
Answer: HBase has following major components:
Region Server: A table will be divided into several regions. A group of regions can be served to the clients by a Region Server.
HMaster: It will coordinate and will manage the Region Server similar as NameNode manages DataNode in HDFS.
ZooKeeper: Zookeeper will act such as a coordinator inside HBase distributed environment. It will help in maintaining server state inside the cluster to communicate through sessions.

232. What are the components of Region Server?
Answer: The following are components of a Region Server:
WAL: Write Ahead Log (WAL) is a file attached to every Region Server inside the distributed environment. The WAL will store the new data which hasn’t been persisted or committed to the permanent storage.
Block Cache: Block Cache will reside in the top of Region Server. It will store the frequently read data in the memory.
MemStore: It is the write cache and stores all the incoming data before committing it to the disk or permanent memory. There is one MemStore for each column family in a region.
HFile: HFile will be stored in HDFS and will store the actual cells on the disk.

233. Explain “WAL” in HBase?
Answer: Write Ahead Log (WAL) is a file attached to every Region Server inside the distributed environment. The WAL will store the new data that hasn’t been persisted or committed to the permanent storage. It is used in case of failure to recover the data sets.

234. Mention the differences between “HBase” and “Relational Databases”?
Answer: HBase is an open source, multidimensional, distributed, scalable and a NoSQL database is written in Java. HBase will run on top of HDFS and gives BigTable such as capabilities to Hadoop.

HBase | Relational Database
Schema-less | Schema-based database
Column-oriented data store | Row-oriented data store
Used for storing de-normalized data | Used for storing normalized data
Contains sparsely populated tables | Contains thin tables
Automated partitioning | No such provision

235. What is Apache Spark?
Answer: Apache Spark is a framework in distributed computing environment for real-time data analytics. It will execute in-memory computations for increasing the speed of data processing.
For large-scale data processing by exploiting in-memory computations and other optimisations, It is 100x faster than MapReduce.

236. Can you build “Spark” with any particular Hadoop version?
Answer: Yes, we can build “Spark” for a specific Hadoop version.

237. Define RDD.
Answer: RDD stands for Resilient Distributed Datasets – a fault-tolerant collection of operational elements that run in parallel. The partitioned data in an RDD is immutable and distributed, and it is a key component of Apache Spark.

238. What is Apache ZooKeeper and Apache Oozie?
Answer: Apache ZooKeeper will coordinate with various services in a distributed environment. It can save a lot of time by performing synchronization, configuration maintenance, grouping and naming.
Apache Oozie is a scheduler that will schedule Hadoop jobs and will bind them together as one logical work. There are following kind of Oozie jobs:
Oozie Workflow: These Oozie jobs are a sequential set of actions to be executed, like a relay race where each athlete has to wait for the previous athlete to complete their part.
Oozie Coordinator: These Oozie jobs are triggered when data is made available, like the stimulus-response system in our body. Just as we respond to an external stimulus, an Oozie coordinator responds to the availability of data and rests otherwise.

239. How do you configure an “Oozie” job in Hadoop?
Answer: Oozie is integrated with the rest of the Hadoop stack which support several types of Hadoop jobs like Java MapReduce, Streaming MapReduce, Pig, Hive and Sqoop.

240. RDBMS vs Hadoop
Answer:

Name | RDBMS | Hadoop
Data volume | RDBMS does not store and process a large amount of data | Hadoop works well for large amounts of data
Throughput | RDBMS fails to achieve a high throughput | Hadoop can achieve a high throughput
Data variety | The schema of the data is known in RDBMS and it always depends on structured data | Hadoop stores all kinds of data, whether structured, unstructured or semi-structured
Data processing | RDBMS supports OLTP (Online Transactional Processing) | Hadoop supports OLAP (Online Analytical Processing)
Read/Write speed | Reads are fast in RDBMS because the schema of the data is known | Writes are fast in Hadoop because no schema validation is required during an HDFS write
Schema on read vs write | RDBMS follows a schema-on-write policy | Hadoop follows a schema-on-read policy
Cost | RDBMS is licensed software | Hadoop is a free and open-source framework

241. What is the difference between a regular file system and HDFS?
Answer:

Regular File Systems | HDFS
Small block size of data, such as 512 bytes | Large block size, on the order of 64 MB
Multiple disk seeks for large files | Reads data sequentially after a single seek

242. What is Avro Serialization in Hadoop?
Answer: The process of translating the state of objects or data structures into binary or textual form is called serialization. Avro serialization is based on a language-independent schema written in JSON.
Avro provides AvroMapper and AvroReducer for running MapReduce programs.

243. How can you skip the bad records in Hadoop?
Answer: Hadoop provides a feature called the SkipBadRecords class to skip bad records while processing map inputs (see the sketch below).
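A minimal sketch, assuming the older org.apache.hadoop.mapred.SkipBadRecords API where this feature lives; the threshold values are placeholders only:
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SkipBadRecords;

public class SkipBadRecordsExample {
    public static JobConf withSkipping(JobConf conf) {
        // Enter skipping mode after 2 failed attempts of the same task (placeholder value).
        SkipBadRecords.setAttemptsToStartSkipping(conf, 2);
        // Tolerate up to 100 bad records around a failure on the map side (placeholder value).
        SkipBadRecords.setMapperMaxSkipRecords(conf, 100);
        // Tolerate up to 100 bad groups on the reduce side (placeholder value).
        SkipBadRecords.setReducerMaxSkipGroups(conf, 100);
        return conf;
    }
}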

244. What are the features of HDFS?
Answer: Following are the features of HDFS:
1. Supports storage of very large datasets
2. Write once read many access model
3. Streaming data access
4. Replication using commodity hardware
5. HDFS is highly Fault Tolerant
6. Distributed storage

245. What is the HDFS block size?
Answer: The HDFS block size is 128MB for Hadoop 2.x.

246. What is the default replication factor?
Answer: The replication factor is the number of times a file's blocks are replicated (copied) across the cluster. The default replication factor is three.

247. List the various HDFS Commands?
Answer: The Various HDFS Commands are listed bellow:
1. mkdir
2. ls
3. put
4. copyFromLocal
5. get
6. copyToLocal
7. cat
8. mv
9. cp

248. Compare HDFS (Hadoop Distributed File System) and NAS (Network Attached Storage)?
Answer:

HDFS | NAS
Distributed file system that stores data using commodity hardware | File-level computer data storage server connected to a computer network, giving network access to a heterogeneous group of clients
Uses commodity hardware, which is cost-effective | High-end storage device, which comes with a high cost
Works with the MapReduce paradigm | Not suitable for MapReduce

249. What are the limitations of Hadoop 1.0?
Answer: Following are the limitation of Hadoop 1.0
1. NameNode: No Horizontal Scalability and No High Availability
2. Job Tracker: Overburdened.
3. MRv1: Only understand Map and Reduce tasks

250. Compare Hadoop 1.x and Hadoop 2.x
Answer:

Name | Hadoop 1.x | Hadoop 2.x
1. NameNode | The NameNode is a single point of failure | Has both Active and Passive NameNodes
2. Processing | MRV1 (JobTracker and TaskTracker) | MRV2/YARN (ResourceManager and NodeManager)

251. List the different types of Hadoop schedulers.
Answer: Following are the types of Hadoop Schedulers:
1. Hadoop FIFO Scheduler
2. Hadoop Fair Scheduler
3. Hadoop Capacity Scheduler

252. How to keep an HDFS cluster balanced?
Answer: Balancer Tools are used to keep an HDFS cluster balanced. This tool will try to subsequently even out the block data distribution across the cluster.

253. What is DistCp?
Answer: DistCp tool is used for copying large amounts of data to and from Hadoop file systems in parallel. It will use MapReduce to effect its distribution, reporting, recovery, and error handling.

254. What is HDFS Federation?
Answer: HDFS Federation will enhance the present HDFS architecture through a clear separation of namespace and storage when enable a generic block storage layer. It will provide multiple namespaces in the cluster for improving scalability and isolation.

255. What is RAID?
Answer: RAID stands for redundant array of independent disks. It is a data storage virtualization technology used to improve performance and data redundancy by combination of multiple disk drives into a single entity.

256. Does Hadoop requires RAID?
Answer: For the DataNodes in Hadoop, RAID is not required because storage redundancy is achieved by replication between the nodes. For the NameNode's disks, RAID is required.

257. What is rack-aware replica placement policy?
Answer: Rack Awareness is the algorithm used by the NameNode to reduce network traffic when reading/writing HDFS files in a Hadoop cluster. The NameNode chooses DataNodes that are on the same rack or a nearby rack for a read/write request. This concept of choosing closer DataNodes based on rack information is called Rack Awareness.
If the replication factor is three for data blocks on HDFS, then for every block of data two copies are stored on the same rack, while the third copy is stored on a different rack. This rule is called the Replica Placement Policy.

258. What is the main purpose of the Hadoop fsck command?
Answer: The Hadoop fsck command is used to check the HDFS file system. The following arguments can be passed with this command to provide different results:
1. Hadoop fsck / -files: Displays all the files in HDFS while checking.
2. Hadoop fsck / -files -blocks: Displays all the blocks of the files while checking.
3. Hadoop fsck / -files -blocks -locations: Displays all the files block locations while checking.
4. Hadoop fsck / -files -blocks -locations -racks: Displays the networking topology for data-node locations.
5. Hadoop fsck -delete: Deletes the corrupted files in HDFS.
6. Hadoop fsck -move: Moves the corrupted files to a particular directory.

259. What is the purpose of a DataNode block scanner?
Answer: The purpose of the DataNode block scanner is to operate and periodically check all the blocks which will be stored on the DataNode. If bad blocks will be detected it can be fixed before any client reads.

260. What is the purpose of dfsadmin tool?
Answer: Following are the purpose of dfsadmin tool:
dfsadmin tool is used to examine the HDFS cluster status.
dfsadmin – report command will produce useful information regarding basic statistics of the cluster like DataNodes and NameNode status, disk capacity configuration, etc.
It will perform all the administrative tasks on the HDFS.

261. What is the command used for printing the topology?
Answer: The hdfs dfsadmin -printTopology command is used to print the topology. It displays a tree of racks and the DataNodes attached to each rack.

262. List the various site-specific configuration files available in Hadoop?
Answer: Following are the various site-specific configuration files available in Hadoop:
1. conf/Hadoop-env.sh
2. conf/yarn-site.xml
3. conf/yarn-env.sh
4. conf/mapred-site.xml
5. conf/hdfs-site.xml
6. conf/core-site.xml

263. What is the main functionality of NameNode?
Answer: Following are the main functionality of NameNode:
Namespace – Manages metadata of HDFS.
Block Management – Processes and manages the block reports and its location.

264. Which command is used to format the NameNode?
Answer: $ hdfs namenode -format

265. Explain the way how a client application interacts with the NameNode?
Answer: Client applications use the Hadoop HDFS API to contact the NameNode when they need to copy/move/add/locate/delete a file.
The NameNode responds to a successful request by returning a list of relevant DataNode servers where the data resides.
The client then talks directly to a DataNode once the NameNode has given the location of the data.

266. What is MapReduce and list its features?
Answer: MapReduce is a programming model which is used for processing large datasets on the clusters with parallel and distributed algorithms.
The syntax to run a MapReduce program is:
hadoop jar hadoop_jar_file.jar /input_path /output_path

267. What are the features of MapReduce?
Answer: Following are the features of MapReduce:
1. Automatic parallelization and distribution.
2. Built-in fault-tolerance and redundancy are available.
3. MapReduce Programming model is language independent
4. Distributed programming complexity is hidden
5. Enable data local processing
6. Manages all the Inter-Process Communication

268. What do MapReduce framework consists of?
Answer: MapReduce framework is used for writing applications to process large data in parallel on large clusters of commodity hardware.
It consists of:
1. ResourceManager (RM)
Global resource scheduler
One master RM
2. NodeManager (NM)
One slave NM per cluster-node.
3. Container
RM creates Containers upon request by AM
The application runs in one or more containers
4. ApplicationMaster (AM)
One AM per application
Runs in Container

269. What are the two main components of ResourceManager?
Answer:
1. Scheduler: It will allocate the resources to various running applications which is based on resource availability and configured shared policy.
2. ApplicationManager: Responsible to manage a collection of submitted applications

270. What is a Hadoop counter?
Answer:
Hadoop Counters will measure the progress or track the number of operations which will occur within a MapReduce job. Counters will be useful to collect statistics about MapReduce job for application-level or quality control.

271. What are the main configuration parameters for a MapReduce application?
Answer:
The job configuration will require the following:
1. Job’s input and output locations in the distributed file system
2. The input format of data
3. The output format of data
4. Class will contain the map function and reduce function
5. JAR file contain the reducer, driver, and mapper classes

272. What are the steps involved to submit a Hadoop job?
Answer:
Following Steps are involved in Hadoop job submission:
1. Hadoop job client submits the job jar/executable and configuration to the ResourceManager.
2. ResourceManager then distributes the software/configuration to the slaves.
3. The ResourceManager then schedules the tasks and monitors them.
4. Finally, job status and diagnostic information are provided to the client.

273. How can MapReduce framework view input internally?
Answer: It views the input data set as a set of key-value pairs and processes the map tasks in a completely parallel manner.

274. What are the basic parameters of Mapper?
Answer:
Following are the basic parameters of Mapper:
1. LongWritable and Text
2. Text and IntWritable

275. Why aggregation cannot be performed in Mapperside?
Answer: Aggregation needs sorting of data that occurs only at Reducer side therefore we cannot perform It in mapperside.
We required the output from all the mapper functions for aggregation that is not possible during the map phase because map tasks run in different nodes, where data blocks are present.

276. How do reducers communicate with each other in Hadoop?
Answer: Reducers never communicate with each other in Hadoop because it always run in isolation and Mapreduce programming paradigm will not allow them to communicate with each other.

277. What is Identity Mapper?
Answer: Identity Mapper is a default Mapper class that will automatically work, if no Mapper is specified in the MapReduce driver class. It will implement mapping input directly into the output. IdentityMapper.class are used as a default value when JobConf.setMapperClass is not set.

278. What is the purpose of MapReduce Partitioner in Hadoop?
Answer: The MapReduce Partitioner will manage the partitioning of the key of the intermediate mapper output. It will make sure that all the values of a single key pass to similar reducers by allowing the even distribution over the reducers.

279. How will you write a custom partitioner for a Hadoop MapReduce job?
Answer: The following method can be used to write a custom partitioner for a Hadoop MapReduce job:
1. Build a new class that extends the Partitioner class.
2. Override the getPartition method in the wrapper.
3. Add the custom partitioner to the job as a config file
or by using the setPartitioner method.

280. What is a Combiner?
Answer: A Combiner is a semi-reducer which will execute the local reduce task. It can receive inputs from the Map class and will pass the output key-value pairs to the reducer class.

281. What are Writables and explain its importance in Hadoop?
Answer: Writables are interfaces in Hadoop. They act as wrapper classes for almost all the primitive data types of Java.
A Writable is a serializable object that implements a simple and efficient serialization protocol, based on DataInput and DataOutput.
Writables are used to create serialized data types in Hadoop (see the sketch below).
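A minimal sketch of a custom Writable, assuming a hypothetical record with two fields; write() and readFields() must serialize the fields in the same order:
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class PageViewWritable implements Writable {   // hypothetical record type
    private long timestamp;
    private int viewCount;

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(timestamp);   // serialize the fields in a fixed order
        out.writeInt(viewCount);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        timestamp = in.readLong();  // deserialize in the same order
        viewCount = in.readInt();
    }
}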

282. What is the reason for comparison of types is important for MapReduce?
Answer: It is important to MapReduce as in the sorting phase the keys will be compared with one another.
For Comparison of types, WritableComparable interface is implemented.

283. What are the methods used for restarting the NameNode in Hadoop?
Answer: The methods used to restart the NameNodes are the following:
We can use /sbin/hadoop-daemon.sh stop namenode command to stop the NameNode individually and then start the NameNode using /sbin/hadoop-daemon.sh start namenode.
Use /sbin/stop-all.sh and then use /sbin/start-all.sh command to stop all the demons first and then start all the daemons. These script files will be stored in the sbin directory inside the Hadoop directory store.

284. State the difference between an HDFS Block and MapReduce Input Split?
Answer: HDFS Block is the physical division of the disk that has the minimum amount of data which can be read/write, while MapReduce InputSplit is the logical division of data will be created by the InputFormat specified in the MapReduce job configuration.
HDFS will divide data into blocks, whereas MapReduce can divide data into input split and empower them to mapper function.

285. What are the different modes in which Hadoop can run?
Answer:
Standalone mode (local mode) – This is the default mode in which Hadoop is configured to run. All the components of Hadoop, such as DataNode, NameNode, etc., run as a single Java process, which is useful for debugging.
Pseudo-distributed mode (single-node cluster) – Hadoop can run on a single node in pseudo-distributed mode. Each Hadoop daemon runs in a separate Java process in pseudo-distributed mode, whereas in local mode all components operate within a single Java process.
Fully distributed mode (multi-node cluster) – All the daemons are executed on separate nodes that form a multi-node cluster in fully distributed mode.

286. What is Apache Pig?
Answer: Apache Pig is a high-level scripting platform used to create programs that run on Apache Hadoop. The language used on this platform is called Pig Latin.
Pig scripts are compiled into Hadoop jobs that run on execution engines such as MapReduce, Apache Tez, or Apache Spark.

287. What is Apache Hive?
Answer: Apache Hive offers a database-query interface to Apache Hadoop. It reads, writes, and manages large datasets that reside in distributed storage and are queried through SQL syntax.

288. Where do Hive stores table data in HDFS?
Answer:
/user/hive/warehouse is the default location where Hive stores the table data in HDFS.

289. How can you configure “Oozie” job in Hadoop?
Answer: Oozie is integrated with the rest of the Hadoop stack and supports several types of Hadoop jobs, such as streaming MapReduce, Java MapReduce, Sqoop, Hive, and Pig. An Oozie job is configured through a workflow definition (workflow.xml) together with a job.properties file.

290. What is an Apache Flume?
Answer: Apache Flume is a service/tool/data ingestion mechanism which is used for collecting, aggregating, and transferring massive amounts of streaming data like events, log files, etc., from various web sources to a centralized data store where they will be processed together.
It is a highly reliable, distributed, and configurable tool which is specially designed for transferring streaming data to HDFS.

291. List the Apache Flume features.
Answer: Following are the features of Apache Flume:
1. It is fault-tolerant and robust.
2. It scales horizontally.
3. It collects high-volume data streams in real time.
4. Streaming data is gathered from multiple sources into Hadoop for analysis.
5. It ensures guaranteed data delivery.

292. What is the use of Apache Sqoop in Hadoop?
Answer: Apache Sqoop is a tool used for transferring massive amounts of data between Apache Hadoop and external datastores such as relational databases and enterprise data warehouses.

293. Where Hadoop Sqoop scripts are stored?
Answer: Hadoop Sqoop scripts are stored in /usr/bin/sqoop on packaged installations.

294. Why do we need Hadoop?
Answer: Hadoop is needed to deal with the challenges of big data:
1. Storage – storing huge amounts of data is very difficult.
2. Security – securing big data is a challenge.
3. Analytics – analysing big data is difficult because we often do not know the kind of data we are dealing with.
4. Data quality – because the data is big, it tends to be messy, inconsistent, and incomplete.
5. Discovery – algorithms for finding patterns and insights are very difficult to design.
Apache Hadoop can store huge files in their raw form without requiring a schema, and it offers:
1. High scalability – any number of nodes can be added, which improves performance dramatically.
2. Reliability – data is stored reliably on the cluster even when machines fail.
3. High availability – data remains available even when hardware fails; if a machine or disk crashes, the data can be accessed from another path.
4. Economy – Hadoop runs on a cluster of commodity hardware, which is not very expensive.

295. What are configuration files in Hadoop?
Answer:
Core-site.xml – It contains the configuration settings for Hadoop core, such as the I/O settings common to HDFS and MapReduce. It specifies the hostname and port of the default file system; port 9000 is commonly used.
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

hdfs-site.xml – This file will contain the configuration setting for HDFS daemons. hdfs-site.xml will specify default block replication and permission checking on HDFS.
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>

mapred-site.xml – We will specify a framework name for MapReduce in this file . It can be specified by setting the mapreduce.framework.name.
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>

yarn-site.xml – This file will provide configuration setting for NodeManager and ResourceManager.
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
</configuration>

296. Compare Hadoop 2 and Hadoop 3?
Answer:
1. Java 7 is the minimum supported version in Hadoop 2; Java 8 is the minimum in Hadoop 3.
2. Hadoop 2 handles fault tolerance through replication, which wastes space, while Hadoop 3 can handle fault tolerance through erasure coding.
3. For data balancing, Hadoop 2 uses the HDFS balancer, whereas Hadoop 3 adds an intra-DataNode balancer.
4. In Hadoop 2, some default ports fall in the Linux ephemeral port range, so they can fail to bind at startup; in Hadoop 3 these ports have been moved out of the ephemeral range.
5. In Hadoop 2, HDFS has 200% storage overhead (3x replication), while Hadoop 3 with erasure coding has about 50% overhead.
6. Hadoop 2 overcomes the NameNode single point of failure (SPOF) with a standby NameNode, so the cluster recovers automatically when the active NameNode fails; Hadoop 3 extends this by supporting more than one standby NameNode.

297. Explain Data Locality in Hadoop?
Answer: Cross-switch network traffic is a major concern in Hadoop because of the huge volumes of data involved. This drawback is overcome by data locality:
Data locality is the ability to move the computation close to where the actual data resides on a node, instead of moving large amounts of data to the computation. Data locality increases the overall throughput of the system.
HDFS stores the datasets in Hadoop: datasets are divided into blocks and stored across the DataNodes in the cluster. When a user runs a MapReduce job, the NameNode sends the MapReduce code to the DataNodes that hold the data relevant to that job.

298. What are the three categories of data locality?
Answer: Data local – The data is on the same node as the mapper that works on it, so the data is as close as possible to the computation. This is the most preferred scenario.
Intra-rack – The mapper runs on a different node but on the same rack. It is not always possible to execute the mapper on the same DataNode because of constraints.
Inter-rack – The mapper runs on a different rack, because it is not possible to execute the mapper on a node in the same rack due to resource constraints.

299. What is Safemode in Hadoop?
Answer: Safemode in Apache Hadoop is a maintenance state of the NameNode, during which the NameNode does not allow any modifications to the file system. In Safemode the HDFS cluster is read-only, and blocks are neither replicated nor deleted. At startup the NameNode:
Loads the file system namespace from the last saved FsImage into its main memory, together with the edits log file.
Merges the edits log file into the FsImage, producing a new file system namespace.
Then receives block reports containing block location information from all DataNodes.
In Safemode, the NameNode collects these block reports from the DataNodes. The NameNode enters Safemode automatically during startup and leaves it once the DataNodes have reported that most blocks are available.
Use the commands:
hadoop dfsadmin -safemode get: to know the status of Safemode
bin/hadoop dfsadmin -safemode enter: to enter Safemode
hadoop dfsadmin -safemode leave: to come out of Safemode
The NameNode front page shows whether Safemode is on or off.

300. How is security achieved in Hadoop?
Answer: Apache Hadoop achieves security by using Kerberos.
There are three steps that a client must take to access a service when using Kerberos, each of which involves a message exchange with a server:
1. Authentication – The client authenticates itself to the Authentication Server and receives a timestamped Ticket-Granting Ticket (TGT).
2. Authorization – The client uses the TGT to request a service ticket from the Ticket-Granting Server.
3. Service Request – The client uses the service ticket to authenticate itself to the server that provides the service.

301. What is throughput in Hadoop?
Answer: Throughput is the amount of work done per unit of time.
HDFS provides good throughput for the following reasons:
1. HDFS follows the write-once, read-many model. This simplifies data-coherency issues, because data written once cannot be modified, and therefore provides high-throughput data access.
2. Hadoop works on the data locality principle, which states that computation is moved to the data instead of data to the computation. This reduces network congestion and thus enhances the overall system throughput.

302. What is fsck?
Answer: fsck is the file system check used by Hadoop HDFS to look for various inconsistencies. It reports problems with files in HDFS, such as missing blocks or under-replicated blocks. It is different from the traditional fsck utility for native file systems: it does not correct the errors it detects.

Normally the NameNode automatically corrects most recoverable failures. The filesystem check ignores open files, but it has an option to include all files in the report. The HDFS fsck command is not a Hadoop shell command; it is run as bin/hdfs fsck. The filesystem check can run on the whole file system or on a subset of files.
Usage:
hdfs fsck <path>
[-list-corruptfileblocks |
[-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]] [-includeSnapshots]
path – start checking from this path.
-delete – delete corrupted files.
-files – print out the files being checked.
-files -blocks – print out the block report.
-files -blocks -locations – print out the locations for every block.
-files -blocks -racks – print out the network topology for the DataNode locations.
-includeSnapshots – include snapshot data if the given path indicates or contains a snapshottable directory.
-list-corruptfileblocks – print the list of missing blocks and the files they belong to.

303. How to debug Hadoop code?
Answer: Following are the steps to debug Hadoop code:
1. Check the list of MapReduce jobs that are currently running.
2. Check whether any orphaned jobs are running; if so, determine the location of the ResourceManager logs.
Run: “ps -ef | grep -i ResourceManager”
3. Look for the log directory in the output and find the job ID in the displayed list.
4. Check whether any error message is associated with that job.
5. On the basis of the RM logs, identify the worker node that was involved in the execution of the task.
6. Log in to that node and run: “ps -ef | grep -i NodeManager”
7. Examine the NodeManager log.
8. The majority of errors come from the user-level logs for each map-reduce job.

304. What does hadoop-metrics.properties file do?
Answer: Metrics are the statistical information exposed by the Hadoop daemons. The Hadoop framework uses them for performance tuning, monitoring, and debugging.
By default, a wide range of metrics is available, which makes them useful for troubleshooting.
The Hadoop framework uses hadoop-metrics.properties for performance reporting; it also controls how Hadoop reports metrics. The API provides an abstraction, so it can be implemented on top of a variety of metrics client libraries, and different modules within the same application may use different metrics implementation libraries.
This file is located inside /etc/hadoop.

305. How Hadoop’s CLASSPATH plays a vital role in starting or stopping in Hadoop daemons?
Answer: CLASSPATH includes all the directories containing the jar files required to start or stop the Hadoop daemons.
HADOOP_HOME/share/hadoop/common/lib contains all the utility jar files. We are not able to start or stop the Hadoop daemons if CLASSPATH is not set.
CLASSPATH is set inside the /etc/hadoop/hadoop-env.sh file. The next time Hadoop runs, the CLASSPATH is added automatically; that is, we do not need to add it to the parameters each time we run Hadoop.

306. What are the different commands used to startup and shutdown Hadoop daemons?
Answer:
1. To start all the Hadoop daemons use: ./sbin/start-all.sh.
Then, to stop all the Hadoop daemons use: ./sbin/stop-all.sh
2. We can also start all the DFS daemons together using ./sbin/start-dfs.sh, the YARN daemons together using ./sbin/start-yarn.sh, and the MR Job History Server using ./sbin/mr-jobhistory-daemon.sh start historyserver. Then, to stop these daemons we can use:
./sbin/stop-dfs.sh
./sbin/stop-yarn.sh
./sbin/mr-jobhistory-daemon.sh stop historyserver
3. Finally, the last way is to start all the daemons individually, and then stop them individually:
./sbin/hadoop-daemon.sh start namenode
./sbin/hadoop-daemon.sh start datanode
./sbin/yarn-daemon.sh start resourcemanager
./sbin/yarn-daemon.sh start nodemanager
./sbin/mr-jobhistory-daemon.sh start historyserver

307. What is configured in /etc/hosts and what is its role in setting Hadoop cluster?
Answer:
The /etc/hosts file contains the hostnames and the IP addresses of the hosts; it maps each IP address to a hostname. In a Hadoop cluster, we store the hostnames of the master and all slaves together with their IP addresses in /etc/hosts, so that we can conveniently use hostnames instead of IP addresses.

308. How is the splitting of file invoked in Hadoop framework?
Answer: The input files store the data for a Hadoop MapReduce task, and they typically reside in HDFS. The InputFormat defines how these input files are split and read. It is also responsible for creating the InputSplits, the logical representation of the data, and for dividing each split into records; the mapper then processes each record, which is a key-value pair. The Hadoop framework invokes the splitting of the files by calling the getSplits() method of the InputFormat class (for example FileInputFormat) configured by the user, as sketched below.
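A hedged driver sketch of where splitting fits in (the input path is hypothetical); the framework, not user code, calls getSplits() at submission time:
// The driver only chooses the InputFormat and the input paths.
Job job = Job.getInstance(conf, "split demo");
job.setInputFormatClass(TextInputFormat.class);
FileInputFormat.addInputPath(job, new Path("/data/input"));
// On submit(), the job client asks TextInputFormat.getSplits(job) for the
// list of InputSplits (roughly one per block) and launches one mapper per split.
job.submit();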

309. What is configured in /etc/hosts and what is its role in setting Hadoop cluster?
Answer:
The /etc/hosts file contains the hostname and the IP address of each host; it maps IP addresses to hostnames. In a Hadoop cluster, we store all the hostnames (master and slaves) with their IP addresses in /etc/hosts, so that hostnames can be used instead of IP addresses.

310. Is it possible to provide multiple input to Hadoop? If yes then how?
Answer: Yes, it is possible to provide multiple inputs to Hadoop by using MultipleInputs class.
Example:
If we have weather data from the UK Met Office and we want to combine it with the NCDC data for our maximum temperature analysis, we can set up the input as follows:
MultipleInputs.addInputPath(job,ncdcInputPath,TextInputFormat.class,MaxTemperatureMapper.class);
MultipleInputs.addInputPath(job,metofficeInputPath,TextInputFormat.class, MetofficeMaxTemperatureMapper.class);

The above code replaces the usual calls to FileInputFormat.addInputPath() and job.setMapperClass(). Both the Met Office and the NCDC data are text-based, so we use TextInputFormat for each. We use two different mappers because the two data sources have different line formats: MaxTemperatureMapper reads the NCDC input data and extracts the year and temperature fields, while MetofficeMaxTemperatureMapper reads the Met Office input data and extracts the same fields.

311. Is it possible to have hadoop job output in multiple directories? If yes, how?
Answer: Yes, it is possible using the following approaches:
a. Using the MultipleOutputs class –
This class simplifies writing output data to multiple outputs.
MultipleOutputs.addNamedOutput(job, "OutputFileName", OutputFormatClass, keyClass, valueClass);
The API provides two overloaded write methods to achieve this:
MultipleOutputs.write("OutputFileName", new Text(key), new Text(value));
We can use the overloaded write method with an extra parameter for the base output path, which allows the output files to be written to separate output directories:
MultipleOutputs.write("OutputFileName", new Text(key), new Text(value), baseOutputPath);
The baseOutputPath needs to be changed for each category accordingly.
b. Renaming/moving the files in the driver class –
This is the simplest hack for writing output to multiple directories: write all the output files to a single directory and then, in the driver, move or rename them into separate directories. The file names need to be different for each category.

312. Why is block size set to 128 MB in Hadoop HDFS?
Answer: A block is a contiguous location on the hard drive where data is stored; a file system stores data as a collection of blocks. HDFS stores each file as blocks and distributes them across the Hadoop cluster. In HDFS, the default block size is 128 MB, which can be configured as required. The block size is set to 128 MB:
To reduce disk seeks (IO): the larger the block size, the fewer the file blocks and the fewer the disk seeks, and the transfer of a block can still be done within respectable limits and in parallel.
HDFS deals with huge data sets, for example terabytes and petabytes of data. If we took a 4 KB block size, as in a Linux file system, we would have far too many blocks and therefore far too much metadata. Managing this huge number of blocks and metadata would create huge overhead and traffic, which is something we do not want. Therefore, the block size is set to 128 MB.
On the other hand, the block size should not be so large that the system has to wait a very long time for the last unit of data processing to finish its work.

313. Can multiple clients write into an HDFS file concurrently?
Answer: Multiple clients cannot write into an HDFS file at the same time. Apache Hadoop HDFS follows a single-writer, multiple-reader model. The client that opens a file for writing is granted a lease by the NameNode. Suppose some other client wants to write into that file: it asks the NameNode for the write operation, and the NameNode first checks whether it has already granted the lease for that file to someone else. If someone else already holds the lease, the write request of the second client is rejected.

314. How is indexing done in HDFS?
Answer: Hadoop stores the data according to the block size. HDFS keeps, with the last part of the data, a pointer that states where the next part of the data is located. This is basically how indexing is done in HDFS.

315. How to copy a file into HDFS with a different block size to that of existing block size configuration?
Answer:
We can copy a file into HDFS with a different block size by using:
-D dfs.blocksize=block_size, where block_size is in bytes.
Example:
Suppose we want to copy a file called test.txt, of size 128 MB, into HDFS, and for this file we want the block size to be 32 MB (33554432 bytes) instead of the default (128 MB). We would issue the following command:
hadoop fs -D dfs.blocksize=33554432 -copyFromLocal /home/dataflair/test.txt /sample_hdfs
Now, we can check the HDFS block size associated with this file with:
hadoop fs -stat %o /sample_hdfs/test.txt
Alternatively, we can use the NameNode web UI to inspect the HDFS directory.

316. Why HDFS performs replication, although it results in data redundancy?
Answer: In HDFS, replication provides fault tolerance. Data replication is one of the most important and unique features of HDFS. It solves the problem of data loss under unfavourable conditions such as node crashes and hardware failures. By default, HDFS creates three replicas of each block across the cluster, and this can be changed as needed. So if any node goes down, the data on that node can be recovered from another node.
Replication does lead to the consumption of a lot of space, but the user can always add more nodes to the cluster if needed. HDFS was designed to store huge data sets, and the replication factor can be lowered to save HDFS space, or the compression codecs provided by Hadoop can be used to compress the data.

317. What is the default replication factor and how will you change it?
Answer: The default replication factor is 3. It can be changed in following three ways:
1. By adding this property to hdfs-site.xml:
<property>
<name>dfs.replication</name>
<value>5</value>
<description>Block Replication</description>
</property>
2. It can also be changed on a per-file basis using the command:
hadoop fs -setrep -w 3 /file_location
3. It can also be changed for all the files in a directory by using:
hadoop fs -setrep -w 3 -R /directory_location

318. What do you mean by the High Availability of a NameNode in Hadoop HDFS?
Answer: In Hadoop 1.0, the NameNode is a single point of failure (SPOF): if the NameNode fails, all clients, including MapReduce jobs, are unable to read, write, or list files. In such an event the whole Hadoop system is out of service until a new NameNode comes online.
Hadoop 2.0 overcomes this single point of failure by providing support for multiple NameNodes. The high-availability feature adds an extra NameNode (an active-standby pair) to the Hadoop architecture, configured for automatic failover. If the active NameNode fails, the standby NameNode takes over all its responsibilities and the cluster continues to work.
The initial implementation of HDFS NameNode high availability provided one active NameNode and one standby NameNode. However, some deployments require a higher degree of fault tolerance; this is enabled in version 3.0, which allows the user to run multiple standby NameNodes. For instance, with three NameNodes and five JournalNodes, the cluster can tolerate the failure of two NameNodes rather than one.

319. Explain Erasure Coding in Hadoop?
Answer: By default, HDFS replicates each block three times. Replication in HDFS is a very simple and robust form of redundancy that shields against DataNode failure, but it is expensive: the 3x replication scheme has 200% overhead in storage space and other resources.
Erasure coding was introduced in Hadoop 3.0 as a new feature to use in place of replication. It provides the same level of fault tolerance with much less storage; the storage overhead is only about 50%.
Erasure coding borrows ideas from RAID (Redundant Array of Inexpensive Disks), which implements EC through striping: logically sequential data, such as a file, is divided into smaller units (bits, bytes, or blocks) and stored on different disks.
Encoding – for each stripe of data cells, parity cells are calculated and stored, and errors can be recovered from the parity. Erasure coding extends a message with redundant data for fault tolerance. An EC codec operates on uniformly sized data cells: it takes a number of data cells as input and produces parity cells as output. Data cells and parity cells together are known as an erasure coding group.
There are following algorithms available for Erasure Coding:
1. XOR Algorithm
2. Reed-Solomon Algorithm

320. What is Disk Balancer in Hadoop?
Answer: HDFS provides a command-line tool called the disk balancer. It distributes data evenly across all disks of a DataNode. The tool operates against a given DataNode and moves blocks from one disk to another.
The disk balancer works by creating a plan, which is a set of statements, and executing that plan on the DataNode. The plan describes how much data should move between pairs of disks and is composed of multiple steps; a move step has a source disk, a destination disk, and the number of bytes to move. The plan is executed against an operational DataNode.
By default, the disk balancer is not enabled; to enable it, dfs.disk.balancer.enabled should be set to true in hdfs-site.xml.
When writing new blocks, a DataNode chooses a volume (each directory is a volume in HDFS terminology) according to one of two policies:
1. Round-robin: it distributes the new blocks evenly across the available disks.
2. Available space: it writes data to the disk that has the most free space, by percentage.

323. How would you check whether your NameNode is working or not?
Answer: The jps command can be used to check the status of all daemons running in HDFS and verify that the NameNode process is listed.

321. What are file permissions in HDFS and how HDFS check permissions for files or directory?
Answer: The Hadoop Distributed File System (HDFS) implements a permissions model for files and directories. For each file or directory, permissions are managed for three distinct user classes: the owner, the group, and others. There are three different permissions for each user class: read (r), write (w), and execute (x).
For files, the r permission is required to read the file and the w permission to write to it.
For directories, the r permission is required to list the contents of the directory, the w permission to create or delete entries in it, and the x permission to access a child of the directory.
HDFS checks permissions for a file or directory as follows:
If the username matches the owner of the directory, the owner's permissions are checked.
If the group matches the directory's group, Hadoop tests the user's group permissions.
If neither the owner nor the group name matches, Hadoop tests the "other" permissions.
If none of the permission checks succeed, the client's request is denied.

322. If DataNode increases, then do we need to upgrade NameNode?
Answer: The NameNode stores metadata, for example the number of blocks, their locations, and their replicas. This metadata is kept in memory on the master for faster retrieval. The NameNode maintains and manages the slave nodes, assigns tasks to them, and regulates clients' access to files.
It also performs file system operations such as naming, opening, and closing files and directories.
During Hadoop installation, the framework sizes the NameNode based on the size of the cluster. Usually we do not need to upgrade the NameNode when DataNodes are added, because it does not store the actual data, only the metadata, so such a requirement rarely arises.

323. How many Mappers run for a MapReduce job in Hadoop?
Answer: A mapper task processes each input record from the RecordReader and generates key-value pairs. The number of mappers depends on two factors:
The amount of data to process together with the block size, which determines the number of InputSplits. With a block size of 128 MB and 10 TB of input data, there would be roughly 82,000 maps. Ultimately, the InputFormat determines the number of maps.
The configuration of the slave node, for example the number of cores and the RAM available. The right number of mappers per node is between 10 and 100. The Hadoop framework allocates roughly 1 to 1.5 cores of the processor to each mapper, so a 15-core processor can run about 10 mappers.
In a MapReduce job, the number of mappers can be controlled by changing the block size: changing the block size increases or decreases the number of InputSplits (split-size settings can also be used, as sketched below).
Using JobConf's conf.setNumMapTasks(int num), the number of map tasks can be increased manually as a hint.
Mappers = (total data size) / (input split size)
Data size = 1 TB
Input split size = 100 MB
Hence, Mappers = (1000 * 1000) / 100 = 10,000
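As a hedged illustration, the split size (and therefore the mapper count) can also be bounded in the driver instead of relying on the old setNumMapTasks() hint; the values below are purely illustrative:
// Constrain splits to between 128 MB and 256 MB.
FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);
FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);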

324. How many Reducers run for a MapReduce job in Hadoop?
Answer: The reducer takes the set of intermediate key-value pairs produced by the mappers as input and runs a reduce function on each of them to generate the output. The output of the reducer is the final output, which is stored in HDFS. We usually perform aggregation or summation-style computation in the reducer.
With Job.setNumReduceTasks(int), the user sets the number of reducers for the job (a driver sketch follows below). The right number of reducers is given by the formula:
0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>).
With 0.95, all the reducers launch immediately and start transferring map outputs as the maps finish.
With 1.75, the faster nodes finish the first round of reduces and then launch a second wave of reduces.
Increasing the number of reducers:
1. increases framework overhead;
2. improves load balancing;
3. lowers the cost of failures.
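A minimal driver sketch applying the 0.95 rule of thumb (the node and container counts are hypothetical):
// 10 worker nodes with 8 containers each => 0.95 * 80 = 76 reducers.
int nodes = 10;
int containersPerNode = 8;
job.setNumReduceTasks((int) (0.95 * nodes * containersPerNode));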

325. What happens if the number of reducers is 0 in Hadoop?
Answer: If we set the number of reducers to 0, no reducer executes and no aggregation takes place. In that case we prefer a map-only job in Hadoop: the map does all the work on its InputSplit, the reducer does no job, and the map output is the final output.
Between the map and reduce phases there is a sort and shuffle phase, which is responsible for sorting the keys in ascending order and grouping the values of the same key. This phase is very expensive, so if the reduce phase is not needed we should avoid it. Avoiding the reduce phase eliminates the sort and shuffle phase as well, which also saves network congestion: during shuffling the output of the mappers travels to the reducers, and when the data size is huge, a lot of data has to travel across the network.

326. What do you mean by shuffling and sorting in MapReduce?
Answer: Shuffling and sorting take place after the completion of the map tasks; in Hadoop the shuffle and sort phases occur simultaneously.
Shuffling – the process of transferring data from the mappers to the reducers, i.e. the process by which the system sorts the key-value output of the map tasks and transfers it to the reducers.
The shuffle phase is necessary for the reducers; otherwise they would have no input. Shuffling starts even before the map phase has finished, which saves some time and completes the task sooner.
Sorting – the mappers generate intermediate key-value pairs, and before the reducers start, the MapReduce framework sorts these pairs by key.
Sorting helps the reducer easily distinguish when a new reduce task should start, which saves time for the reducer.
Shuffling and sorting are not performed at all if we specify zero reducers (setNumReduceTasks(0)).

327. What is the fundamental difference between a MapReduce InputSplit and HDFS block?
Answer:
By definition
Block – A block is a contiguous location on the hard drive where HDFS stores data. A file system stores data as a collection of blocks; HDFS stores each file as blocks and distributes them across the Hadoop cluster.
InputSplit – An InputSplit represents the data that an individual mapper processes. A split is further divided into records, and each record, a key-value pair, is processed by the map function.
Data representation
Block – It is the physical representation of the data.
InputSplit – It is the logical representation of the data; it is what a MapReduce program or other processing technique works with. Importantly, the InputSplit does not contain the input data itself; it is just a reference to the data.
Size
Block – The default size of an HDFS block is 128 MB, which can be configured as required. All blocks of a file are the same size except the last one, which is the same size or smaller. Files are split into 128 MB blocks and then stored in the Hadoop file system.
InputSplit – By default, the split size is approximately equal to the block size.
Example
Suppose we need to store a file in HDFS. HDFS stores files as blocks, and a block is the smallest unit of data that is stored or retrieved from disk; its default size is 128 MB. HDFS breaks files into blocks and stores these blocks on different nodes in the cluster. If we have a file of 130 MB, HDFS breaks it into 2 blocks.
If we ran a MapReduce operation directly on the blocks, it could not process them correctly, because the second block is incomplete. The InputSplit solves this problem: it forms a logical grouping of blocks as a single unit. Because the InputSplit includes the location of the next block and the byte offset of the data needed to complete the record, a split can span block boundaries.

328. How to submit extra files(jars, static files) for MapReduce job during runtime?
Answer: The MapReduce framework provides the Distributed Cache to cache files needed by applications. It can cache read-only text files, archives, jar files, and so on.
First, an application that wants to use the distributed cache to distribute a file must make sure that the file is available on a URL of the form hdfs:// or http://. The user then specifies it as a cache file to be distributed. The framework copies the cache file to all the nodes before any tasks start on those nodes; the files are copied only once per job, and applications must not modify them. A sketch with the newer Job API is shown below.
By default the size of the distributed cache is 10 GB; it can be adjusted using local.cache.size.
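A hedged sketch using the newer Job API (the HDFS path is hypothetical):
// Register a read-only lookup file; tasks read it from their local working copy.
Job job = Job.getInstance(conf, "cache demo");
job.addCacheFile(new URI("hdfs:///apps/lookup/countries.txt"));
Older JobConf-based code uses DistributedCache.addCacheFile(uri, conf) for the same purpose.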

329. What kind of Hardware is best for Hadoop?
Answer: Hadoop runs well on dual-processor/dual-core machines with 4-8 GB of RAM using ECC memory. The exact hardware depends on the workflow needs.

330. Explain the use of the .media class.
Answer: The .media class is used to float media objects from one side to another.

331. Give the use of the bootstrap panel.
Answer: Panels in Bootstrap are used for boxing DOM components.

332. What is the purpose of button groups?
Answer: Button groups are used to place more than one button on the same line.

333. Name the various types of lists supported by Bootstrap.
Answer: Following the various types of lists supported by Bootstrap:
1. Ordered list
2. Unordered list
3. Definition list

334. What is shuffling in MapReduce?
Answer: Shuffling is the process of sorting the map outputs and transferring them to the reducer as input.

335. How is indexing done in HDFS?
Answer: Hadoop has a unique way of indexing. Once the data is stored as per the block size, HDFS keeps on storing the last part of the data, which specifies the location of the next part of the data.

336. What are the network requirements for using Hadoop?
Answer: Following are the network requirements for using Hadoop:
1. Password-less SSH connection.
2. Secure Shell (SSH) for launching server processes.

337. What do you understand by storage and compute nodes?
Answer: Storage node: the machine or computer where the file system resides to store the data being processed.
Compute node: the machine or computer where the actual business logic is executed.

338. How to debug Hadoop code?
Answer: There are many ways for debugging Hadoop codes but the most popular methods are:
1. By using Counters.
2. By web interface provided by the Hadoop framework.

339. What commands are used to see all jobs running in the Hadoop cluster and kill a job in LINUX?
Answer: The following commands are used to see all jobs running in the Hadoop cluster and to kill a job in Linux:
1. hadoop job -list
2. hadoop job -kill jobID

340. Do you know some companies that are using Hadoop?
Answer: The following companies are using Hadoop:
1. Yahoo – uses Hadoop extensively.
2. Facebook – developed Hive for analysis.
Spotify, Amazon, Adobe, Netflix, eBay, and Twitter are some other companies that use Hadoop.

341. What should you consider while deploying a secondary NameNode?
Answer: A secondary NameNode should always be deployed on a separate standalone system. This prevents it from interfering with the operations of the primary node.

342. What are the important properties of hdfs-site.xml?
Answer: The following are important properties of hdfs-site.xml:
1. dfs.data.dir – identifies the location where the DataNode stores its data.
2. dfs.name.dir – identifies the location of the metadata storage and whether DFS is located on disk or on a remote location.
3. fs.checkpoint.dir – the directory used by the Secondary NameNode.

343. What are the essential Hadoop tools that enhance the performance of Big Data?
Answer: Some of the essential Hadoop tools that enhance the performance of big data processing are:
Hive, HDFS, HBase, Avro, SQL and NoSQL databases, Oozie, clouds, Flume, Solr/Lucene, and ZooKeeper.

344. What happens when two clients try to access the same file in HDFS?
Answer: HDFS supports exclusive writes only; it processes one write request for a file at a time.
When the first client contacts the NameNode to open the file for writing, the NameNode grants that client a lease to create the file. When a second client sends a request to open the same file for writing, the NameNode finds that the lease for that file has already been granted to another client and rejects the second client's request.

345. What is the procedure to compress mapper output without affecting reducer output?
Answer: In order to compress the mapper output without affecting the reducer output, set the following:
conf.set("mapreduce.map.output.compress", "true")
conf.set("mapreduce.output.fileoutputformat.compress", "false")

346. Explain different methods of a Reducer.
Answer: The different methods of a Reducer are as follows (a skeleton is shown below):
1. setup() – used to configure different parameters, such as the input data size.
Syntax: public void setup(Context context)
2. cleanup() – used to clean up any temporary files at the end of the task.
Syntax: public void cleanup(Context context)
3. reduce() – the heart of the reducer; it is called once per key with the associated list of values.
Syntax: public void reduce(Key key, Iterable<Value> values, Context context)
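A minimal Reducer skeleton putting the three methods together (the types are illustrative):
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void setup(Context context) {
        // One-time initialisation, e.g. reading job parameters.
    }
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        context.write(key, new IntWritable(sum)); // called once per key
    }
    @Override
    protected void cleanup(Context context) {
        // Release any resources opened in setup().
    }
}
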
350. What are the common Hadoop shell commands used for copy operations?
Answer: The following are common Hadoop shell commands for copy operations:
1. fs -copyToLocal
2. fs -put
3. fs -copyFromLocal

347. We have a Hive partitioned table where the country is the partition column. We have 10 partitions and data is available for just one country. If we want to copy the data for other 9 partitions, will it be reflected with a command or manually?
Answer: In the above example, the data will be reflected for the other partitions only if it is loaded through Hive commands; files copied manually into the partition directories are not visible to Hive until the metastore is updated (for example with ALTER TABLE ... ADD PARTITION or MSCK REPAIR TABLE).

348. What is the difference between the -put, -copyToLocal, and -copyFromLocal commands?
Answer: These three commands are differentiated in the following way:
-put: copies a file from a source to a destination.
-copyToLocal: copies a file from the Hadoop file system to the local file system.
-copyFromLocal: copies a file from the local file system to the Hadoop file system.

349. What is the difference between Left Semi Join and Inner Join?
Answer: A Left Semi Join returns tuples only from the left-hand table, whereas an Inner Join returns the common tuples from both the left-hand and right-hand tables, depending on the given condition.

350. Is it possible to change the block size in Hadoop?
Answer: Yes, it is possible to change the block size from the default value. The following parameter is used in the hdfs-site.xml file to set the block size in Hadoop:
dfs.block.size (dfs.blocksize in newer releases)

351. How will you check if NameNode is working properly with the use of jps command?
Answer: Run the jps command and check whether the NameNode process appears in the list of running daemons. On packaged installations, the service status can also be checked with:
/etc/init.d/hadoop-0.20-namenode status

352. MapReduce jobs are getting failed on a recently restarted cluster while these jobs were working well before the restart. What can be the reason for this failure?
Answer: Depending on the size of the data, replication can take some time, because the Hadoop cluster needs to copy/replicate all the data. So the likely reason for the job failures is that the data is large and the replication process has not yet completed; it can take from a few minutes to some hours before the jobs work properly.

353. What does /etc /init.d do?
Answer:
/etc/init.d specifies where daemon (service) scripts are placed and is used to see the status of these daemons. It is very Linux specific and has nothing to do with Hadoop.

354. What if a Namenode has no data?
Answer: If the NameNode has no data, it is not really a NameNode; without metadata it cannot be part of a functioning Hadoop cluster.

355. What happens to job tracker when Namenode is down?
Answer: When the NameNode is down, the cluster is effectively off, because the NameNode is the single point of failure in HDFS (without high availability).

356. Since the data is replicated thrice in HDFS, does it mean that any calculation done on one node will also be replicated on the other two?
Answer: No, calculations are done only on the original data. The master node knows which node holds the particular data. If that node does not respond, it is assumed to have failed, and only then is the required calculation done on a node holding the second replica.

357. What is the communication channel between client and namenode/datanode?
Answer: The client communicates with the NameNode and DataNodes over RPC on top of TCP/IP; SSH is only used by the cluster start/stop scripts, not for client communication.

358. What is a rack?
Answer: A rack is a storage area where DataNodes are put together; it is a physical collection of DataNodes stored at a single location. A cluster can have multiple racks, and racks can be physically located at different places.

359. Why ‘Reading‘ is done in parallel and ‘Writing‘ is not in HDFS?
Answer: When reading, a file can be processed in parallel because its blocks can be read by splitting them across mappers in a MapReduce program. When writing, the incoming values are not yet known to the system, so MapReduce-style parallelism cannot be applied and parallel writing is not possible.

360. Copy a directory from one node in the cluster to another
Answer:
The hadoop distcp command is used to copy a directory from one node (or cluster) to another.
The default replication factor of a file is 3.
The -setrep command is used to change the replication factor of a file, for example to 2:
hadoop fs -setrep -w 2 apache_hadoop/sample.txt

361. Which file holds the Hadoop core configuration?
Answer: core-default.xml holds the read-only defaults; site-specific overrides go into core-site.xml.

362. Is there an HDFS command to see the available free space in HDFS?
Answer: hadoop dfsadmin -report

363. The requirement is to add a new data node to a running Hadoop cluster; how do I start services on just one data node?
Answer: We do not need to shut down and/or restart the entire cluster in this case.
Add the new node's DNS name to the conf/slaves file on the master node.
Log in to the new slave node and execute:
$ cd path/to/hadoop
$ bin/hadoop-daemon.sh start datanode
$ bin/hadoop-daemon.sh start tasktracker
Then issue hadoop dfsadmin -refreshNodes and hadoop mradmin -refreshNodes so that the NameNode and the JobTracker know about the additional node that has been added.

364. How do you gracefully stop a running job?
Answer: hadoop job -kill jobid

365. Does the name-node stay in safe mode till all under-replicated files are fully replicated?
Answer: No. During safe mode, replication of blocks is prohibited. The NameNode waits until all, or a majority of, the DataNodes have reported their blocks.

366. You have a directory XYZ that has the following files – Hadoop343training.txt,_Spark343Training.txt,#DataScience343Training.txt, .Salesforce343Training.txt. If we pass the XYZ directory to the Hadoop MapReduce jobs, how many files are there to be processed?
Answer: Hadoop343training.txt and #DataScience343Training.txt are the only files that will be processed by the MapReduce job. This happens because FileInputFormat ignores any file whose name begins with a hidden-file prefix such as “_” or “.” when processing files in Hadoop; MapReduce's FileInputFormat uses the HiddenFileFilter class by default to skip all such files. However, we can create a custom filter to change this behaviour.

367. What are the different Flume-NG Channel types?
Answer: Following are the Flume-NG channel types:
1. Memory Channel,
2. JDBC Channel,
3. Kafka Channel,
4. File Channel,
5. Spillable Memory Channel,
6. Pseudo Transaction Channel.
In basic Flume, we have channel type such as memory, JDBC, file and Kafka.

368. What is Base class in java?
Answer: A base class is a class that facilitates the creation of other classes. In object-oriented programming terms, a class that inherits from it is called a derived class. This helps reuse code implicitly from the base class, except for constructors and destructors.

369. What is Base class in scala?
Answer: The base class concept is the same in Java and Scala; the difference is in the syntax. In Scala, a derived class extends its base class with the extends keyword.
Example
abstract class Base(val x: String)
final class Derived(x: String) extends Base("Base's " + x) {
  override def toString = x
}

370. What is Immutable data with respect to Hadoop?
Answer: Immutability is the idea that data or objects cannot be modified once they are created. This concept underpins Hadoop's ability to compute over large data without data loss or failures. Programming languages such as Java and Python treat strings as immutable objects, which means they cannot be changed after creation, as illustrated below.
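A small Java illustration of string immutability:
String original = "hadoop";
String extended = original.concat("-3.x"); // returns a new String object
System.out.println(original);              // still prints "hadoop"
System.out.println(extended);              // prints "hadoop-3.x"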

371. How is formatting done in HDFS?
Answer: The Hadoop Distributed File System (HDFS) is formatted using the bin/hadoop namenode -format command. This command formats HDFS via the NameNode and is used only the first time. Formatting the file system means initializing the directory specified by the dfs.name.dir variable. If this command is executed on an existing file system, all the data stored on the NameNode is deleted. Formatting the NameNode does not format the DataNodes.

372. What are the contents of the masters file in Hadoop?
Answer: The masters file contains the Secondary NameNode server location.

373. Explain about spill factor with respect to the RAM?
Answer: The map output is stored in an in-memory buffer; when this buffer is almost full, the spilling phase starts in order to move the data to a temporary folder on disk.

Map output is first written to this buffer, whose size is decided by mapreduce.task.io.sort.mb; by default it is 100 MB.
When the buffer reaches a certain threshold, it starts spilling the buffered data to disk. This threshold is specified in mapreduce.map.sort.spill.percent.

374. Why do we require a password-less SSH in Fully Distributed environment?
Answer: We require password-less SSH in a fully distributed environment because, once the cluster is live, communication is very frequent: the master must be able to reach the DataNode and NodeManager machines quickly to start, stop, and manage their processes without prompting for a password each time.

375. How to copy file from local hard disk to hdfs?
Answer: hadoop fs -copyFromLocal <localfilepath> <hdfsfilepath>

376. Map-side join / hive join
Answer: To optimize the performance of Hive queries we use a map-side join. A map-side join can be used when one of the tables in the join is small enough to be loaded into memory.
A managed table stores its data in the /user/hive/warehouse/tablename folder; when the table is dropped, the data is lost along with the table schema.
An external table stores its data in the location specified by the user; when the table is dropped, only the table schema is lost and the data is still available in HDFS for further use.

377. Differentiate between bucketing and partitioning
Answer: Bucketing – The bucketing concept is used for data sampling. Hive bucketing can be applied to both managed and external tables. Bucketing is performed on a single column only, not on more than one column; the values of this column are distributed into a number of buckets using a hash algorithm. Bucketing is an optimization technique that improves performance.
Partitioning – Partitioning can be done with one or more columns, and sub-partitioning (a partition within a partition) is allowed. In static partitioning we need to specify the partitions ourselves, whereas in dynamic partitioning the number of partitions is decided by the number of unique values in the partitioned column.

378. Syntax to create hive table with partitioning
Answer: Create table tablename
(
var1 datatype1,
var2 datatype2,
var3 datatype3
)
PARTITIONED BY (var4 datatype4,var5 datatype5)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘delimiter’
LINES TERMINATED BY ‘\n’
TBLPROPERTIES (“SKIP.HEADER.LINE.COUNT”=”1”)

379. What file formats are available in Sqoop import?
Answer: Delimited text and SequenceFile.
Delimited text is the default import file format; it can also be specified explicitly with --as-textfile.
SequenceFile is a binary format, specified with --as-sequencefile.

380. For Default number of mappers in a sqoop command.
Answer: The default number of mappers is four in a sqoop command.

381. What is the maximum number of mappers used in a Sqoop import command?
Answer: The maximum sensible number of mappers depends on many variables:
1. The database type.
2. The hardware used for your database server.

382. Flume Architecture
Answer: External data source ==> Source ==> Channel ==> Sink ==> HDFS

383. In Unix, command to show all processes
Answer: ps

384. What is Interceptor?
Answer: An interceptor is a Flume plug-in that listens to incoming events and can alter an event's content on the fly.

385. File formats in hive.
Answer: Following are the file formats in Hive:
1. ORC file format – Optimized Row Columnar file format.
2. RC file format – Record Columnar file format.
3. TEXT file format – the default file format.
4. Sequence file format – if the size of a file is smaller than the HDFS block size, it is considered a small file, and the resulting growth in metadata becomes an overhead for the NameNode. Sequence files were introduced to solve this problem: they act as containers that store multiple small files.
5. Avro file format.
6. Custom input and output file formats.

386. Syntax to create bucketed table.
Answer: create table tablename
(
var1 datatype1,
var2 datatype2,
var3 datatype3
)
PARTITIONED BY (var4 datatype4,var5 datatype5)
CLUSTERED BY (VAR1) INTO 5 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘delimiter’
LINES TERMINATED BY ‘\n’
TBLPROPERTIES (“SKIP.HEADER.LINE.COUNT”=”1”)

387. What is Custom Partitioning?
Answer: A custom partitioner is a mechanism that allows us to control which reducer stores which results, based on a user-defined condition. By setting a partitioner that partitions by key, we make sure that all records with the same key go to the same reducer.

388. Difference between order by and sort by
Answer: Hive supports both: SORT BY sorts the data within each reducer, while ORDER BY sorts the data across all reducers, i.e. it produces a total ordering of the data.

389. Sqoop Incremental last modified.
Answer: bin/sqoop import --connect jdbc:mysql://localhost/database --table table_name --incremental lastmodified --check-column column_name --last-value 'value' -m 1

390. Difference MR1 vs MR2
Answer: MR1 – It consists of a JobTracker and TaskTrackers for processing, and a NameNode and DataNodes for storage. It supports only the MapReduce framework.
MR2 (YARN) – The JobTracker is split into two parts: an ApplicationMaster (one per MR job) and a ResourceManager (only one per cluster). It supports the MapReduce framework as well as other frameworks such as Spark and Storm.

391. What does select * from table return for a normal table and for a partitioned table?
Answer: It returns the same results in both scenarios.

392. What is Explode and implode in hive.
Answer:
1. Explode – It expands an array of values into individual rows.
Syntax:
select pageid, adid from page LATERAL VIEW explode(adid_list) mytable as adid;
2. Implode – It aggregates records from multiple rows into either an array or a map; it is the opposite of explode().
Syntax: select userid, collect_set(actor_id) from actor group by userid;

393. Explain Interceptors in Flume.
Answer: Interceptors are designed to modify or drop an event of data. Flume is designed to pick up data from a source and drop it into a sink.
Timestamp interceptor: adds the timestamp at which the event was processed to the event header.
Host interceptor: writes the hostname or IP address of the host on which the agent or process is running to the event header.
Static interceptor: adds a static string with a static header to all events.
UUID interceptor (Universally Unique Identifier): sets a UUID on all events that are intercepted.
Search-and-replace interceptor: searches for and replaces a string with a value in the event data.
Regex filtering interceptor: used for inclusion/exclusion of events; it filters events selectively by interpreting the event body as text and matching that text against a configured regular expression.
Regex extractor interceptor: extracts matches of a regular expression from the event body.

394. Write a pig script to extract hive table
Answer: We need to enter the Pig shell with the useHCatalog option (pig -useHCatalog).
A = LOAD 'tablename' USING org.apache.hive.hcatalog.pig.HCatLoader();
Example: A = LOAD 'airline.airdata' USING org.apache.hive.hcatalog.pig.HCatLoader();

395. How can Sqoop extract from a database only the data newer than the current date minus one day?
Answer:
Syntax: sqoop import --connect jdbc:mysql://localhost/database --table table_name --where "time_stamp > day(now()-1)"

396. Are UNION, UNION ALL, MINUS, and INTERSECT available in Hive?
Answer:
UNION and UNION ALL are available:
select_statement UNION [ALL | DISTINCT] select_statement

The MINUS keyword is not available in Hive.
The INTERSECT keyword is not available in Hive.

397. Difference between Distribute by, cluster by, order by, sort by
Answer: 1. Distribute by – distributes the data among n reducers in an unsorted manner.
2. Cluster by – distributes the data among n reducers and sorts it; it is a combination of distribute by and sort by.
3. Order by – sorts the data across all reducers (total ordering).
4. Sort by – sorts the data within each reducer.

398. Why are nodes removed and added regularly in a Hadoop cluster?
Answer: The Hadoop framework uses commodity hardware, which is one of its great features, so DataNode crashes are common in a Hadoop cluster.
Ease of scaling is another primary feature of the Hadoop framework, and nodes are added in line with the rapid increase in data volume.

399. What are the various schedulers available in Hadoop?
Answer: The schedulers available in Hadoop are: COSHH – it makes scheduling decisions by analyzing the cluster, the workload, and heterogeneity. FIFO Scheduler – it orders the jobs by their arrival time in a queue, without considering heterogeneity.
Fair Scheduler – it defines a pool for each user that contains a number of map and reduce slots on a resource; each user is allowed to use their own pool to run jobs.

400. What is the NameNode in Hadoop?
Answer: The NameNode is the node in Hadoop that stores all the file location information for the Hadoop distributed file system; it is the core of HDFS. It keeps a record of all the files in the file system and tracks the file data across the cluster or across multiple machines.

401. What is the JobTracker in Hadoop? What actions does it perform?
Answer: The JobTracker is used in Hadoop to submit and monitor MapReduce jobs. It runs in its own JVM process.
The JobTracker performs the following actions:
1. The client application submits jobs to the JobTracker.
2. The JobTracker contacts the NameNode to determine the location of the data.
3. The JobTracker locates TaskTracker nodes near the data or with available slots.
4. It submits the work to the chosen TaskTracker nodes.
5. If a task fails, the JobTracker is notified and decides what to do next.
6. The TaskTracker nodes are monitored by the JobTracker.

402. What happens if one Hadoop client renames a file or a directory containing this file while another client is still writing into it?
Answer: A file appears in the namespace as soon as it is created. If a writer is writing to a file and another client renames either the file itself or any of its path components, the original writer gets an IOException either when it finishes writing to the current block or when it closes the file.

403. How to make a large cluster smaller by taking out some of the nodes?
Answer: Hadoop offers a decommission feature for retiring a set of existing DataNodes. The nodes to be retired are included in an exclude file, and the exclude file name is specified by the configuration parameter dfs.hosts.exclude.
The decommission process can be terminated at any time by editing the configuration or the exclude file and repeating the -refreshNodes command.

404. Can we search for files using wildcards?
Answer: Yes, we can search for files using wildcards. For example, to list all the files that begin with the letter a, we can use the ls command with the a* wildcard:
hdfs dfs -ls a*

405. What does it mean when a file could only be replicated to 0 nodes, instead of 1?
Answer: It means the NameNode does not have any DataNode available to store the block.

 

Hadoop is an open-source data processing framework that is widely used by organizations to manage data processing and storage for their big data clusters. Hadoop is a leading technology used in data analytics, data processing, predictive analytics, and more.

Hadoop has the capability to process structured and unstructured data, which gives organizations many options to process the data associated with their business, available in various formats from different sources. Hadoop has become a default technology for any organization with big data processing initiatives. As the demand for data analytics is only going to increase in the coming years, Hadoop is expected to remain in demand. If you are either pursuing or looking to start a career in Hadoop, we strongly suggest you go through the above 405 frequently asked Hadoop interview questions, which will immensely enrich your knowledge and help you succeed in your next job interview.

Kiara is a career consultant focused on helping professionals get into job positions in the tech industry. She is also a skilled copywriter who has created well-crafted content for a wide range of online publications in the career and education industry.