HDFS File Read Operations

HDFS works with large data sets, so it is worth knowing how it handles file I/O requests from a client. Before walking through the read operation, let's look at a few important components.

HDFS Client:

The HDFS client interacts with the Namenode and the Datanodes on behalf of the user to fulfill the user's request. The user communicates with HDFS through the FileSystem API and normal I/O operations. The FileSystem API carries out the processing of user requests and responses.

Namenode:

The Namenode is the master node of an HDFS cluster. It stores the metadata and the edit log. The metadata includes the addresses of the block locations on the Datanodes.
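
To make the idea of "metadata only" concrete, here is a minimal Python sketch of the kind of lookup table a Namenode keeps. It is a toy model, not the HDFS implementation: the names ToyNamenode, BlockInfo, add_file and get_block_locations are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class BlockInfo:
    block_id: int
    length: int              # number of bytes stored in this block
    datanodes: List[str]     # addresses of the Datanodes holding a replica


@dataclass
class ToyNamenode:
    """Hypothetical stand-in for the Namenode's file-to-block map."""
    # file path -> ordered list of the blocks that make up the file
    files: Dict[str, List[BlockInfo]] = field(default_factory=dict)

    def add_file(self, path: str, blocks: List[BlockInfo]) -> None:
        self.files[path] = list(blocks)

    def get_block_locations(self, path: str) -> List[Tuple[int, List[str]]]:
        # Only addresses are returned; the file's bytes never pass through here.
        return [(b.block_id, list(b.datanodes)) for b in self.files[path]]


namenode = ToyNamenode()
namenode.add_file("/logs/app.log", [
    BlockInfo(block_id=1, length=128, datanodes=["dn1", "dn2"]),
    BlockInfo(block_id=2, length=40, datanodes=["dn2", "dn3"]),
])
print(namenode.get_block_locations("/logs/app.log"))
```

The point of the sketch is that a lookup returns block IDs and Datanode addresses in file order, and nothing else.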

Datanode:

Datanodes, also known as slave nodes, hold the actual data. A Datanode stores only blocks. A block is the unit used to store and process data, and all data resides within blocks on the Datanodes. Each Datanode sends periodic heartbeat signals to the master node to show that it is alive and can be used to store and retrieve data.
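
The heartbeat idea can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the class name HeartbeatMonitor, its methods, and the 10-second timeout are illustrative choices, not HDFS code or HDFS configuration values. Timestamps are passed in explicitly so the example stays deterministic.

```python
class HeartbeatMonitor:
    """Hypothetical master-side bookkeeping of Datanode heartbeats."""

    def __init__(self, timeout_s: float = 10.0):
        # Assumed timeout: a node with no heartbeat for this long counts as dead.
        self.timeout_s = timeout_s
        self.last_seen = {}  # datanode address -> time of last heartbeat

    def heartbeat(self, datanode: str, now: float) -> None:
        # Called each time a Datanode reports in.
        self.last_seen[datanode] = now

    def live_datanodes(self, now: float):
        # A node is alive if its last heartbeat is recent enough.
        return sorted(dn for dn, seen in self.last_seen.items()
                      if now - seen <= self.timeout_s)


monitor = HeartbeatMonitor(timeout_s=10.0)
monitor.heartbeat("dn1", now=0.0)
monitor.heartbeat("dn2", now=8.0)
print(monitor.live_datanodes(now=9.0))   # both nodes reported recently
print(monitor.live_datanodes(now=15.0))  # dn1 has been silent for too long
```

A node that stops reporting simply drops out of the live list, so it is no longer offered for storing or retrieving data.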

Packet:

A packet is a small chunk of data used during transmission. It is a subset of a block. The default block size is 64 MB or 128 MB, depending on the Hadoop version. Transferring data in units of whole blocks would put a heavy load on the network, so the client API transfers each block in small chunks known as packets.
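
As a rough illustration, the Python sketch below cuts a block into packets and checks each one on arrival. The 64 KB packet size and the single CRC32 per packet are assumptions made to keep the example short; the function names split_into_packets and reassemble are hypothetical. With those assumptions, a 128 MB block would travel as 128 MB / 64 KB = 2048 packets.

```python
import zlib


def split_into_packets(block: bytes, packet_size: int = 64 * 1024):
    """Yield (offset, payload, checksum) for each packet of a block."""
    for offset in range(0, len(block), packet_size):
        payload = block[offset:offset + packet_size]
        yield offset, payload, zlib.crc32(payload)


def reassemble(packets) -> bytes:
    """Verify every packet's checksum and rebuild the block on the client side."""
    buffer = bytearray()
    for offset, payload, checksum in packets:
        if zlib.crc32(payload) != checksum:
            raise IOError(f"checksum mismatch at offset {offset}")
        buffer += payload
    return bytes(buffer)


block = bytes(range(256)) * 1000            # a 256,000-byte stand-in for a block
packets = list(split_into_packets(block))
print(len(packets), len(packets[-1][1]))    # packet count, size of the last packet
assert reassemble(packets) == block
```

The last packet is usually shorter than the others because a block's length is rarely an exact multiple of the packet size.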

HDFS File Read workflow

1. To start the file read operation, the client opens the required file by calling open() on the FileSystem object, which for HDFS is an instance of DistributedFileSystem. The open() call starts the HDFS client's read request.
2. DistributedFileSystem contacts the Namenode to get the block locations of the file to read. Block locations are held in the Namenode's metadata. For each block, the Namenode returns the addresses of the Datanodes that hold a copy of that block, sorted by their proximity to the client.
3. DistributedFileSystem returns an FSDataInputStream, an input stream that supports file seeks, to the client. FSDataInputStream wraps a DFSInputStream, which manages the I/O with the Namenode and the Datanodes.
   a) The client calls read() on the stream. DFSInputStream holds the Datanode addresses for the first few blocks of the file, and it connects to the closest Datanode for the first block.
   b) A block reader is initialized on the target Datanode with:
      • Block ID
      • Data start offset to read from
      • Length of data to read
      • Client name
   c) Data is streamed from the Datanode back to the client in the form of packets and copied into the input buffer provided by the client. The DFS client verifies checksums as it reads and updates the client buffer.
   d) The client calls read() repeatedly on the stream until the end of the block is reached. At that point DFSInputStream closes the connection to the Datanode and finds the closest Datanode for the next block.
4. Blocks are read in order. Once DFSInputStream has finished reading the first few blocks, it calls the Namenode again to retrieve the Datanode locations for the next batch of blocks.
5. When the client has finished reading, it calls close() on the FSDataInputStream to close the connection.
6. If a Datanode goes down during the read, or DFSInputStream encounters an error while communicating with it, DFSInputStream switches to the next available Datanode that holds a replica of the block. DFSInputStream also remembers the Datanodes that encountered an error so that it does not retry them for later blocks.

In this design the client, guided by the Namenode, gets the list of the best Datanodes for each block and contacts those Datanodes directly to retrieve the data. The Namenode only serves the addresses of block locations. It does not serve the data itself, which would become a bottleneck as the number of clients grows. This allows HDFS to scale to large numbers of clients, since the data traffic is spread across all the Datanodes in the cluster.
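
The workflow above can be sketched as a small in-memory simulation in Python. This is not the HDFS client API: ToyCluster, read_file, DatanodeDown, the distance table and the batch size of 2 are all hypothetical names and values chosen for illustration. The sketch shows three things from the steps: locations come back sorted by distance from the client, they are fetched in batches, and a failed Datanode is remembered and skipped for later blocks.

```python
class DatanodeDown(Exception):
    pass


class ToyCluster:
    """In-memory stand-in for one Namenode plus several Datanodes."""

    def __init__(self, blocks, distance, down=()):
        self.blocks = blocks          # ordered list of (block_id, {datanode: bytes})
        self.distance = distance      # datanode -> assumed distance from the client
        self.down = set(down)         # Datanodes that are currently unreachable
        self.namenode_calls = 0

    def get_block_locations(self, start, count):
        # Namenode role: return addresses only, closest replica first.
        self.namenode_calls += 1
        batch = self.blocks[start:start + count]
        return [(block_id, sorted(replicas, key=self.distance.__getitem__))
                for block_id, replicas in batch]

    def read_block(self, datanode, index):
        # Datanode role: serve the block's bytes, or fail if the node is down.
        if datanode in self.down:
            raise DatanodeDown(datanode)
        return self.blocks[index][1][datanode]


def read_file(cluster, batch_size=2):
    data = bytearray()
    dead = set()        # Datanodes that failed earlier; not retried for later blocks
    served_by = []      # which Datanode ended up serving each block
    index = 0
    while index < len(cluster.blocks):
        # Ask the Namenode for the next batch of block locations.
        for block_id, datanodes in cluster.get_block_locations(index, batch_size):
            for datanode in datanodes:
                if datanode in dead:
                    continue
                try:
                    data += cluster.read_block(datanode, index)
                    served_by.append(datanode)
                    break
                except DatanodeDown:
                    dead.add(datanode)   # remember the failure, try the next replica
            else:
                raise IOError(f"no live replica for block {block_id}")
            index += 1
    return bytes(data), served_by, dead


blocks = [
    (1, {"dn1": b"AA", "dn2": b"AA"}),
    (2, {"dn1": b"BB", "dn3": b"BB"}),
    (3, {"dn2": b"CC", "dn3": b"CC"}),
]
cluster = ToyCluster(blocks, distance={"dn1": 0, "dn2": 2, "dn3": 4}, down={"dn1"})
data, served_by, dead = read_file(cluster)
print(data, served_by, sorted(dead), cluster.namenode_calls)
```

In this run dn1 is the closest replica but is down, so the first block falls back to dn2, and the second block skips dn1 without contacting it again. The file's bytes only ever come from read_block, never from get_block_locations, which mirrors how the Namenode stays out of the data path.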
