A detailed explanation of HDFS, the core storage architecture of Hadoop
When starting out with Hadoop, many learners know that HDFS provides the distributed storage layer of the framework, and so they often misunderstand it: Is HDFS a database? What kind of database is it? In fact, HDFS is not a database at all; it is officially defined as a distributed file system. How should we understand that?
1. Background and Definition
HDFS: a distributed file system used to store files. Its defining feature is distribution: many servers cooperate to provide its functionality, and each server in the cluster plays its own role.
As data volumes grow larger and larger, a single operating system can no longer store all the data, so the data is spread across the disks of many machines. Managing and maintaining files scattered over multiple machines by hand is extremely inconvenient, so a system is urgently needed to manage them uniformly: this is a distributed file system, and HDFS is one of them.
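To make the "spread across many machines" idea concrete, the sketch below (an illustration, not the real HDFS implementation) splits a file into fixed-size blocks, which is how HDFS divides a file before distributing the pieces across DataNodes. The 128 MB block size is the modern HDFS default.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size: 128 MB

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (block_index, block_length) pairs for a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((len(blocks), length))
        offset += length
    return blocks

# A 300 MB file becomes three blocks: 128 MB + 128 MB + 44 MB,
# each of which can live on a different server in the cluster.
print(split_into_blocks(300 * 1024 * 1024))
```

Each block is then stored (and replicated) independently, which is what lets a file grow beyond the capacity of any single disk.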
HDFS is suited to write-once, read-many workloads. It does not support modifying a file in place; it only supports appending to the end of a file.
HDFS adopts streaming data access: like running water, the data does not arrive all at once but "flows" in bit by bit and is processed bit by bit. Waiting until all the data had arrived before processing it would introduce large delays and consume a great deal of memory.
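The streaming idea can be sketched in a few lines: process the data one chunk at a time as it arrives, so memory use stays bounded no matter how large the file is. (This is a generic illustration of the access pattern, not HDFS client code.)

```python
import io

def stream_process(f, chunk_size=4096):
    """Read and handle one chunk at a time instead of loading the whole file."""
    total = 0
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)  # stand-in for real per-chunk processing
    return total

# An in-memory stream stands in for a file "flowing" from HDFS.
data = io.BytesIO(b"x" * 10_000)
print(stream_process(data))
```

The alternative, `f.read()` with no size, would pull the entire file into memory before any processing starts, which is exactly the latency and memory cost the streaming model avoids.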
2. Why choose HDFS to store data
HDFS is a good choice for storing data because it has the following advantages.
High fault tolerance: data is automatically saved in multiple replicas, and a lost replica is automatically recovered.
Suitable for batch processing: computation is moved to the data rather than the data to the computation, and data locations are exposed to the computing framework.
Suitable for big-data processing: GB-, TB-, even PB-scale data; file counts beyond the millions; clusters of 10,000+ nodes.
Streaming file access: write once, read many times; data consistency is guaranteed.
Can be built on cheap machines: multiple replicas improve reliability, and fault-tolerance and recovery mechanisms are provided.
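The first advantage, automatic replica recovery, can be illustrated with a toy model (an assumption-laden sketch, not NameNode code): when a block has fewer replicas than the target replication factor, new copies are scheduled on other live DataNodes until the target is restored.

```python
REPLICATION = 3  # HDFS default replication factor

def re_replicate(replicas, live_nodes, target=REPLICATION):
    """Top a block's replica set back up to `target` copies on live nodes."""
    replicas = set(replicas)
    for node in live_nodes:
        if len(replicas) >= target:
            break
        if node not in replicas:
            replicas.add(node)
    return replicas

# A block lost its copy on dn2; re-replication restores three copies
# using whichever live DataNodes do not already hold one.
print(sorted(re_replicate({"dn1", "dn3"}, ["dn1", "dn2", "dn3", "dn4"])))
```

This is why cheap, failure-prone machines are acceptable: reliability comes from the replication protocol, not from any individual disk.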
3. Disadvantages of HDFS
Not suitable for low-latency access: HDFS is a poor fit for applications where users require responses within a short time. It is designed for analyzing large data sets and optimizes for high data throughput, accepting higher latency as the trade-off.
Inability to store small files efficiently: because the NameNode keeps the file system's metadata in memory, the number of files the system can hold is limited by the NameNode's memory. Roughly speaking, each file, directory, and block occupies about 150 bytes of metadata, so 1 million files, each occupying one block, need at least 300 MB of memory. Millions of files are therefore still feasible, but billions are beyond current hardware. A second problem is that the number of map tasks is determined by the number of input splits: processing a large number of small files with MapReduce generates too many map tasks, and the task-management overhead inflates the job's running time. For example, to process 10,000 MB of data, 1 MB splits produce 10,000 map tasks with heavy management overhead, while 100 MB splits produce only 100 map tasks, each doing more work, with far less overhead.
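The arithmetic behind these figures can be checked directly. The sketch below uses the ~150-bytes-per-object estimate from the paragraph above (file + its block = two metadata objects each):

```python
BYTES_PER_OBJECT = 150  # approximate NameNode metadata cost per file/dir/block

def namenode_memory(num_files, blocks_per_file=1):
    """Rough NameNode heap needed, in bytes, for num_files small files."""
    objects = num_files + num_files * blocks_per_file  # file entries + block entries
    return objects * BYTES_PER_OBJECT

def num_map_tasks(input_mb, split_mb):
    """Map tasks launched when input_mb of data is cut into split_mb splits."""
    return -(-input_mb // split_mb)  # ceiling division

# 1 million one-block files -> 2 million objects * 150 B = 300,000,000 bytes
print(namenode_memory(1_000_000) / 1024**2)  # ~286 MiB, the "300 MB" figure
print(num_map_tasks(10_000, 1))    # 1 MB splits: 10,000 map tasks
print(num_map_tasks(10_000, 100))  # 100 MB splits: only 100 map tasks
```

The same functions make the billion-file case obvious: a billion one-block files would need roughly 300 GB of NameNode heap for metadata alone.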
No support for multiple writers or arbitrary modification of files: a file has only one writer at a time, multiple clients cannot write to it concurrently, and writes can land only at the end of the file. Only appending is supported, not in-place modification.
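The single-writer, append-only rule can be modeled with a toy lease (an illustration only; the class and its methods are hypothetical, not the HDFS lease protocol): one client at a time holds the right to append, and a second open-for-write is rejected.

```python
class AppendOnlyFile:
    """Toy model of HDFS write semantics: one writer, appends only."""

    def __init__(self):
        self.data = b""
        self.writer = None  # current lease holder, if any

    def open_for_append(self, client):
        if self.writer is not None and self.writer != client:
            raise PermissionError("file already has a writer")
        self.writer = client

    def append(self, client, chunk):
        if self.writer != client:
            raise PermissionError("no lease held")
        self.data += chunk  # writes land only at the end of the file

    def close(self, client):
        if self.writer == client:
            self.writer = None

f = AppendOnlyFile()
f.open_for_append("client-A")
f.append("client-A", b"hello")
try:
    f.open_for_append("client-B")   # second concurrent writer is refused
except PermissionError as e:
    print("second writer rejected:", e)
```

Note there is deliberately no `write_at(offset)` method: like HDFS, the model offers no way to modify earlier bytes, only to extend the file.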