HDFS Blocksize
Let us get into details related to blocksize in HDFS.
HDFS stands for Hadoop Distributed File System.
This means that large files are split into blocks, which are physically stored on multiple nodes in a distributed fashion.
Let us review the hdfs fsck output of /public/randomtextwriter/part-m-00000. The file is approximately 1 GB in size and you will see 9 blocks.
8 blocks of size 128 MB (134217728 bytes each)
1 block of size 27 MB approximately (28488507 bytes)
It means a file of roughly 1 GB plus 27 MB (1102230331 bytes) is stored in 9 blocks. This is due to the default block size, which is 128 MB.
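The arithmetic behind these numbers can be sketched in shell as follows. This is only an illustration: block_count and last_block_size are hypothetical helper names, not HDFS commands, and the block size is assumed to be the 128 MB default.

```shell
# Assumed block size: the 128 MB default, expressed in bytes.
BLOCK_SIZE=$((128 * 1024 * 1024))

# Hypothetical helper: number of blocks needed for a file of the given size (bytes).
# Ceiling division, so a partially filled last block still counts as one block.
block_count() {
  local size=$1
  echo $(( (size + BLOCK_SIZE - 1) / BLOCK_SIZE ))
}

# Hypothetical helper: size in bytes of the last block of the file.
last_block_size() {
  local size=$1
  local rem=$(( size % BLOCK_SIZE ))
  # A non-empty file that is an exact multiple of the block size ends with a full block.
  if [ "$rem" -eq 0 ] && [ "$size" -gt 0 ]; then
    rem=$BLOCK_SIZE
  fi
  echo "$rem"
}

block_count 1102230331       # file size in bytes, as reported by hdfs fsck
last_block_size 1102230331
```

For the file above, this prints 9 and 28488507, which matches the block list in the fsck report.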
%%sh
hdfs dfs -ls -h /public/randomtextwriter/part-m-00000
-rw-r--r-- 3 hdfs hdfs 1.0 G 2017-01-18 20:24 /public/randomtextwriter/part-m-00000
%%sh
hdfs fsck /public/randomtextwriter/part-m-00000 \
-files \
-blocks \
-locations
FSCK started by itversity (auth:SIMPLE) from /172.16.1.114 for path /public/randomtextwriter/part-m-00000 at Thu Jan 21 05:42:10 EST 2021
/public/randomtextwriter/part-m-00000 1102230331 bytes, 9 block(s): OK
0. BP-292116404-172.16.1.101-1479167821718:blk_1074171511_431441 len=134217728 repl=3 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.102:50010,DS-b0f1636e-fd08-4ddb-bba9-9df8868dfb5d,DISK], DatanodeInfoWithStorage[172.16.1.103:50010,DS-1f4edfab-2926-45f9-a37c-ae9d1f542680,DISK]]
1. BP-292116404-172.16.1.101-1479167821718:blk_1074171524_431454 len=134217728 repl=3 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.102:50010,DS-1edb1d35-81bf-471b-be04-11d973e2a832,DISK], DatanodeInfoWithStorage[172.16.1.103:50010,DS-1f4edfab-2926-45f9-a37c-ae9d1f542680,DISK]]
2. BP-292116404-172.16.1.101-1479167821718:blk_1074171559_431489 len=134217728 repl=3 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.102:50010,DS-b0f1636e-fd08-4ddb-bba9-9df8868dfb5d,DISK], DatanodeInfoWithStorage[172.16.1.103:50010,DS-1f4edfab-2926-45f9-a37c-ae9d1f542680,DISK]]
3. BP-292116404-172.16.1.101-1479167821718:blk_1074171609_431539 len=134217728 repl=3 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.102:50010,DS-b0f1636e-fd08-4ddb-bba9-9df8868dfb5d,DISK], DatanodeInfoWithStorage[172.16.1.103:50010,DS-7fb58858-abe9-4a52-9b75-755d849a897b,DISK]]
4. BP-292116404-172.16.1.101-1479167821718:blk_1074171657_431587 len=134217728 repl=3 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.102:50010,DS-b0f1636e-fd08-4ddb-bba9-9df8868dfb5d,DISK], DatanodeInfoWithStorage[172.16.1.107:50010,DS-a12c4ae3-3f6a-42fc-83ff-7779a9fc0482,DISK]]
5. BP-292116404-172.16.1.101-1479167821718:blk_1074171691_431621 len=134217728 repl=3 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.102:50010,DS-b0f1636e-fd08-4ddb-bba9-9df8868dfb5d,DISK], DatanodeInfoWithStorage[172.16.1.103:50010,DS-7fb58858-abe9-4a52-9b75-755d849a897b,DISK]]
6. BP-292116404-172.16.1.101-1479167821718:blk_1074171721_431651 len=134217728 repl=3 [DatanodeInfoWithStorage[172.16.1.102:50010,DS-b0f1636e-fd08-4ddb-bba9-9df8868dfb5d,DISK], DatanodeInfoWithStorage[172.16.1.107:50010,DS-6679d10e-378c-4897-8c0e-250aa1af790a,DISK], DatanodeInfoWithStorage[172.16.1.108:50010,DS-736614f7-27de-46b8-987f-d669be6a32a3,DISK]]
7. BP-292116404-172.16.1.101-1479167821718:blk_1074171731_431661 len=134217728 repl=3 [DatanodeInfoWithStorage[172.16.1.102:50010,DS-1edb1d35-81bf-471b-be04-11d973e2a832,DISK], DatanodeInfoWithStorage[172.16.1.107:50010,DS-a12c4ae3-3f6a-42fc-83ff-7779a9fc0482,DISK], DatanodeInfoWithStorage[172.16.1.108:50010,DS-698dde50-a336-4e00-bc8f-a9e1a5cc76f4,DISK]]
8. BP-292116404-172.16.1.101-1479167821718:blk_1074171736_431666 len=28488507 repl=3 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.102:50010,DS-1edb1d35-81bf-471b-be04-11d973e2a832,DISK], DatanodeInfoWithStorage[172.16.1.107:50010,DS-6679d10e-378c-4897-8c0e-250aa1af790a,DISK]]
Status: HEALTHY
Total size: 1102230331 B
Total dirs: 0
Total files: 1
Total symlinks: 0
Total blocks (validated): 9 (avg. block size 122470036 B)
Minimally replicated blocks: 9 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 2
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 5
Number of racks: 1
FSCK ended at Thu Jan 21 05:42:10 EST 2021 in 1 milliseconds
The filesystem under path '/public/randomtextwriter/part-m-00000' is HEALTHY
Connecting to namenode via http://172.16.1.101:50070/fsck?ugi=itversity&files=1&blocks=1&locations=1&path=%2Fpublic%2Frandomtextwriter%2Fpart-m-00000
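To cross-check the report, the len= values of the individual blocks can be added up and compared with the total size. The sketch below assumes the block lines have been saved to a file. Here a temporary file holds abbreviated copies of the nine block lines above (block pool prefix and datanode locations dropped), and sum_block_lengths is a hypothetical helper name.

```shell
# Abbreviated block lines from the fsck output above (locations omitted for brevity).
fsck_out=$(mktemp)
trap 'rm -f "$fsck_out"' EXIT
cat > "$fsck_out" <<'EOF'
0. blk_1074171511_431441 len=134217728 repl=3
1. blk_1074171524_431454 len=134217728 repl=3
2. blk_1074171559_431489 len=134217728 repl=3
3. blk_1074171609_431539 len=134217728 repl=3
4. blk_1074171657_431587 len=134217728 repl=3
5. blk_1074171691_431621 len=134217728 repl=3
6. blk_1074171721_431651 len=134217728 repl=3
7. blk_1074171731_431661 len=134217728 repl=3
8. blk_1074171736_431666 len=28488507 repl=3
EOF

# Hypothetical helper: print "<block count> <total bytes>" for all len= fields in a file.
sum_block_lengths() {
  awk '
    {
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^len=[0-9]+$/) {
          total += substr($i, 5)   # strip the leading "len="
          blocks++
        }
      }
    }
    END { printf "%d %.0f\n", blocks, total }
  ' "$1"
}

sum_block_lengths "$fsck_out"
```

The result, 9 blocks and 1102230331 bytes, agrees with the Total blocks and Total size lines of the report.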
The default block size is 128 MB (134217728 bytes) and it is set as part of hdfs-site.xml. The property name is dfs.blocksize. If the file size is smaller than the default block size (128 MB), there will be only one block, and that block is only as large as the file itself.
%%sh
cat /etc/hadoop/conf/hdfs-site.xml
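For orientation, a property entry in hdfs-site.xml generally has the shape shown in the sketch below. The sample file contents and the get_property helper are assumptions made for illustration. The actual file on a cluster will contain many more properties and may not set dfs.blocksize explicitly at all. On a node with the Hadoop client installed, hdfs getconf -confKey dfs.blocksize reports the effective value.

```shell
# Hypothetical sample of hdfs-site.xml, written to a temporary file for illustration.
sample_conf=$(mktemp)
trap 'rm -f "$sample_conf"' EXIT
cat > "$sample_conf" <<'EOF'
<configuration>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>
  </property>
</configuration>
EOF

# Hypothetical helper: print the <value> that follows a given <name> entry.
# It relies on <name> and <value> being on adjacent lines, as in the sample;
# this is a line-based sketch, not a general XML parser.
get_property() {
  local name=$1 file=$2
  grep -A1 "<name>${name}</name>" "$file" \
    | sed -n 's:.*<value>\(.*\)</value>.*:\1:p'
}

get_property dfs.blocksize "$sample_conf"

# If the Hadoop client is available, ask Hadoop itself for the effective value.
if command -v hdfs >/dev/null 2>&1; then
  hdfs getconf -confKey dfs.blocksize
fi
```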
Let us determine the number of blocks for /data/retail_db/orders/part-00000. If we store this file of size 2.9 MB in HDFS, there will be one block associated with it, as the size of the file is less than the block size. It occupies 2.9 MB of storage in HDFS (assuming a replication factor of 1).
%%sh
ls -lhtr /data/retail_db/orders/part-00000
-rw-r--r-- 1 root root 2.9M Feb 20 2017 /data/retail_db/orders/part-00000
%%sh
hdfs fsck /user/${USER}/retail_db/orders/part-00000 -files -blocks -locations
FSCK started by itversity (auth:SIMPLE) from /172.16.1.114 for path /user/itversity/retail_db/orders/part-00000 at Thu Jan 21 05:43:52 EST 2021
/user/itversity/retail_db/orders/part-00000 2999944 bytes, 1 block(s): OK
0. BP-292116404-172.16.1.101-1479167821718:blk_1115455902_41737439 len=2999944 repl=2 [DatanodeInfoWithStorage[172.16.1.102:50010,DS-1edb1d35-81bf-471b-be04-11d973e2a832,DISK], DatanodeInfoWithStorage[172.16.1.108:50010,DS-736614f7-27de-46b8-987f-d669be6a32a3,DISK]]
Status: HEALTHY
Total size: 2999944 B
Total dirs: 0
Total files: 1
Total symlinks: 0
Total blocks (validated): 1 (avg. block size 2999944 B)
Minimally replicated blocks: 1 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 2
Average block replication: 2.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 5
Number of racks: 1
FSCK ended at Thu Jan 21 05:43:52 EST 2021 in 1 milliseconds
The filesystem under path '/user/itversity/retail_db/orders/part-00000' is HEALTHY
Connecting to namenode via http://172.16.1.101:50070/fsck?ugi=itversity&files=1&blocks=1&locations=1&path=%2Fuser%2Fitversity%2Fretail_db%2Forders%2Fpart-00000
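The fsck output above shows repl=2 for this block, so the raw disk space consumed across the cluster is larger than the logical file size. A minimal sketch of that arithmetic follows, with raw_storage_bytes as a hypothetical helper name. It ignores the small per-block checksum metadata that datanodes keep alongside the data.

```shell
# Hypothetical helper: raw bytes consumed across all datanodes,
# i.e. the logical file size multiplied by the replication factor.
raw_storage_bytes() {
  local size=$1 replication=$2
  echo $(( size * replication ))
}

raw_storage_bytes 2999944 1   # replication factor 1, as assumed in the text
raw_storage_bytes 2999944 2   # replication factor 2, as reported by fsck (repl=2)
```

With a replication factor of 2, the 2999944-byte file takes up 5999888 bytes of raw storage, spread over two datanodes.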
Let us determine the number of blocks for /data/yelp-dataset-json/yelp_academic_dataset_user.json. If we store this file of size 2.4 GB in HDFS, there will be 19 blocks associated with it.
18 blocks of size 128 MB
1 block of size 67 MB approximately (69828289 bytes)
It occupies 2.4 GB of storage in HDFS (assuming a replication factor of 1).
%%sh
ls -lhtr /data/yelp-dataset-json/yelp_academic_dataset_user.json
-rwxr-xr-x 1 training training 2.4G Feb 5 2019 /data/yelp-dataset-json/yelp_academic_dataset_user.json
We can validate this by running the hdfs fsck command against the same file in HDFS.
%%sh
hdfs fsck /public/yelp-dataset-json/yelp_academic_dataset_user.json \
-files \
-blocks \
-locations
FSCK started by itversity (auth:SIMPLE) from /172.16.1.114 for path /public/yelp-dataset-json/yelp_academic_dataset_user.json at Thu Jan 21 05:44:47 EST 2021
/public/yelp-dataset-json/yelp_academic_dataset_user.json 2485747393 bytes, 19 block(s): OK
0. BP-292116404-172.16.1.101-1479167821718:blk_1101225469_27499779 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.107:50010,DS-a12c4ae3-3f6a-42fc-83ff-7779a9fc0482,DISK], DatanodeInfoWithStorage[172.16.1.108:50010,DS-698dde50-a336-4e00-bc8f-a9e1a5cc76f4,DISK]]
1. BP-292116404-172.16.1.101-1479167821718:blk_1101225470_27499780 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.103:50010,DS-7fb58858-abe9-4a52-9b75-755d849a897b,DISK], DatanodeInfoWithStorage[172.16.1.108:50010,DS-736614f7-27de-46b8-987f-d669be6a32a3,DISK]]
2. BP-292116404-172.16.1.101-1479167821718:blk_1101225471_27499781 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.102:50010,DS-b0f1636e-fd08-4ddb-bba9-9df8868dfb5d,DISK], DatanodeInfoWithStorage[172.16.1.108:50010,DS-698dde50-a336-4e00-bc8f-a9e1a5cc76f4,DISK]]
3. BP-292116404-172.16.1.101-1479167821718:blk_1101225472_27499782 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.107:50010,DS-6679d10e-378c-4897-8c0e-250aa1af790a,DISK]]
4. BP-292116404-172.16.1.101-1479167821718:blk_1101225473_27499783 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-98fec5a6-72a9-4590-99cc-cee3a51f4dd5,DISK], DatanodeInfoWithStorage[172.16.1.102:50010,DS-1edb1d35-81bf-471b-be04-11d973e2a832,DISK]]
5. BP-292116404-172.16.1.101-1479167821718:blk_1101225474_27499784 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.102:50010,DS-b0f1636e-fd08-4ddb-bba9-9df8868dfb5d,DISK]]
6. BP-292116404-172.16.1.101-1479167821718:blk_1101225475_27499785 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-98fec5a6-72a9-4590-99cc-cee3a51f4dd5,DISK], DatanodeInfoWithStorage[172.16.1.103:50010,DS-1f4edfab-2926-45f9-a37c-ae9d1f542680,DISK]]
7. BP-292116404-172.16.1.101-1479167821718:blk_1101225476_27499786 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.103:50010,DS-7fb58858-abe9-4a52-9b75-755d849a897b,DISK], DatanodeInfoWithStorage[172.16.1.108:50010,DS-736614f7-27de-46b8-987f-d669be6a32a3,DISK]]
8. BP-292116404-172.16.1.101-1479167821718:blk_1101225477_27499787 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.102:50010,DS-1edb1d35-81bf-471b-be04-11d973e2a832,DISK], DatanodeInfoWithStorage[172.16.1.108:50010,DS-698dde50-a336-4e00-bc8f-a9e1a5cc76f4,DISK]]
9. BP-292116404-172.16.1.101-1479167821718:blk_1101225478_27499788 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.102:50010,DS-b0f1636e-fd08-4ddb-bba9-9df8868dfb5d,DISK]]
10. BP-292116404-172.16.1.101-1479167821718:blk_1101225479_27499789 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-98fec5a6-72a9-4590-99cc-cee3a51f4dd5,DISK], DatanodeInfoWithStorage[172.16.1.103:50010,DS-1f4edfab-2926-45f9-a37c-ae9d1f542680,DISK]]
11. BP-292116404-172.16.1.101-1479167821718:blk_1101225480_27499790 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.107:50010,DS-a12c4ae3-3f6a-42fc-83ff-7779a9fc0482,DISK], DatanodeInfoWithStorage[172.16.1.103:50010,DS-7fb58858-abe9-4a52-9b75-755d849a897b,DISK]]
12. BP-292116404-172.16.1.101-1479167821718:blk_1101225481_27499791 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.107:50010,DS-6679d10e-378c-4897-8c0e-250aa1af790a,DISK], DatanodeInfoWithStorage[172.16.1.108:50010,DS-736614f7-27de-46b8-987f-d669be6a32a3,DISK]]
13. BP-292116404-172.16.1.101-1479167821718:blk_1101225482_27499792 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.107:50010,DS-a12c4ae3-3f6a-42fc-83ff-7779a9fc0482,DISK], DatanodeInfoWithStorage[172.16.1.103:50010,DS-1f4edfab-2926-45f9-a37c-ae9d1f542680,DISK]]
14. BP-292116404-172.16.1.101-1479167821718:blk_1101225483_27499793 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.102:50010,DS-1edb1d35-81bf-471b-be04-11d973e2a832,DISK], DatanodeInfoWithStorage[172.16.1.108:50010,DS-698dde50-a336-4e00-bc8f-a9e1a5cc76f4,DISK]]
15. BP-292116404-172.16.1.101-1479167821718:blk_1101225484_27499794 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.102:50010,DS-b0f1636e-fd08-4ddb-bba9-9df8868dfb5d,DISK], DatanodeInfoWithStorage[172.16.1.108:50010,DS-736614f7-27de-46b8-987f-d669be6a32a3,DISK]]
16. BP-292116404-172.16.1.101-1479167821718:blk_1101225485_27499795 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.107:50010,DS-6679d10e-378c-4897-8c0e-250aa1af790a,DISK], DatanodeInfoWithStorage[172.16.1.108:50010,DS-698dde50-a336-4e00-bc8f-a9e1a5cc76f4,DISK]]
17. BP-292116404-172.16.1.101-1479167821718:blk_1101225486_27499796 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.107:50010,DS-a12c4ae3-3f6a-42fc-83ff-7779a9fc0482,DISK]]
18. BP-292116404-172.16.1.101-1479167821718:blk_1101225487_27499797 len=69828289 repl=2 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-98fec5a6-72a9-4590-99cc-cee3a51f4dd5,DISK], DatanodeInfoWithStorage[172.16.1.107:50010,DS-6679d10e-378c-4897-8c0e-250aa1af790a,DISK]]
Status: HEALTHY
Total size: 2485747393 B
Total dirs: 0
Total files: 1
Total symlinks: 0
Total blocks (validated): 19 (avg. block size 130828810 B)
Minimally replicated blocks: 19 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 2
Average block replication: 2.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 5
Number of racks: 1
FSCK ended at Thu Jan 21 05:44:47 EST 2021 in 1 milliseconds
The filesystem under path '/public/yelp-dataset-json/yelp_academic_dataset_user.json' is HEALTHY
Connecting to namenode via http://172.16.1.101:50070/fsck?ugi=itversity&files=1&blocks=1&locations=1&path=%2Fpublic%2Fyelp-dataset-json%2Fyelp_academic_dataset_user.json