HDFS Blocksize

Let us get into details related to blocksize in HDFS.

  • HDFS stands for Hadoop Distributed File System.

  • Large files are split into blocks, and those blocks are physically stored on multiple nodes in a distributed fashion.

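  • Let us review the hdfs fsck output of /public/randomtextwriter/part-m-00000. The file is approximately 1 GB in size (1,102,230,331 bytes) and you will see that it is stored as 9 blocks.

    • 8 blocks of 128 MB each (134,217,728 bytes)

    • 1 block of approximately 27 MB (28,488,507 bytes)

  • The file is slightly larger than 1 GB, so it is stored in 9 blocks. This follows from the default block size of 128 MB: the first 8 blocks are full, and the last block holds only the remaining bytes.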
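The block count above can be reproduced with integer arithmetic. The following is a minimal sketch; `block_layout` is a hypothetical helper name (not a Hadoop command), and the 128 MB default block size is passed as an assumption that can be overridden.

```shell
# Hypothetical helper (not part of Hadoop): prints "<number of blocks> <bytes in last block>"
# for a file of SIZE bytes stored with a block size of BLOCKSIZE bytes.
block_layout() (
    size=$1
    blocksize=${2:-134217728}   # assumed default: 128 MB = 128 * 1024 * 1024 bytes
    if [ "$size" -eq 0 ]; then
        echo "0 0"              # an empty file has no blocks
        exit 0
    fi
    blocks=$(( (size + blocksize - 1) / blocksize ))   # ceiling division
    last=$(( size - (blocks - 1) * blocksize ))        # bytes left for the final block
    echo "${blocks} ${last}"
)

# File size in bytes as reported by hdfs fsck for part-m-00000
block_layout 1102230331
```

For this file the helper prints `9 28488507`, which matches the 9 blocks and the length of the last block in the fsck output.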

%%sh

hdfs dfs -ls -h /public/randomtextwriter/part-m-00000
-rw-r--r--   3 hdfs hdfs      1.0 G 2017-01-18 20:24 /public/randomtextwriter/part-m-00000
%%sh

hdfs fsck /public/randomtextwriter/part-m-00000 \
    -files \
    -blocks \
    -locations
FSCK started by itversity (auth:SIMPLE) from /172.16.1.114 for path /public/randomtextwriter/part-m-00000 at Thu Jan 21 05:42:10 EST 2021
/public/randomtextwriter/part-m-00000 1102230331 bytes, 9 block(s):  OK
0. BP-292116404-172.16.1.101-1479167821718:blk_1074171511_431441 len=134217728 repl=3 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.102:50010,DS-b0f1636e-fd08-4ddb-bba9-9df8868dfb5d,DISK], DatanodeInfoWithStorage[172.16.1.103:50010,DS-1f4edfab-2926-45f9-a37c-ae9d1f542680,DISK]]
1. BP-292116404-172.16.1.101-1479167821718:blk_1074171524_431454 len=134217728 repl=3 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.102:50010,DS-1edb1d35-81bf-471b-be04-11d973e2a832,DISK], DatanodeInfoWithStorage[172.16.1.103:50010,DS-1f4edfab-2926-45f9-a37c-ae9d1f542680,DISK]]
2. BP-292116404-172.16.1.101-1479167821718:blk_1074171559_431489 len=134217728 repl=3 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.102:50010,DS-b0f1636e-fd08-4ddb-bba9-9df8868dfb5d,DISK], DatanodeInfoWithStorage[172.16.1.103:50010,DS-1f4edfab-2926-45f9-a37c-ae9d1f542680,DISK]]
3. BP-292116404-172.16.1.101-1479167821718:blk_1074171609_431539 len=134217728 repl=3 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.102:50010,DS-b0f1636e-fd08-4ddb-bba9-9df8868dfb5d,DISK], DatanodeInfoWithStorage[172.16.1.103:50010,DS-7fb58858-abe9-4a52-9b75-755d849a897b,DISK]]
4. BP-292116404-172.16.1.101-1479167821718:blk_1074171657_431587 len=134217728 repl=3 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.102:50010,DS-b0f1636e-fd08-4ddb-bba9-9df8868dfb5d,DISK], DatanodeInfoWithStorage[172.16.1.107:50010,DS-a12c4ae3-3f6a-42fc-83ff-7779a9fc0482,DISK]]
5. BP-292116404-172.16.1.101-1479167821718:blk_1074171691_431621 len=134217728 repl=3 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.102:50010,DS-b0f1636e-fd08-4ddb-bba9-9df8868dfb5d,DISK], DatanodeInfoWithStorage[172.16.1.103:50010,DS-7fb58858-abe9-4a52-9b75-755d849a897b,DISK]]
6. BP-292116404-172.16.1.101-1479167821718:blk_1074171721_431651 len=134217728 repl=3 [DatanodeInfoWithStorage[172.16.1.102:50010,DS-b0f1636e-fd08-4ddb-bba9-9df8868dfb5d,DISK], DatanodeInfoWithStorage[172.16.1.107:50010,DS-6679d10e-378c-4897-8c0e-250aa1af790a,DISK], DatanodeInfoWithStorage[172.16.1.108:50010,DS-736614f7-27de-46b8-987f-d669be6a32a3,DISK]]
7. BP-292116404-172.16.1.101-1479167821718:blk_1074171731_431661 len=134217728 repl=3 [DatanodeInfoWithStorage[172.16.1.102:50010,DS-1edb1d35-81bf-471b-be04-11d973e2a832,DISK], DatanodeInfoWithStorage[172.16.1.107:50010,DS-a12c4ae3-3f6a-42fc-83ff-7779a9fc0482,DISK], DatanodeInfoWithStorage[172.16.1.108:50010,DS-698dde50-a336-4e00-bc8f-a9e1a5cc76f4,DISK]]
8. BP-292116404-172.16.1.101-1479167821718:blk_1074171736_431666 len=28488507 repl=3 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.102:50010,DS-1edb1d35-81bf-471b-be04-11d973e2a832,DISK], DatanodeInfoWithStorage[172.16.1.107:50010,DS-6679d10e-378c-4897-8c0e-250aa1af790a,DISK]]

Status: HEALTHY
 Total size:	1102230331 B
 Total dirs:	0
 Total files:	1
 Total symlinks:		0
 Total blocks (validated):	9 (avg. block size 122470036 B)
 Minimally replicated blocks:	9 (100.0 %)
 Over-replicated blocks:	0 (0.0 %)
 Under-replicated blocks:	0 (0.0 %)
 Mis-replicated blocks:		0 (0.0 %)
 Default replication factor:	2
 Average block replication:	3.0
 Corrupt blocks:		0
 Missing replicas:		0 (0.0 %)
 Number of data-nodes:		5
 Number of racks:		1
FSCK ended at Thu Jan 21 05:42:10 EST 2021 in 1 milliseconds


The filesystem under path '/public/randomtextwriter/part-m-00000' is HEALTHY
Connecting to namenode via http://172.16.1.101:50070/fsck?ugi=itversity&files=1&blocks=1&locations=1&path=%2Fpublic%2Frandomtextwriter%2Fpart-m-00000
  • The default block size is 128 MB. A cluster can override it in hdfs-site.xml.

  • The property name is dfs.blocksize.

  • If a file is smaller than the block size (128 MB), it is stored in a single block whose length equals the file size. The block does not take up a full 128 MB.
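Instead of reading the whole configuration file, the effective value can usually be queried with `hdfs getconf -confKey dfs.blocksize`. The sketch below assumes a Hadoop client is on the PATH; `bytes_to_mb` is a hypothetical helper added only to make the value easier to read. The property may also be configured with a size suffix (for example `128m`), in which case the raw value is printed as is.

```shell
# Hypothetical helper: convert a byte count to whole MB (1 MB = 1024 * 1024 bytes).
bytes_to_mb() {
    echo $(( $1 / 1024 / 1024 ))
}

# Query the configured block size only when a Hadoop client is available.
if command -v hdfs >/dev/null 2>&1; then
    blocksize=$(hdfs getconf -confKey dfs.blocksize)
    case "${blocksize}" in
        # Empty or non-numeric value (for example "128m"): print it unchanged.
        ''|*[!0-9]*) echo "dfs.blocksize = ${blocksize}" ;;
        *) echo "dfs.blocksize = ${blocksize} bytes ($(bytes_to_mb "${blocksize}") MB)" ;;
    esac
fi

# The 128 MB default expressed in bytes
bytes_to_mb 134217728
```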

%%sh

cat /etc/hadoop/conf/hdfs-site.xml
  • Let us determine the number of blocks for /data/retail_db/orders/part-00000. If we store this 2.9 MB file in HDFS, there will be one block associated with it, because the file is smaller than the block size.

  • It occupies 2.9 MB of storage in HDFS (assuming a replication factor of 1).
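The raw storage consumed across the cluster is the file size multiplied by the replication factor. The sketch below uses a hypothetical helper named `raw_storage`; the byte count 2999944 is the size that fsck reports for the copy of this file in HDFS.

```shell
# Hypothetical helper: raw bytes consumed across all DataNodes
# for a file of SIZE bytes stored with the given replication factor.
raw_storage() (
    size=$1
    replication=${2:-1}   # assumption: replication factor 1 unless stated otherwise
    echo $(( size * replication ))
)

raw_storage 2999944 1   # replication factor 1, as assumed in the text
raw_storage 2999944 2   # replication factor 2, as reported by fsck (repl=2)
```

With a replication factor of 2 the same file takes twice the raw space (5999888 bytes), although it is still a single logical block.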

%%sh

ls -lhtr /data/retail_db/orders/part-00000
-rw-r--r-- 1 root root 2.9M Feb 20  2017 /data/retail_db/orders/part-00000
%%sh

hdfs fsck /user/${USER}/retail_db/orders/part-00000 -files -blocks -locations
FSCK started by itversity (auth:SIMPLE) from /172.16.1.114 for path /user/itversity/retail_db/orders/part-00000 at Thu Jan 21 05:43:52 EST 2021
/user/itversity/retail_db/orders/part-00000 2999944 bytes, 1 block(s):  OK
0. BP-292116404-172.16.1.101-1479167821718:blk_1115455902_41737439 len=2999944 repl=2 [DatanodeInfoWithStorage[172.16.1.102:50010,DS-1edb1d35-81bf-471b-be04-11d973e2a832,DISK], DatanodeInfoWithStorage[172.16.1.108:50010,DS-736614f7-27de-46b8-987f-d669be6a32a3,DISK]]

Status: HEALTHY
 Total size:	2999944 B
 Total dirs:	0
 Total files:	1
 Total symlinks:		0
 Total blocks (validated):	1 (avg. block size 2999944 B)
 Minimally replicated blocks:	1 (100.0 %)
 Over-replicated blocks:	0 (0.0 %)
 Under-replicated blocks:	0 (0.0 %)
 Mis-replicated blocks:		0 (0.0 %)
 Default replication factor:	2
 Average block replication:	2.0
 Corrupt blocks:		0
 Missing replicas:		0 (0.0 %)
 Number of data-nodes:		5
 Number of racks:		1
FSCK ended at Thu Jan 21 05:43:52 EST 2021 in 1 milliseconds


The filesystem under path '/user/itversity/retail_db/orders/part-00000' is HEALTHY
Connecting to namenode via http://172.16.1.101:50070/fsck?ugi=itversity&files=1&blocks=1&locations=1&path=%2Fuser%2Fitversity%2Fretail_db%2Forders%2Fpart-00000
  • Let us determine the number of blocks for /data/yelp-dataset-json/yelp_academic_dataset_user.json. If we store this 2.4 GB file in HDFS, there will be 19 blocks associated with it.

    • 18 blocks of 128 MB each

    • 1 block of approximately 67 MB (69,828,289 bytes)

  • It occupies 2.4 GB of storage in HDFS (assuming a replication factor of 1).
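To see how a file of this size is laid out, the byte range of each block can be listed. This is a sketch under the assumption of a 128 MB block size; `block_ranges` is a hypothetical helper, and 2485747393 is the size in bytes that fsck reports for the file.

```shell
# Hypothetical helper: prints one line per block as "<index> <offset> <length>".
block_ranges() (
    size=$1
    blocksize=${2:-134217728}   # assumed default: 128 MB
    index=0
    offset=0
    while [ "$offset" -lt "$size" ]; do
        remaining=$(( size - offset ))
        # Every block is full except possibly the last one.
        if [ "$remaining" -lt "$blocksize" ]; then
            len=$remaining
        else
            len=$blocksize
        fi
        echo "${index} ${offset} ${len}"
        index=$(( index + 1 ))
        offset=$(( offset + len ))
    done
)

# Show only the last three blocks of the 19
block_ranges 2485747393 | tail -n 3
```

The final line shows block 18 starting at offset 2415919104 with a length of 69828289 bytes, which is the partial block.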

%%sh

ls -lhtr /data/yelp-dataset-json/yelp_academic_dataset_user.json
-rwxr-xr-x 1 training training 2.4G Feb  5  2019 /data/yelp-dataset-json/yelp_academic_dataset_user.json
  • We can validate this by running the hdfs fsck command against the copy of the same file in HDFS.

%%sh

hdfs fsck /public/yelp-dataset-json/yelp_academic_dataset_user.json \
    -files \
    -blocks \
    -locations
FSCK started by itversity (auth:SIMPLE) from /172.16.1.114 for path /public/yelp-dataset-json/yelp_academic_dataset_user.json at Thu Jan 21 05:44:47 EST 2021
/public/yelp-dataset-json/yelp_academic_dataset_user.json 2485747393 bytes, 19 block(s):  OK
0. BP-292116404-172.16.1.101-1479167821718:blk_1101225469_27499779 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.107:50010,DS-a12c4ae3-3f6a-42fc-83ff-7779a9fc0482,DISK], DatanodeInfoWithStorage[172.16.1.108:50010,DS-698dde50-a336-4e00-bc8f-a9e1a5cc76f4,DISK]]
1. BP-292116404-172.16.1.101-1479167821718:blk_1101225470_27499780 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.103:50010,DS-7fb58858-abe9-4a52-9b75-755d849a897b,DISK], DatanodeInfoWithStorage[172.16.1.108:50010,DS-736614f7-27de-46b8-987f-d669be6a32a3,DISK]]
2. BP-292116404-172.16.1.101-1479167821718:blk_1101225471_27499781 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.102:50010,DS-b0f1636e-fd08-4ddb-bba9-9df8868dfb5d,DISK], DatanodeInfoWithStorage[172.16.1.108:50010,DS-698dde50-a336-4e00-bc8f-a9e1a5cc76f4,DISK]]
3. BP-292116404-172.16.1.101-1479167821718:blk_1101225472_27499782 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.107:50010,DS-6679d10e-378c-4897-8c0e-250aa1af790a,DISK]]
4. BP-292116404-172.16.1.101-1479167821718:blk_1101225473_27499783 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-98fec5a6-72a9-4590-99cc-cee3a51f4dd5,DISK], DatanodeInfoWithStorage[172.16.1.102:50010,DS-1edb1d35-81bf-471b-be04-11d973e2a832,DISK]]
5. BP-292116404-172.16.1.101-1479167821718:blk_1101225474_27499784 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.102:50010,DS-b0f1636e-fd08-4ddb-bba9-9df8868dfb5d,DISK]]
6. BP-292116404-172.16.1.101-1479167821718:blk_1101225475_27499785 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-98fec5a6-72a9-4590-99cc-cee3a51f4dd5,DISK], DatanodeInfoWithStorage[172.16.1.103:50010,DS-1f4edfab-2926-45f9-a37c-ae9d1f542680,DISK]]
7. BP-292116404-172.16.1.101-1479167821718:blk_1101225476_27499786 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.103:50010,DS-7fb58858-abe9-4a52-9b75-755d849a897b,DISK], DatanodeInfoWithStorage[172.16.1.108:50010,DS-736614f7-27de-46b8-987f-d669be6a32a3,DISK]]
8. BP-292116404-172.16.1.101-1479167821718:blk_1101225477_27499787 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.102:50010,DS-1edb1d35-81bf-471b-be04-11d973e2a832,DISK], DatanodeInfoWithStorage[172.16.1.108:50010,DS-698dde50-a336-4e00-bc8f-a9e1a5cc76f4,DISK]]
9. BP-292116404-172.16.1.101-1479167821718:blk_1101225478_27499788 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.102:50010,DS-b0f1636e-fd08-4ddb-bba9-9df8868dfb5d,DISK]]
10. BP-292116404-172.16.1.101-1479167821718:blk_1101225479_27499789 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-98fec5a6-72a9-4590-99cc-cee3a51f4dd5,DISK], DatanodeInfoWithStorage[172.16.1.103:50010,DS-1f4edfab-2926-45f9-a37c-ae9d1f542680,DISK]]
11. BP-292116404-172.16.1.101-1479167821718:blk_1101225480_27499790 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.107:50010,DS-a12c4ae3-3f6a-42fc-83ff-7779a9fc0482,DISK], DatanodeInfoWithStorage[172.16.1.103:50010,DS-7fb58858-abe9-4a52-9b75-755d849a897b,DISK]]
12. BP-292116404-172.16.1.101-1479167821718:blk_1101225481_27499791 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.107:50010,DS-6679d10e-378c-4897-8c0e-250aa1af790a,DISK], DatanodeInfoWithStorage[172.16.1.108:50010,DS-736614f7-27de-46b8-987f-d669be6a32a3,DISK]]
13. BP-292116404-172.16.1.101-1479167821718:blk_1101225482_27499792 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.107:50010,DS-a12c4ae3-3f6a-42fc-83ff-7779a9fc0482,DISK], DatanodeInfoWithStorage[172.16.1.103:50010,DS-1f4edfab-2926-45f9-a37c-ae9d1f542680,DISK]]
14. BP-292116404-172.16.1.101-1479167821718:blk_1101225483_27499793 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.102:50010,DS-1edb1d35-81bf-471b-be04-11d973e2a832,DISK], DatanodeInfoWithStorage[172.16.1.108:50010,DS-698dde50-a336-4e00-bc8f-a9e1a5cc76f4,DISK]]
15. BP-292116404-172.16.1.101-1479167821718:blk_1101225484_27499794 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.102:50010,DS-b0f1636e-fd08-4ddb-bba9-9df8868dfb5d,DISK], DatanodeInfoWithStorage[172.16.1.108:50010,DS-736614f7-27de-46b8-987f-d669be6a32a3,DISK]]
16. BP-292116404-172.16.1.101-1479167821718:blk_1101225485_27499795 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.107:50010,DS-6679d10e-378c-4897-8c0e-250aa1af790a,DISK], DatanodeInfoWithStorage[172.16.1.108:50010,DS-698dde50-a336-4e00-bc8f-a9e1a5cc76f4,DISK]]
17. BP-292116404-172.16.1.101-1479167821718:blk_1101225486_27499796 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.107:50010,DS-a12c4ae3-3f6a-42fc-83ff-7779a9fc0482,DISK]]
18. BP-292116404-172.16.1.101-1479167821718:blk_1101225487_27499797 len=69828289 repl=2 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-98fec5a6-72a9-4590-99cc-cee3a51f4dd5,DISK], DatanodeInfoWithStorage[172.16.1.107:50010,DS-6679d10e-378c-4897-8c0e-250aa1af790a,DISK]]

Status: HEALTHY
 Total size:	2485747393 B
 Total dirs:	0
 Total files:	1
 Total symlinks:		0
 Total blocks (validated):	19 (avg. block size 130828810 B)
 Minimally replicated blocks:	19 (100.0 %)
 Over-replicated blocks:	0 (0.0 %)
 Under-replicated blocks:	0 (0.0 %)
 Mis-replicated blocks:		0 (0.0 %)
 Default replication factor:	2
 Average block replication:	2.0
 Corrupt blocks:		0
 Missing replicas:		0 (0.0 %)
 Number of data-nodes:		5
 Number of racks:		1
FSCK ended at Thu Jan 21 05:44:47 EST 2021 in 1 milliseconds


The filesystem under path '/public/yelp-dataset-json/yelp_academic_dataset_user.json' is HEALTHY
Connecting to namenode via http://172.16.1.101:50070/fsck?ugi=itversity&files=1&blocks=1&locations=1&path=%2Fpublic%2Fyelp-dataset-json%2Fyelp_academic_dataset_user.json
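The per-block lines in the fsck output can also be summarized instead of being read one by one. The sketch below assumes each block line contains a `len=<bytes>` field, as in the output above; `summarize_fsck_blocks` is a hypothetical helper that reads fsck output on standard input and prints the block count and the total number of bytes.

```shell
# Hypothetical helper: reads hdfs fsck output on stdin and prints
# "<number of blocks> <sum of block lengths in bytes>".
summarize_fsck_blocks() (
    count=0
    total=0
    # Pull out every len=<bytes> field and keep only the number.
    for len in $(grep -o 'len=[0-9]*' | cut -d= -f2); do
        count=$(( count + 1 ))
        total=$(( total + len ))
    done
    echo "${count} ${total}"
)

# Run against the cluster only when a Hadoop client is available.
if command -v hdfs >/dev/null 2>&1; then
    hdfs fsck /public/yelp-dataset-json/yelp_academic_dataset_user.json \
        -files \
        -blocks \
        -locations | summarize_fsck_blocks
fi
```

For the output shown above, the total should equal the file size reported by fsck (2485747393 bytes across 19 blocks).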