## HDFS Blocksize

Let us get into the details of block size in HDFS.
* HDFS stands for Hadoop Distributed File System.
* Large files are split into blocks, and those blocks are physically stored on multiple nodes in a distributed fashion.
* Let us review the `hdfs fsck` output of `/public/randomtextwriter/part-m-00000`. The file is approximately 1 GB in size (1,102,230,331 bytes) and you will see 9 blocks.
  * 8 blocks of size 128 MB
  * 1 block of size 27 MB approximately
* It means a file of roughly 1 GB plus 27 MB is stored in 9 blocks. This is due to the default block size, which is 128 MB.
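
The block count follows from simple integer arithmetic, sketched below. The byte count is the one reported by `hdfs fsck` for this file in this section, and 134217728 is 128 MB expressed in bytes. The function names `num_blocks` and `last_block_bytes` are hypothetical helpers introduced only for illustration; they are not part of any Hadoop tool.

```shell
# Number of blocks = ceiling(file size / block size), using integer arithmetic.
num_blocks() {
    # $1 = file size in bytes, $2 = block size in bytes
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Size of the final block. A non-empty file that is an exact multiple of the
# block size ends with a full block rather than an empty one.
last_block_bytes() {
    remainder=$(( $1 % $2 ))
    if [ "$remainder" -eq 0 ] && [ "$1" -gt 0 ]; then
        echo "$2"
    else
        echo "$remainder"
    fi
}

file_bytes=1102230331      # from the hdfs fsck output for part-m-00000
block_bytes=134217728      # 128 MB = 128 * 1024 * 1024 bytes

echo "blocks: $(num_blocks "$file_bytes" "$block_bytes")"
echo "last block bytes: $(last_block_bytes "$file_bytes" "$block_bytes")"
```

For this file the sketch reports 9 blocks with a final block of 28488507 bytes, which is roughly 27 MB.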

In [1]:
%%sh

hdfs dfs -ls -h /public/randomtextwriter/part-m-00000

-rw-r--r--   3 hdfs hdfs      1.0 G 2017-01-18 20:24 /public/randomtextwriter/part-m-00000


In [2]:
%%sh

hdfs fsck /public/randomtextwriter/part-m-00000 \
    -files \
    -blocks \
    -locations

FSCK started by itversity (auth:SIMPLE) from /172.16.1.114 for path /public/randomtextwriter/part-m-00000 at Thu Jan 21 05:42:10 EST 2021
/public/randomtextwriter/part-m-00000 1102230331 bytes, 9 block(s):  OK
0. BP-292116404-172.16.1.101-1479167821718:blk_1074171511_431441 len=134217728 repl=3 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.102:50010,DS-b0f1636e-fd08-4ddb-bba9-9df8868dfb5d,DISK], DatanodeInfoWithStorage[172.16.1.103:50010,DS-1f4edfab-2926-45f9-a37c-ae9d1f542680,DISK]]
1. BP-292116404-172.16.1.101-1479167821718:blk_1074171524_431454 len=134217728 repl=3 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.102:50010,DS-1edb1d35-81bf-471b-be04-11d973e2a832,DISK], DatanodeInfoWithStorage[172.16.1.103:50010,DS-1f4edfab-2926-45f9-a37c-ae9d1f542680,DISK]]
2. BP-292116404-172.16.1.101-1479167821718:blk_1074171559_431489 len=1342177

Connecting to namenode via http://172.16.1.101:50070/fsck?ugi=itversity&files=1&blocks=1&locations=1&path=%2Fpublic%2Frandomtextwriter%2Fpart-m-00000


* The default block size is 128 MB, and it is configured in `hdfs-site.xml`.
* The property name is `dfs.blocksize`.
* If a file is smaller than the block size (128 MB by default), it is stored in a single block whose length equals the file size. The block is not padded to 128 MB.
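
Besides reading the XML file directly, the effective value can be queried with `hdfs getconf`. The sketch below wraps that call in a hypothetical helper, `effective_blocksize`, and falls back to 134217728 when no `hdfs` client is on the `PATH`. That fallback is an assumption made so the snippet runs anywhere, not a value read from a cluster.

```shell
# Print the block size the local Hadoop client configuration resolves to.
# The fallback value (128 MB in bytes) is an assumption for illustration only.
effective_blocksize() {
    if command -v hdfs >/dev/null 2>&1; then
        # Resolves dfs.blocksize from the client-side configuration files.
        hdfs getconf -confKey dfs.blocksize
    else
        echo 134217728
    fi
}

echo "dfs.blocksize = $(effective_blocksize)"
```

Depending on how the property is written in the configuration, the value may come back as a plain byte count or with a size suffix.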

In [None]:
%%sh

cat /etc/hadoop/conf/hdfs-site.xml

* Let us determine the number of blocks for `/data/retail_db/orders/part-00000`. If we store this 2.9 MB file in HDFS, there will be one block associated with it, as the file is smaller than the block size.
* It occupies 2.9 MB of storage in HDFS (assuming a replication factor of 1). The copy inspected below with `hdfs fsck` reports `repl=2`, so it occupies about twice that across the cluster.
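
As a rough sketch of how replication affects storage, the raw space used across DataNodes is the file size multiplied by the replication factor. The numbers are taken from the `hdfs fsck` output for this file in this section; `raw_storage_bytes` is a hypothetical helper name, and the estimate ignores any per-block metadata kept by the DataNodes.

```shell
# Raw bytes used across DataNodes = file size in bytes * replication factor.
raw_storage_bytes() {
    # $1 = file size in bytes, $2 = replication factor
    echo $(( $1 * $2 ))
}

file_bytes=2999944   # from the hdfs fsck output for part-00000
replication=2        # repl=2 in the same output

echo "raw bytes: $(raw_storage_bytes "$file_bytes" "$replication")"
```

With these inputs the result is 5999888 bytes, a little under 6 MB.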

In [4]:
%%sh

ls -lhtr /data/retail_db/orders/part-00000

-rw-r--r-- 1 root root 2.9M Feb 20  2017 /data/retail_db/orders/part-00000


In [5]:
%%sh

hdfs fsck /user/${USER}/retail_db/orders/part-00000 -files -blocks -locations

FSCK started by itversity (auth:SIMPLE) from /172.16.1.114 for path /user/itversity/retail_db/orders/part-00000 at Thu Jan 21 05:43:52 EST 2021
/user/itversity/retail_db/orders/part-00000 2999944 bytes, 1 block(s):  OK
0. BP-292116404-172.16.1.101-1479167821718:blk_1115455902_41737439 len=2999944 repl=2 [DatanodeInfoWithStorage[172.16.1.102:50010,DS-1edb1d35-81bf-471b-be04-11d973e2a832,DISK], DatanodeInfoWithStorage[172.16.1.108:50010,DS-736614f7-27de-46b8-987f-d669be6a32a3,DISK]]

Status: HEALTHY
 Total size:	2999944 B
 Total dirs:	0
 Total files:	1
 Total symlinks:		0
 Total blocks (validated):	1 (avg. block size 2999944 B)
 Minimally replicated blocks:	1 (100.0 %)
 Over-replicated blocks:	0 (0.0 %)
 Under-replicated blocks:	0 (0.0 %)
 Mis-replicated blocks:		0 (0.0 %)
 Default replication factor:	2
 Average block replication:	2.0
 Corrupt blocks:		0
 Missing replicas:		0 (0.0 %)
 Number of data-nodes:		5
 Number of racks:		1
FSCK ended at Thu Jan 21 05:43:52 EST 2021 in 1 milliseconds

Connecting to namenode via http://172.16.1.101:50070/fsck?ugi=itversity&files=1&blocks=1&locations=1&path=%2Fuser%2Fitversity%2Fretail_db%2Forders%2Fpart-00000


* Let us determine the number of blocks for `/data/yelp-dataset-json/yelp_academic_dataset_user.json`. If we store this 2.4 GB file (2,485,747,393 bytes) in HDFS, there will be 19 blocks associated with it.
  * 18 blocks of 128 MB
  * 1 block of approximately 67 MB
* It occupies 2.4 GB of storage in HDFS (assuming a replication factor of 1).

In [6]:
%%sh

ls -lhtr /data/yelp-dataset-json/yelp_academic_dataset_user.json

-rwxr-xr-x 1 training training 2.4G Feb  5  2019 /data/yelp-dataset-json/yelp_academic_dataset_user.json


* We can validate this by running the `hdfs fsck` command against the same file in HDFS.

In [7]:
%%sh

hdfs fsck /public/yelp-dataset-json/yelp_academic_dataset_user.json \
    -files \
    -blocks \
    -locations

FSCK started by itversity (auth:SIMPLE) from /172.16.1.114 for path /public/yelp-dataset-json/yelp_academic_dataset_user.json at Thu Jan 21 05:44:47 EST 2021
/public/yelp-dataset-json/yelp_academic_dataset_user.json 2485747393 bytes, 19 block(s):  OK
0. BP-292116404-172.16.1.101-1479167821718:blk_1101225469_27499779 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.107:50010,DS-a12c4ae3-3f6a-42fc-83ff-7779a9fc0482,DISK], DatanodeInfoWithStorage[172.16.1.108:50010,DS-698dde50-a336-4e00-bc8f-a9e1a5cc76f4,DISK]]
1. BP-292116404-172.16.1.101-1479167821718:blk_1101225470_27499780 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.103:50010,DS-7fb58858-abe9-4a52-9b75-755d849a897b,DISK], DatanodeInfoWithStorage[172.16.1.108:50010,DS-736614f7-27de-46b8-987f-d669be6a32a3,DISK]]
2. BP-292116404-172.16.1.101-1479167821718:blk_1101225471_27499781 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.102:50010,DS-b0f1636e-fd08-4ddb-bba9-9df8868dfb5d,DISK], DatanodeInfoWithStorage[172.16.1

Connecting to namenode via http://172.16.1.101:50070/fsck?ugi=itversity&files=1&blocks=1&locations=1&path=%2Fpublic%2Fyelp-dataset-json%2Fyelp_academic_dataset_user.json
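
When only the block count is needed, it can be pulled out of the per-file summary line of the `hdfs fsck` output. The following is a minimal sketch that assumes the line format shown in this section (`<path> <bytes> bytes, <n> block(s): ...`); `fsck_block_count` is a hypothetical helper, and the sample line is copied from the output above so that the snippet runs without a cluster.

```shell
# Read `hdfs fsck ... -files -blocks` output on stdin and print the block
# count from the per-file summary line.
fsck_block_count() {
    sed -n 's/.* bytes, \([0-9][0-9]*\) block(s).*/\1/p'
}

# Sample summary line copied from the fsck output in this section. Against a
# live cluster the input would instead come from: hdfs fsck <path> -files -blocks
printf '%s\n' '/public/yelp-dataset-json/yelp_academic_dataset_user.json 2485747393 bytes, 19 block(s):  OK' \
    | fsck_block_count
```

For the sample line this prints 19, matching the block count derived from the file size.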
