HDFS Replication Factor
Let us get an overview of replication factor - another important building block of HDFS.
While block size drives the distribution of large files across the cluster, replication factor drives their reliability.
If we had only one copy of each block, then a file would become unreadable as soon as the node holding any of its blocks went down.
HDFS replication mitigates this by maintaining multiple copies of each block.
Keep in mind that the default replication factor is 3 unless we override it.
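The replication factor can also be overridden per file. As a sketch (reusing the same paths that appear later in this section), we can set it at write time with a -D override, or change it on an existing file with setrep:

```shell
# Copy a local file into HDFS keeping only 1 copy of each block
# (-D overrides dfs.replication for this command alone)
hdfs dfs -D dfs.replication=1 -put /data/retail_db/orders/part-00000 /user/${USER}/retail_db/orders

# Raise the replication factor of the existing file to 3
hdfs dfs -setrep 3 /user/${USER}/retail_db/orders/part-00000
```

setrep asks the NameNode to schedule additional replicas (or delete surplus ones) in the background.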
%%sh
hdfs dfs -ls -h /public/retail_db/orders
Found 1 items
-rw-r--r-- 2 hdfs hdfs 2.9 M 2020-07-14 01:35 /public/retail_db/orders/part-00000
%%sh
hdfs fsck /public/retail_db/orders/part-00000 \
-files \
-blocks \
-locations
FSCK started by itversity (auth:SIMPLE) from /172.16.1.114 for path /public/retail_db/orders/part-00000 at Wed Jan 27 17:16:31 EST 2021
/public/retail_db/orders/part-00000 2999944 bytes, 1 block(s): OK
0. BP-292116404-172.16.1.101-1479167821718:blk_1110719773_36998835 len=2999944 repl=2 [DatanodeInfoWithStorage[172.16.1.102:50010,DS-b0f1636e-fd08-4ddb-bba9-9df8868dfb5d,DISK], DatanodeInfoWithStorage[172.16.1.107:50010,DS-a12c4ae3-3f6a-42fc-83ff-7779a9fc0482,DISK]]
Status: HEALTHY
Total size: 2999944 B
Total dirs: 0
Total files: 1
Total symlinks: 0
Total blocks (validated): 1 (avg. block size 2999944 B)
Minimally replicated blocks: 1 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 2
Average block replication: 2.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 5
Number of racks: 1
FSCK ended at Wed Jan 27 17:16:31 EST 2021 in 0 milliseconds
The filesystem under path '/public/retail_db/orders/part-00000' is HEALTHY
Connecting to namenode via http://172.16.1.101:50070/fsck?ugi=itversity&files=1&blocks=1&locations=1&path=%2Fpublic%2Fretail_db%2Forders%2Fpart-00000
As part of our lab cluster we maintain 2 copies of each block.
In production implementations, typically we have 3 copies with rack awareness enabled.
The default replication factor is 3 and it is set as part of hdfs-site.xml, under the property name dfs.replication. In our case we have overridden it to 2 to save storage.
If a file is smaller than the default block size (128 MB), it will have just one block, sized to the file.
With a replication factor of n, data stays available even if up to n - 1 of the nodes holding a block's replicas go down.
For example, with a replication factor of 3, files remain readable even if 2 nodes in the cluster go down.
Replication thus protects against hardware failures of individual hosts.
In production, we typically also configure Rack Awareness, which gives us much better reliability.
%%sh
grep -B 1 -A 3 replication /etc/hadoop/conf/hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.replication.max</name>
<value>50</value>
</property>
Let us determine the overall size occupied by /data/retail_db/orders/part-00000 when it is copied to HDFS. It occupies 5.8 MB of storage in HDFS (2.9 MB x 2, as our replication factor is 2).
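The arithmetic is straightforward: physical storage equals the logical file size multiplied by the replication factor. Using the exact byte count that fsck reported earlier:

```shell
# Physical storage = logical size x replication factor
FILE_SIZE=2999944    # bytes, from the fsck output above
REPLICATION=2
echo $(( FILE_SIZE * REPLICATION ))    # 5999888 bytes, i.e. ~5.7 MiB
```

The 5.8 MB figure quoted above simply doubles the rounded 2.9 M shown by ls.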
%%sh
ls -lhtr /data/retail_db/orders/part-00000
-rw-r--r-- 1 root root 2.9M Feb 20 2017 /data/retail_db/orders/part-00000
%%sh
hdfs dfs -help stat
-stat [format] <path> ... :
Print statistics about the file/directory at <path>
in the specified format. Format accepts filesize in
blocks (%b), type (%F), group name of owner (%g),
name (%n), block size (%o), replication (%r), user name
of owner (%u), modification date (%y, %Y).
%y shows UTC date as "yyyy-MM-dd HH:mm:ss" and
%Y shows milliseconds since January 1, 1970 UTC.
If the format is not specified, %y is used by default.
%%sh
hdfs dfs -stat %r /user/${USER}/retail_db/orders/part-00000
2
%%sh
hdfs dfs -stat %o /user/${USER}/retail_db/orders/part-00000
134217728
%%sh
hdfs dfs -stat %b /user/${USER}/retail_db/orders/part-00000
2999944
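The three stat calls above can also be combined into a single invocation by passing several format specifiers in one format string:

```shell
# Replication factor, block size and file size in one call
hdfs dfs -stat "%r %o %b" /user/${USER}/retail_db/orders/part-00000
```

On this cluster it should print 2 134217728 2999944, matching the three separate outputs above.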
Let’s review yelp_academic_dataset_user.json. It is 2.4 GB in size and occupies 4.8 GB of storage in HDFS, as our replication factor is 2.
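We can sanity-check the block count before running fsck: the number of blocks is the file size divided by the block size, rounded up, and the final block holds whatever remains. Using the exact sizes that fsck reports for this file:

```shell
# Number of HDFS blocks = ceil(file size / block size)
FILE_SIZE=2485747393     # bytes
BLOCK_SIZE=134217728     # 128 MB
BLOCKS=$(( (FILE_SIZE + BLOCK_SIZE - 1) / BLOCK_SIZE ))
echo "$BLOCKS"                                      # 19

# The final block only holds the remainder
echo $(( FILE_SIZE - (BLOCKS - 1) * BLOCK_SIZE ))   # 69828289
```

These match the 19 block(s) and the len of block 18 in the fsck output below.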
%%sh
ls -lhtr /data/yelp-dataset-json/yelp_academic_dataset_user.json
-rwxr-xr-x 1 training training 2.4G Feb 5 2019 /data/yelp-dataset-json/yelp_academic_dataset_user.json
We can validate the properties of the file using the stat command. The file is available in HDFS under /public/yelp-dataset-json/yelp_academic_dataset_user.json.
%%sh
hdfs fsck /public/yelp-dataset-json/yelp_academic_dataset_user.json \
-files \
-blocks \
-locations
FSCK started by itversity (auth:SIMPLE) from /172.16.1.114 for path /public/yelp-dataset-json/yelp_academic_dataset_user.json at Wed Jan 27 17:29:23 EST 2021
/public/yelp-dataset-json/yelp_academic_dataset_user.json 2485747393 bytes, 19 block(s): OK
0. BP-292116404-172.16.1.101-1479167821718:blk_1101225469_27499779 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.107:50010,DS-a12c4ae3-3f6a-42fc-83ff-7779a9fc0482,DISK], DatanodeInfoWithStorage[172.16.1.108:50010,DS-698dde50-a336-4e00-bc8f-a9e1a5cc76f4,DISK]]
1. BP-292116404-172.16.1.101-1479167821718:blk_1101225470_27499780 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.103:50010,DS-7fb58858-abe9-4a52-9b75-755d849a897b,DISK], DatanodeInfoWithStorage[172.16.1.108:50010,DS-736614f7-27de-46b8-987f-d669be6a32a3,DISK]]
2. BP-292116404-172.16.1.101-1479167821718:blk_1101225471_27499781 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.102:50010,DS-b0f1636e-fd08-4ddb-bba9-9df8868dfb5d,DISK], DatanodeInfoWithStorage[172.16.1.108:50010,DS-698dde50-a336-4e00-bc8f-a9e1a5cc76f4,DISK]]
3. BP-292116404-172.16.1.101-1479167821718:blk_1101225472_27499782 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.107:50010,DS-6679d10e-378c-4897-8c0e-250aa1af790a,DISK]]
4. BP-292116404-172.16.1.101-1479167821718:blk_1101225473_27499783 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-98fec5a6-72a9-4590-99cc-cee3a51f4dd5,DISK], DatanodeInfoWithStorage[172.16.1.102:50010,DS-1edb1d35-81bf-471b-be04-11d973e2a832,DISK]]
5. BP-292116404-172.16.1.101-1479167821718:blk_1101225474_27499784 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.102:50010,DS-b0f1636e-fd08-4ddb-bba9-9df8868dfb5d,DISK]]
6. BP-292116404-172.16.1.101-1479167821718:blk_1101225475_27499785 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-98fec5a6-72a9-4590-99cc-cee3a51f4dd5,DISK], DatanodeInfoWithStorage[172.16.1.103:50010,DS-1f4edfab-2926-45f9-a37c-ae9d1f542680,DISK]]
7. BP-292116404-172.16.1.101-1479167821718:blk_1101225476_27499786 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.103:50010,DS-7fb58858-abe9-4a52-9b75-755d849a897b,DISK], DatanodeInfoWithStorage[172.16.1.108:50010,DS-736614f7-27de-46b8-987f-d669be6a32a3,DISK]]
8. BP-292116404-172.16.1.101-1479167821718:blk_1101225477_27499787 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.102:50010,DS-1edb1d35-81bf-471b-be04-11d973e2a832,DISK], DatanodeInfoWithStorage[172.16.1.108:50010,DS-698dde50-a336-4e00-bc8f-a9e1a5cc76f4,DISK]]
9. BP-292116404-172.16.1.101-1479167821718:blk_1101225478_27499788 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.102:50010,DS-b0f1636e-fd08-4ddb-bba9-9df8868dfb5d,DISK]]
10. BP-292116404-172.16.1.101-1479167821718:blk_1101225479_27499789 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-98fec5a6-72a9-4590-99cc-cee3a51f4dd5,DISK], DatanodeInfoWithStorage[172.16.1.103:50010,DS-1f4edfab-2926-45f9-a37c-ae9d1f542680,DISK]]
11. BP-292116404-172.16.1.101-1479167821718:blk_1101225480_27499790 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.107:50010,DS-a12c4ae3-3f6a-42fc-83ff-7779a9fc0482,DISK], DatanodeInfoWithStorage[172.16.1.103:50010,DS-7fb58858-abe9-4a52-9b75-755d849a897b,DISK]]
12. BP-292116404-172.16.1.101-1479167821718:blk_1101225481_27499791 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.107:50010,DS-6679d10e-378c-4897-8c0e-250aa1af790a,DISK], DatanodeInfoWithStorage[172.16.1.108:50010,DS-736614f7-27de-46b8-987f-d669be6a32a3,DISK]]
13. BP-292116404-172.16.1.101-1479167821718:blk_1101225482_27499792 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.107:50010,DS-a12c4ae3-3f6a-42fc-83ff-7779a9fc0482,DISK], DatanodeInfoWithStorage[172.16.1.103:50010,DS-1f4edfab-2926-45f9-a37c-ae9d1f542680,DISK]]
14. BP-292116404-172.16.1.101-1479167821718:blk_1101225483_27499793 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.102:50010,DS-1edb1d35-81bf-471b-be04-11d973e2a832,DISK], DatanodeInfoWithStorage[172.16.1.108:50010,DS-698dde50-a336-4e00-bc8f-a9e1a5cc76f4,DISK]]
15. BP-292116404-172.16.1.101-1479167821718:blk_1101225484_27499794 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.102:50010,DS-b0f1636e-fd08-4ddb-bba9-9df8868dfb5d,DISK], DatanodeInfoWithStorage[172.16.1.108:50010,DS-736614f7-27de-46b8-987f-d669be6a32a3,DISK]]
16. BP-292116404-172.16.1.101-1479167821718:blk_1101225485_27499795 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.107:50010,DS-6679d10e-378c-4897-8c0e-250aa1af790a,DISK], DatanodeInfoWithStorage[172.16.1.108:50010,DS-698dde50-a336-4e00-bc8f-a9e1a5cc76f4,DISK]]
17. BP-292116404-172.16.1.101-1479167821718:blk_1101225486_27499796 len=134217728 repl=2 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-f4667aac-0f2c-463c-9584-d625928b9af5,DISK], DatanodeInfoWithStorage[172.16.1.107:50010,DS-a12c4ae3-3f6a-42fc-83ff-7779a9fc0482,DISK]]
18. BP-292116404-172.16.1.101-1479167821718:blk_1101225487_27499797 len=69828289 repl=2 [DatanodeInfoWithStorage[172.16.1.104:50010,DS-98fec5a6-72a9-4590-99cc-cee3a51f4dd5,DISK], DatanodeInfoWithStorage[172.16.1.107:50010,DS-6679d10e-378c-4897-8c0e-250aa1af790a,DISK]]
Status: HEALTHY
Total size: 2485747393 B
Total dirs: 0
Total files: 1
Total symlinks: 0
Total blocks (validated): 19 (avg. block size 130828810 B)
Minimally replicated blocks: 19 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 2
Average block replication: 2.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 5
Number of racks: 1
FSCK ended at Wed Jan 27 17:29:23 EST 2021 in 1 milliseconds
The filesystem under path '/public/yelp-dataset-json/yelp_academic_dataset_user.json' is HEALTHY
Connecting to namenode via http://172.16.1.101:50070/fsck?ugi=itversity&files=1&blocks=1&locations=1&path=%2Fpublic%2Fyelp-dataset-json%2Fyelp_academic_dataset_user.json
%%sh
hdfs dfs -stat %r /public/yelp-dataset-json/yelp_academic_dataset_user.json
2
%%sh
hdfs dfs -stat %o /public/yelp-dataset-json/yelp_academic_dataset_user.json
134217728
%%sh
hdfs dfs -stat %b /public/yelp-dataset-json/yelp_academic_dataset_user.json
2485747393
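Newer Hadoop releases also print the replicated disk usage next to the logical size in hdfs dfs -du, which is a quicker way to see the storage overhead; a sketch:

```shell
# First column: logical size; second column (on newer releases): disk
# space consumed across all replicas (logical size x replication factor)
hdfs dfs -du -h /public/yelp-dataset-json/yelp_academic_dataset_user.json
```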