HDFS Configuration¶
Defaults¶
Several methods are available for configuring HDFS3.
The simplest is to load values from the core-site.xml and hdfs-site.xml files.
HDFS3 searches typical locations and reads default configuration parameters from there.
The file locations may also be specified with the environment variables HADOOP_CONF_DIR,
which is the directory containing the XML files; HADOOP_INSTALL, in which case the
files are expected in the subdirectory hadoop/conf/; or LIBHDFS3_CONF, which should
point explicitly to the hdfs-site.xml file you wish to use.
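As a sketch, these environment variables can be set from Python before hdfs3 is first used; the paths below are illustrative assumptions, not defaults your system necessarily has:

```python
import os

# Directory containing core-site.xml and hdfs-site.xml (illustrative path):
os.environ['HADOOP_CONF_DIR'] = '/etc/hadoop/conf'

# Alternatively, point at a Hadoop installation root; the XML files are
# then expected under hadoop/conf/ inside it:
# os.environ['HADOOP_INSTALL'] = '/opt/hadoop'

# Or name the hdfs-site.xml file explicitly:
# os.environ['LIBHDFS3_CONF'] = '/etc/hadoop/conf/hdfs-site.xml'
```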
It is also possible to pass parameters to HDFS3 when instantiating the file system. You
can either provide individual common overrides (e.g., host='myhost') or provide
a whole configuration as a dictionary (pars={}) with the same key names as typically
contained in the XML config files. These parameters take precedence over any loaded
from files, or you can disable loading the default configuration entirely with autoconf=False.
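A minimal sketch of these options; the configuration key is illustrative, the hostname and port are placeholders, and the connection lines are commented out since they require a reachable namenode:

```python
# Configuration keys use the same names as in the XML config files;
# this particular key is just an example override:
pars = {'dfs.client.use.datanode.hostname': 'true'}

# from hdfs3 import HDFileSystem
# fs = HDFileSystem(host='myhost', port=8020)                  # individual overrides
# fs = HDFileSystem(host='myhost', port=8020, pars=pars)       # dict of parameters
# fs = HDFileSystem(host='myhost', port=8020, autoconf=False)  # ignore on-disk config
```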
The special environment variable LIBHDFS3_CONF will be set automatically when the
config files are parsed, if possible. Since the library is only loaded upon the first
instantiation of an HDFileSystem, you still have the option to change its value in os.environ.
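Sketch of this: because the underlying library is loaded lazily, LIBHDFS3_CONF can still be changed in os.environ up until the first HDFileSystem is created (the path below is an assumption):

```python
import os

# Must happen before the first HDFileSystem is instantiated, since the
# underlying libhdfs3 library reads this variable when it is loaded:
os.environ['LIBHDFS3_CONF'] = '/etc/hadoop/conf/hdfs-site.xml'

# from hdfs3 import HDFileSystem
# fs = HDFileSystem()   # picks up the file named above
```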
Short-circuit reads in HDFS¶
Typically in HDFS, all data reads go through the datanode. Alternatively, a process that runs on the same node as the data can bypass or short-circuit the communication path through the datanode and instead read directly from a file.
HDFS and hdfs3 can be configured for short-circuit reads. The easiest method is
to edit the hdfs-site.xml file whose location you specify as above.
- Configure the appropriate settings in hdfs-site.xml on all of the HDFS nodes:
<configuration>
<property>
<name>dfs.client.read.shortcircuit</name>
<value>true</value>
</property>
<property>
<name>dfs.domain.socket.path</name>
<value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>
</configuration>
The above configuration changes should allow for short-circuit reads. If you
continue to receive warnings to "retry the same node but disable read
shortcircuit feature", check the above settings. Note that HDFS reads
should still function despite the warning, but performance might be impacted.
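Alternatively, the same settings can be passed directly as parameters when instantiating the file system, as described in the Defaults section above. A minimal sketch; the host and port are placeholders, and the connection line is commented out since it needs a live cluster:

```python
# The same settings as the hdfs-site.xml snippet above, as a pars dict:
pars = {
    'dfs.client.read.shortcircuit': 'true',
    'dfs.domain.socket.path': '/var/lib/hadoop-hdfs/dn_socket',
}

# from hdfs3 import HDFileSystem
# fs = HDFileSystem(host='myhost', port=8020, pars=pars)
```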
For more information about configuring short-circuit reads, refer to the HDFS Short-Circuit Local Reads documentation.
High-availability mode¶
Although HDFS is resilient to failure of data-nodes, the name-node is a single repository of metadata for the system, and so a single point of failure. High-availability (HA) involves configuring fall-back name-nodes which can take over in the event of failure. A good description can be found in the HDFS high-availability documentation.
In the case of libhdfs3, the library used by hdfs3, the configuration required for HA can be passed to the client directly in Python code, or included in configuration files, as with any other configuration options.
In Python code, this could look like the following:
host = "nameservice1"
conf = {"dfs.nameservices": "nameservice1",
        "dfs.ha.namenodes.nameservice1": "namenode113,namenode188",
        "dfs.namenode.rpc-address.nameservice1.namenode113": "hostname_of_server1:8020",
        "dfs.namenode.rpc-address.nameservice1.namenode188": "hostname_of_server2:8020",
        "dfs.namenode.http-address.nameservice1.namenode113": "hostname_of_server1:50070",
        "dfs.namenode.http-address.nameservice1.namenode188": "hostname_of_server2:50070",
        "hadoop.security.authentication": "kerberos"
}
fs = HDFileSystem(host=host, pars=conf)
Note that no port is specified (this requires hdfs3 version 0.1.3); its value should be None.