hdfs3

This project is no longer under active development.

Pyarrow's JNI-based HDFS interface is mature and stable. It also has fewer problems with configuration and various security settings, and it does not require the complex build process of libhdfs3. We therefore recommend that users who have trouble with hdfs3 try pyarrow instead.
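For comparison, connecting and reading through pyarrow looks roughly like the following. This is a minimal sketch assuming a recent pyarrow release with the pyarrow.fs module; the host, port, and paths are illustrative.

>>> from pyarrow import fs
>>> hdfs = fs.HadoopFileSystem(host='localhost', port=8020)
>>> hdfs.get_file_info(fs.FileSelector('/user/data'))   # roughly equivalent to ls
>>> with hdfs.open_input_stream('/user/data/remote-file.txt') as f:
...     data = f.read()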

Introduction

Use HDFS natively from Python.

The Hadoop File System (HDFS) is a widely deployed, distributed, data-local file system written in Java. This file system backs most clusters running Hadoop and Spark.

Pivotal produced libhdfs3, an alternative native C/C++ HDFS client that interacts with HDFS without the JVM, exposing first-class support to non-JVM languages like Python.

This library, hdfs3, is a lightweight Python wrapper around the C/C++ libhdfs3 library. It provides both direct access to libhdfs3 from Python as well as a typical Pythonic interface.

>>> from hdfs3 import HDFileSystem
>>> hdfs = HDFileSystem(host='localhost', port=8020)
>>> hdfs.ls('/user/data')
>>> hdfs.put('local-file.txt', '/user/data/remote-file.txt')
>>> hdfs.cp('/user/data/file.txt', '/user2/data')
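The HDFileSystem object also exposes other familiar filesystem operations, such as mkdir, exists, glob, and get; the calls below are a brief illustration with hypothetical paths.

>>> hdfs.mkdir('/user/data/new-directory')                     # create a directory
>>> hdfs.exists('/user/data/remote-file.txt')                  # check for a file
>>> hdfs.glob('/user/data/*.txt')                              # pattern matching
>>> hdfs.get('/user/data/remote-file.txt', 'local-copy.txt')   # download to local disk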

HDFS3 file objects comply with the standard Python file interface. This enables interaction with the broader ecosystem of PyData projects.

>>> with hdfs.open('/user/data/file.txt') as f:
...     data = f.read(1000000)

>>> import pandas
>>> with hdfs.open('/user/data/file.csv.gz') as f:
...     df = pandas.read_csv(f, compression='gzip', nrows=1000)
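Files opened in binary write mode ('wb') can be written to in the same way; a brief sketch with an illustrative path:

>>> with hdfs.open('/user/data/new-file.txt', 'wb') as f:
...     f.write(b'Hello, HDFS!')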

Motivation

We chose to use an alternative C/C++/Python HDFS client rather than the default JVM client for the following reasons:

  • Convenience: Interactions between Java libraries and native (C/C++/Python) libraries can be cumbersome. Using a native library from Python smooths the experience of development, maintenance, and debugging.
  • Performance: Native libraries like libhdfs3 do not suffer from long JVM startup times, which improves interactive use.