Array data types¶
Zarr's Data Type Model¶
Zarr is designed for interoperability with NumPy, so if you are familiar with NumPy or any other N-dimensional array library, Zarr's model for array data types should seem familiar. However, Zarr data types have some unique features that are described in this document.
Zarr arrays operate under an essential design constraint: unlike NumPy arrays, Zarr arrays are designed to be stored and accessed by other Zarr implementations. This means that, among other things, Zarr data types must be serializable to metadata documents in accordance with the Zarr specifications, which adds some unique aspects to the Zarr data type model.
The following sections explain Zarr's data type model in greater detail and demonstrate the Zarr Python APIs for working with Zarr data types.
Array Data Types¶
Every Zarr array has a data type, which defines the meaning of the array's elements. An array's data type is encoded in the JSON metadata for the array. This means that the data type of an array must be JSON-serializable.
In Zarr V2, the data type of an array is stored in the dtype field in array metadata. Zarr V3 changed the name of this field to data_type and also defined new rules for the values that can be assigned to the data_type field. For example, in Zarr V2, the boolean array data type was represented in array metadata as the string "|b1". In Zarr V3, the same type is represented as the string "bool".
Scalars¶
Zarr also specifies how array elements, i.e., scalars, are encoded in array metadata. This is necessary because Zarr uses a field in array metadata to define a default value for chunks that are not stored. This field, called fill_value in both Zarr V2 and Zarr V3 metadata documents, contains a JSON value that can be decoded to a scalar value compatible with the array's data type.
For the boolean data type, the scalar encoding is simple—booleans are natively supported by JSON, so Zarr saves booleans as JSON booleans. Other scalars, like floats or raw bytes, have more elaborate encoding schemes, and in some cases, this scheme depends on the Zarr format version.
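For example, here is a minimal sketch showing where the encoded scalar ends up (this assumes the in-memory dict store and the fill_value keyword of zarr.create_array):

import json
import zarr

store = {}
z = zarr.create_array(store=store, shape=(1,), dtype='bool', fill_value=True, zarr_format=3)
# the fill value is stored as a JSON boolean in the array metadata document
print(json.loads(store['zarr.json'].to_bytes())["fill_value"])
# True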
Data Types in Zarr Version 2¶
Version 2 of the Zarr format defined its data types relative to NumPy's data types, and added a few non-NumPy data types as well. With one exception (structured data types), the Zarr V2 JSON identifier for a data type is just the NumPy str attribute of that data type:
import zarr
import numpy as np
import json
store = {}
np_dtype = np.dtype('int64')
print(np_dtype.str)
z = zarr.create_array(store=store, shape=(1,), dtype=np_dtype, zarr_format=2)
dtype_meta = json.loads(store['.zarray'].to_bytes())["dtype"]
print(dtype_meta)
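On a little-endian machine, both print statements output '<i8': the NumPy str attribute and the value stored in the dtype field of the Zarr V2 metadata are the same string.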
Note
The < character in the data type metadata encodes the endianness, or "byte order," of the data type. As per the NumPy model, in Zarr version 2 each data type has an endianness where applicable. However, Zarr version 3 data types do not store endianness information.
There are two special cases to consider: "structured" data types, and "object" data types.
Structured Data Type¶
NumPy allows the construction of so-called "structured" data types, which are ordered collections of named fields, where each field is itself a distinct NumPy data type. See the NumPy documentation here.
Crucially, NumPy does not use a special data type for structured data types; instead, NumPy implements structured data types as an optional feature of the so-called "Void" data type, which models arbitrary fixed-size byte strings. The str attribute of a regular NumPy void data type is the same as the str of a NumPy structured data type. This means that the str attribute does not convey information about the fields contained in a structured data type.
For these reasons, Zarr V2 uses a special data type encoding for structured data types.
They are stored in JSON as lists of pairs, where the first element is a string, and the second element is a Zarr V2 data type specification. This representation supports recursion.
For example:
store = {}
np_dtype = np.dtype([('field_a', '>i2'), ('field_b', [('subfield_c', '>f4'), ('subfield_d', 'i2')])])
print(np_dtype.str)
z = zarr.create_array(store=store, shape=(1,), dtype=np_dtype, zarr_format=2)
dtype_meta = json.loads(store['.zarray'].to_bytes())["dtype"]
print(dtype_meta)
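On a typical machine this prints '|V8' for the NumPy str attribute, which says nothing about the fields, followed by the nested list stored in the dtype metadata field, roughly: [['field_a', '>i2'], ['field_b', [['subfield_c', '>f4'], ['subfield_d', '<i2']]]].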
Object Data Type¶
The NumPy "object" type is essentially an array of references to arbitrary Python objects. It can model arrays of variable-length UTF-8 strings, arrays of variable-length byte strings, or even arrays of variable-length arrays, each with a distinct data type. This makes the "object" data type expressive, but also complicated to store.
Zarr Python cannot persistently store references to arbitrary Python objects. But if each of those Python objects has a consistent type, then we can use a special encoding procedure to store the array. This is how Zarr Python stores variable-length UTF-8 strings, or variable-length byte strings.
Although these are separate data types in this library, they are both "object" arrays in NumPy, which means they have the same Zarr V2 string representation: "|O". So for Zarr V2 we have to disambiguate different "object" data type arrays on the basis of their encoding procedure, i.e., the codecs declared in the filters and compressor attributes of array metadata.
If an array with data type "object" used the "vlen-utf8" codec, then it was interpreted as an array of variable-length strings. If an array with data type "object" used the "vlen-bytes" codec, then it was interpreted as an array of variable-length byte strings. This all means that the dtype field alone does not fully specify a data type in Zarr V2. The name of the object codec used, if one was used, is also required. Although this fact can be ignored for many simple numeric data types, any comprehensive approach to Zarr V2 data types must either reject the "object" data types or include the "object codec" identifier in the JSON form of the basic data type model.
Data Types in Zarr Version 3¶
The NumPy-based Zarr V2 data type representation was effective for simple data types but struggled with more complex data types, like "object" and "structured" data types. To address these limitations, Zarr V3 introduced several key changes to how data types are represented:
- Instead of copying NumPy character codes, Zarr V3 defines an identifier for each data type. The basic data types are identified by strings like "int8", "int16", etc., and data types that require a configuration can be identified by a JSON object. For example, a JSON object can declare a datetime data type (see the example after this list).
- Zarr V3 data types do not have endianness. This is a departure from Zarr V2, where multi-byte data types are defined with endianness information. Instead, Zarr V3 requires that the endianness of encoded array chunks is specified in the codecs attribute of array metadata. The Zarr V3 specification leaves the in-memory endianness of decoded array chunks as an implementation detail.
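The JSON object referenced in the first item above might look like the following (a sketch; the numpy.datetime64 name and the configuration keys follow Zarr Python's DateTime64 data type and are an assumption here):

{
    "name": "numpy.datetime64",
    "configuration": {
        "unit": "s",
        "scale_factor": 10
    }
}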
For more about data types in Zarr V3, see the V3 specification.
Data Types in Zarr Python¶
The two Zarr formats that Zarr Python supports specify data types in different ways: data types in Zarr version 2 are encoded as NumPy-compatible strings (or lists, in the case of structured data types), while data types in Zarr V3 are encoded as either strings or JSON objects. Zarr V3 data types do not have any associated endianness information, unlike Zarr V2 data types.
Zarr Python needs to support both Zarr V2 and V3, which means we need to abstract over these differences. We do this with an abstract Zarr data type class, ZDType, which provides Zarr V2 and Zarr V3 compatibility routines for "native" data types.
In this context, a "native" data type is a Python class, typically defined in another library, that models an array's data type. For example, numpy.dtypes.UInt8DType is a native data type defined in NumPy. Zarr Python wraps the NumPy uint8 with a ZDType instance called UInt8.
As of this writing, the only native data types Zarr Python supports are NumPy data types. We could avoid the "native data type" jargon and just say "NumPy data type," but we do not want to rule out the possibility of using non-NumPy array backends in the future.
Each data type supported by Zarr Python is modeled by a ZDType subclass, which provides an API for the following operations:
- Encoding and decoding a native data type
- Encoding and decoding a data type to and from Zarr V2 and Zarr V3 array metadata
- Encoding and decoding a scalar value to and from Zarr V2 and Zarr V3 array metadata
- Casting a Python object to a scalar value consistent with the data type
List of data types¶
The following section lists the data types built in to Zarr Python. With a few exceptions, Zarr Python supports nearly all of the data types in NumPy. If you need a data type that is not listed here, it's possible to create it yourself: see Adding New Data Types.
Boolean¶
Integral¶
- Signed 8-bit integer
- Signed 16-bit integer
- Signed 32-bit integer
- Signed 64-bit integer
- Unsigned 8-bit integer
- Unsigned 16-bit integer
- Unsigned 32-bit integer
- Unsigned 64-bit integer
Floating-point¶
- 16-bit floating-point
- 32-bit floating-point
- 64-bit floating-point
- 64-bit complex floating-point
- 128-bit complex floating-point
String¶
Bytes¶
Temporal¶
Struct-like¶
Example Usage¶
This section demonstrates the basic usage of Zarr data types.
Create a ZDType from a native data type:
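A sketch using the 8-bit signed integer data type (this assumes the Int8 class is exported from zarr.dtype, like the DateTime64 class used later in this document; the printed output shown is approximate):

import numpy as np
from zarr.dtype import Int8

int8 = Int8.from_native_dtype(np.dtype('int8'))
print(int8)
# Int8()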
Convert back to a native data type:
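Continuing the sketch above:

print(int8.to_native_dtype())
# int8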
Get the default scalar value for the data type:
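Continuing the sketch:

print(int8.default_scalar())
# 0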
Serialize to JSON for Zarr V2:
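Continuing the sketch (the exact dict shown is illustrative):

print(int8.to_json(zarr_format=2))
# {'name': '|i1', 'object_codec_id': None}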
Note
The representation returned by to_json(zarr_format=2) is more abstract than the literal contents of Zarr V2 array metadata, because the JSON representation used by the ZDType classes must be distinct across different data types. As noted earlier, Zarr V2 identifies multiple distinct data types with the "object" data type identifier "|O". Extra information is needed to disambiguate these data types from one another. That's the reason for the object_codec_id field you see here.
And for V3:
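Continuing the sketch:

print(int8.to_json(zarr_format=3))
# int8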
Serialize a scalar value to JSON:
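Continuing the sketch (to_json_scalar takes zarr_format as a keyword argument):

print(int8.to_json_scalar(42, zarr_format=3))
# 42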
Deserialize a scalar value from JSON:
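Continuing the sketch:

print(int8.from_json_scalar(42, zarr_format=3))
# 42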
Adding New Data Types¶
Each Zarr data type is a separate Python class that inherits from ZDType. You can define a custom data type by writing your own subclass of ZDType and adding your data type to the data type registry. A complete example of this process is included below.
The source code for this example can be found in the examples/custom_dtype.py file in the Zarr Python project directory.
# /// script
# requires-python = ">=3.11"
# dependencies = [
#   "zarr @ git+https://github.com/zarr-developers/zarr-python.git@main",
#   "ml_dtypes==0.5.1",
#   "pytest==8.4.1"
# ]
# ///
#
"""
Demonstrate how to extend Zarr Python by defining a new data type
"""

import json
import sys
from pathlib import Path
from typing import ClassVar, Literal, Self, TypeGuard, overload

import ml_dtypes  # necessary to add extra dtypes to NumPy
import numpy as np
import pytest

import zarr
from zarr.core.common import JSON, ZarrFormat
from zarr.core.dtype import ZDType, data_type_registry
from zarr.core.dtype.common import (
    DataTypeValidationError,
    DTypeConfig_V2,
    DTypeJSON,
    check_dtype_spec_v2,
)

# This is the int2 array data type
int2_dtype_cls = type(np.dtype("int2"))

# This is the int2 scalar type
int2_scalar_cls = ml_dtypes.int2
class Int2(ZDType[int2_dtype_cls, int2_scalar_cls]):
    """
    This class provides a Zarr compatibility layer around the int2 data type (the ``dtype`` of a
    NumPy array of type int2) and the int2 scalar type (the type of the scalar values inside an
    int2 array).
    """

    # This field is used as the key for the data type in the internal data type registry, and also
    # as the identifier for the data type when serializing the data type to disk for zarr v3
    _zarr_v3_name: ClassVar[Literal["int2"]] = "int2"

    # this field will be used internally
    _zarr_v2_name: ClassVar[Literal["int2"]] = "int2"

    # we bind a class variable to the native data type class so we can create instances of it
    dtype_cls = int2_dtype_cls

    @classmethod
    def from_native_dtype(cls, dtype: np.dtype) -> Self:
        """Create an instance of this ZDType from a native dtype."""
        if cls._check_native_dtype(dtype):
            return cls()
        raise DataTypeValidationError(
            f"Invalid data type: {dtype}. Expected an instance of {cls.dtype_cls}"
        )

    def to_native_dtype(self: Self) -> int2_dtype_cls:
        """Create an int2 dtype instance from this ZDType"""
        return self.dtype_cls()

    @classmethod
    def _check_json_v2(cls, data: DTypeJSON) -> TypeGuard[DTypeConfig_V2[Literal["int2"], None]]:
        """
        Type check for Zarr v2-flavored JSON.

        This will check that the input is a dict like this:

        .. code-block:: json

            {
                "name": "int2",
                "object_codec_id": None
            }

        Note that this representation differs from what the ``dtype`` field looks like in zarr v2
        metadata. Specifically, whatever goes into the ``dtype`` field in metadata is assigned to
        the ``name`` field here.

        See the Zarr docs for more information about the JSON encoding for data types.
        """
        return (
            check_dtype_spec_v2(data) and data["name"] == "int2" and data["object_codec_id"] is None
        )
    @classmethod
    def _check_json_v3(cls, data: DTypeJSON) -> TypeGuard[Literal["int2"]]:
        """
        Type check for Zarr V3-flavored JSON.

        Checks that the input is the string "int2".
        """
        return data == cls._zarr_v3_name

    @classmethod
    def _from_json_v2(cls, data: DTypeJSON) -> Self:
        """
        Create an instance of this ZDType from Zarr V2-flavored JSON.

        This first does a type check on the input, and if that passes we create an instance of the ZDType.
        """
        if cls._check_json_v2(data):
            return cls()
        msg = f"Invalid JSON representation of {cls.__name__}. Got {data!r}, expected the string {cls._zarr_v2_name!r}"
        raise DataTypeValidationError(msg)

    @classmethod
    def _from_json_v3(cls: type[Self], data: DTypeJSON) -> Self:
        """
        Create an instance of this ZDType from Zarr V3-flavored JSON.

        This first does a type check on the input, and if that passes we create an instance of the ZDType.
        """
        if cls._check_json_v3(data):
            return cls()
        msg = f"Invalid JSON representation of {cls.__name__}. Got {data!r}, expected the string {cls._zarr_v3_name!r}"
        raise DataTypeValidationError(msg)
    @overload  # type: ignore[override]
    def to_json(self, zarr_format: Literal[2]) -> DTypeConfig_V2[Literal["int2"], None]: ...

    @overload
    def to_json(self, zarr_format: Literal[3]) -> Literal["int2"]: ...

    def to_json(
        self, zarr_format: ZarrFormat
    ) -> DTypeConfig_V2[Literal["int2"], None] | Literal["int2"]:
        """
        Serialize this ZDType to v2- or v3-flavored JSON

        If the zarr_format is 2, then return a dict like this:

        .. code-block:: json

            {
                "name": "int2",
                "object_codec_id": None
            }

        If the zarr_format is 3, then return the string "int2"
        """
        if zarr_format == 2:
            return {"name": "int2", "object_codec_id": None}
        elif zarr_format == 3:
            return self._zarr_v3_name
        raise ValueError(f"zarr_format must be 2 or 3, got {zarr_format}")  # pragma: no cover

    def _check_scalar(self, data: object) -> TypeGuard[int | ml_dtypes.int2]:
        """
        Check if a python object is a valid int2-compatible scalar

        The strictness of this type check is an implementation degree of freedom.
        You could be strict here, and only accept int2 values, or be open and accept any integer
        or any object and rely on exceptions from the int2 constructor that will be called in
        cast_scalar.
        """
        return isinstance(data, (int, int2_scalar_cls))

    def cast_scalar(self, data: object) -> ml_dtypes.int2:
        """
        Attempt to cast a python object to an int2.

        We first perform a type check to ensure that the input type is appropriate, and if that
        passes we call the int2 scalar constructor.
        """
        if self._check_scalar(data):
            return ml_dtypes.int2(data)
        msg = (
            f"Cannot convert object {data!r} with type {type(data)} to a scalar compatible with the "
            f"data type {self}."
        )
        raise TypeError(msg)

    def default_scalar(self) -> ml_dtypes.int2:
        """
        Get the default scalar value. This will be used when automatically selecting a fill value.
        """
        return ml_dtypes.int2(0)

    def to_json_scalar(self, data: object, *, zarr_format: ZarrFormat) -> int:
        """
        Convert a python object to a JSON representation of an int2 scalar.

        This is necessary for taking user input for the ``fill_value`` attribute in array metadata.

        In this implementation, we optimistically convert the input to an int,
        and then check that it lies in the acceptable range for this data type.
        """
        # We could add a type check here, but we don't need to for this example
        val: int = int(data)  # type: ignore[call-overload]
        if val not in (-2, -1, 0, 1):
            raise ValueError("Invalid value. Expected -2, -1, 0, or 1.")
        return val

    def from_json_scalar(self, data: JSON, *, zarr_format: ZarrFormat) -> ml_dtypes.int2:
        """
        Read a JSON-serializable value as an int2 scalar.

        We first perform a type check to ensure that the JSON value is well-formed, then call the
        int2 scalar constructor.

        The base definition of this method requires that it take a zarr_format parameter because
        other data types serialize scalars differently in zarr v2 and v3, but we don't use this here.
        """
        if self._check_scalar(data):
            return ml_dtypes.int2(data)
        raise TypeError(f"Invalid type: {data}. Expected an int.")
# after defining dtype class, it must be registered with the data type registry so zarr can use it
data_type_registry.register(Int2._zarr_v3_name, Int2)


# this parametrized function will create arrays in zarr v2 and v3 using our new data type
@pytest.mark.parametrize("zarr_format", [2, 3])
def test_custom_dtype(tmp_path: Path, zarr_format: Literal[2, 3]) -> None:
    # create array and write values
    z_w = zarr.create_array(
        store=tmp_path, shape=(4,), dtype="int2", zarr_format=zarr_format, compressors=None
    )
    z_w[:] = [-1, -2, 0, 1]

    # open the array
    z_r = zarr.open_array(tmp_path, mode="r")
    print(z_r.info_complete())

    # look at the array metadata
    if zarr_format == 2:
        meta_file = tmp_path / ".zarray"
    else:
        meta_file = tmp_path / "zarr.json"
    print(json.dumps(json.loads(meta_file.read_text()), indent=2))


if __name__ == "__main__":
    # Run the example with printed output, and a dummy pytest configuration file specified.
    # Without the dummy configuration file, at test time pytest will attempt to use the
    # configuration file in the project root, which will error because Zarr is using some
    # plugins that are not installed in this example.
    sys.exit(pytest.main(["-s", __file__, f"-c {__file__}"]))
Data Type Resolution¶
Although Zarr Python uses a different data type model from NumPy, you can still define a Zarr array with a NumPy data type object:
from zarr import create_array
import numpy as np
a = create_array({}, shape=(10,), dtype=np.dtype('int'))
print(a)
Or a string representation of a NumPy data type:
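For example (a sketch; any NumPy-compatible data type string should work):

a = create_array({}, shape=(10,), dtype='int64')
print(a)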
The Array object presents itself like a NumPy array, including exposing a NumPy data type as its dtype attribute:
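Continuing the example above (on most platforms np.dtype('int') resolves to a 64-bit integer):

print(a.dtype)
# int64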
But if we inspect the metadata for the array, we can see the Zarr data type object:
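One way to do this (a sketch; it assumes the default Zarr format 3 and the data_type attribute of the array metadata object, which may differ across versions):

print(a.metadata.data_type)
# Int64(endianness='little')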
This example illustrates a general problem Zarr Python has to solve: how can we allow users to specify a data type as a string or a NumPy dtype object, and produce the right Zarr data type from that input? We call this process "data type resolution." Zarr Python also performs data type resolution when reading stored arrays, although in this case the input is a JSON value instead of a NumPy data type.
For simple data types like int, the solution could be extremely simple: just maintain a lookup table that maps a NumPy data type to the Zarr data type equivalent. But not all data types are so simple. Consider this case:
from zarr import create_array
import warnings
import numpy as np
warnings.simplefilter("ignore", category=FutureWarning)
a = create_array({}, shape=(10,), dtype=[('a', 'f8'), ('b', 'i8')])
print(a.dtype)  # this is the NumPy data type
# [('a', '<f8'), ('b', '<i8')]
# The Zarr data type stored in the array metadata is the corresponding Structured instance:
# Structured(fields=(('a', Float64(endianness='little')), ('b', Int64(endianness='little'))))
In this example, we created a NumPy structured data type. This data type is a container that can hold any NumPy data type, which makes it recursive. It is not possible to make a lookup table that relates all NumPy structured data types to their Zarr equivalents, as there is a nearly unbounded number of different structured data types. So instead of a static lookup table, Zarr Python relies on a dynamic approach to data type resolution.
Zarr Python defines a collection of Zarr data types. This collection, called a "data type registry," is essentially a dictionary where the keys are strings (a canonical name for each data type), and the values are the data type classes themselves. Dynamic data type resolution entails iterating over these data type classes, invoking that class' from_native_dtype method, and returning a concrete data type instance if and only if exactly one of those constructor invocations is successful.
In plain language, we take some user input, like a NumPy data type, offer it to all the known data type classes, and return an instance of the one data type class that can accept that user input.
We want to avoid a situation where the same native data type matches multiple Zarr data types; that is, a NumPy data type should uniquely specify a single Zarr data type. But data type resolution is dynamic, so it's not possible to statically guarantee this uniqueness constraint. Therefore, we attempt data type resolution against every data type class, and if, for some reason, a native data type matches multiple Zarr data types, we treat this as an error and raise an exception.
If you have a NumPy data type and you want to get the corresponding ZDType instance, you can use the parse_dtype function, which will use the dynamic resolution described above. parse_dtype handles a range of input types:
- NumPy data types:
import numpy as np
from zarr.dtype import parse_dtype
my_dtype = np.dtype('>M8[10s]')
print(parse_dtype(my_dtype, zarr_format=2))
- NumPy data type-compatible strings:
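A sketch, assuming parse_dtype accepts the same strings that np.dtype accepts:

print(parse_dtype('>M8[10s]', zarr_format=2))
# DateTime64(endianness='big', scale_factor=10, unit='s')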
- ZDType instances:
from zarr.dtype import DateTime64
zdt = DateTime64(endianness='big', scale_factor=10, unit='s')
print(parse_dtype(zdt, zarr_format=2)) # Use a ZDType (this is a no-op)
- Python dictionaries (requires zarr_format=3). These dictionaries must be consistent with the JSON form of the data type:
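A sketch; the numpy.datetime64 name and configuration keys shown here follow Zarr Python's DateTime64 data type and are an assumption:

dt_json = {"name": "numpy.datetime64", "configuration": {"unit": "s", "scale_factor": 10}}
print(parse_dtype(dt_json, zarr_format=3))
# prints a DateTime64 instance equivalent to the JSON above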