Section: ndctl Manual (1)
ndctl-create-namespace - provision or reconfigure a namespace
ndctl create-namespace [<options>]
THEORY OF OPERATION
The capacity of an NVDIMM REGION (contiguous span of persistent memory)
is accessed via one or more NAMESPACE devices. REGION is the Linux term
for what ACPI and UEFI call a DIMM-interleave-set, or a
system-physical-address-range that is striped (by the memory controller)
across one or more memory modules.
The UEFI specification defines the NVDIMM Label Protocol as the
combination of label area access methods and a data format for
provisioning one or more NAMESPACE objects from a REGION. Note that
label support is optional and if Linux does not detect the label
capability it will automatically instantiate a "label-less" namespace
per region. Examples of label-less namespaces are the ones created by
the kernel's memmap=ss!nn command line option (see the nvdimm wiki on
kernel.org), or NVDIMMs without a valid namespace index in their label area.
Label-less namespaces lack many of the features of their label-rich
cousins. For example, their size cannot be modified, and they cannot be
fully destroyed (i.e. their space cannot be reclaimed). A destroy operation
only zeroes any mode-specific metadata. Finally, for create-namespace operations
on label-less namespaces, ndctl bypasses the region capacity availability
checks, and always satisfies the request using the full region capacity.
The only reconfiguration operation supported on a label-less namespace
is changing its mode.
A namespace can be provisioned to operate in one of 4 modes: fsdax,
devdax, sector, and raw. Here are the expected usage models for these modes:
fsdax: Filesystem-DAX mode is the default mode of a namespace
when specifying ndctl create-namespace with no options. It creates
a block device (/dev/pmemX[.Y]) that supports the DAX capabilities
of Linux filesystems (xfs and ext4 to date). DAX removes the page
cache from the I/O path and allows mmap(2) to establish direct
mappings to persistent memory media. The DAX capability enables
workloads / working-sets that would exceed the capacity of the page
cache to scale up to the capacity of persistent memory. Workloads
that fit in page cache or perform bulk data transfers may not see
benefit from DAX. When in doubt, pick this mode.
devdax: Device-DAX mode enables mmap(2) DAX mapping capabilities
similar to those of Filesystem-DAX. However, instead of a block device
that can support a DAX-enabled filesystem, this mode emits a single
character device file (/dev/daxX.Y). Use this mode to assign
persistent memory to a virtual machine, register persistent memory
for RDMA, or when gigantic mappings are needed.
sector: Use this mode to host legacy filesystems that do
not checksum metadata or applications that are not prepared for torn
sectors after a crash. Expected usage for this mode is for small
boot volumes. This mode is compatible with other operating systems.
raw: Raw mode is effectively just a memory disk that does
not support DAX. Typically this indicates a namespace that was
created by tooling or another operating system that did not know how
to create a Linux fsdax or devdax mode namespace. This mode is
compatible with other operating systems, but again, does not support
DAX operation.
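The usage guidance above can be condensed into a small decision helper. This is a minimal sketch only: the function name pick_mode and the usage-model keywords (vm, rdma, gigantic-mappings, legacy-fs, boot-volume) are invented for illustration and are not part of ndctl. Anything not matched falls through to fsdax, mirroring "when in doubt, pick this mode".

```shell
#!/bin/sh
# Hypothetical helper (not part of ndctl): map a coarse usage model to a
# namespace mode, following the usage guidance above.
pick_mode() {
    case $1 in
        vm|rdma|gigantic-mappings)
            echo devdax ;;   # character device /dev/daxX.Y, no filesystem
        legacy-fs|boot-volume)
            echo sector ;;   # sector atomicity for software that needs it
        *)
            echo fsdax ;;    # the default: "when in doubt, pick this mode"
    esac
}

# Example (this line would actually run ndctl, so it is left commented):
# ndctl create-namespace --mode="$(pick_mode vm)"
```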
EXAMPLES
Create a maximally sized pmem namespace in fsdax mode (the default)
ndctl create-namespace
Convert namespace0.0 to sector mode
ndctl create-namespace -f -e namespace0.0 --mode=sector
OPTIONS
-t, --type=
Create a pmem or blk namespace (subject to available
capacity). A pmem namespace supports the dax (direct access)
capability to mmap(2) persistent memory directly into
a process address space. A blk namespace accesses persistent
memory through a block-window aperture. Compared to pmem it
supports a traditional storage error model (EIO on error rather
than a CPU exception on a bad memory access), but it does not
support dax.
"raw": expose the namespace capacity directly with
limitations. Neither a raw pmem namepace nor raw blk
namespace support sector atomicity by default (see "sector"
mode below). A raw pmem namespace may have limited to no dax
support depending the kernel. In other words operations like
direct-I/O targeting a dax buffer may fail for a pmem
namespace in raw mode or indirect through a page-cache buffer.
See "fsdax" and "devdax" mode for dax operation.
"sector": persistent memory, given that it is byte
addressable, does not support sector atomicity. The
problematic aspect of sector tearing is that most applications
do not know they have a atomic sector update dependency. At
least a disk rarely ever tears sectors and if it does it
almost certainly returns a checksum error on access.
Persistent memory devices will always tear and always
silently. Until an application is audited to be robust in the
presence of sector-tearing "safe" mode is recommended. This
imposes some performance overhead and disables the dax
capability. (also known as "safe" or "btt" mode)
"fsdax": A pmem namespace in this mode supports dax
operation with a block-device based filesystem (in previous
ndctl releases this mode was named "memory" mode). This mode
comes at the cost of allocating per-page metadata. The
capacity can be allocated from "System RAM", or from a
reserved portion of "Persistent Memory" (see the --map=
option). NOTE: A filesystem that supports DAX is required
for dax operation. If the raw block device (/dev/pmemX) is
used directly without a filesystem, it will use the page
cache. See "devdax" mode for raw device access that supports
"devdax": The device-dax character device interface is a
statically allocated / raw access analogue of filesystem-dax
(in previous ndctl releases this mode was named "dax" mode).
It allows memory ranges to be mapped without the need for an
intervening filesystem. The device-dax interface is strict,
precise and predictable. Specifically the interface:
Guarantees fault granularity with respect to a given page
size (4K, 2M, or 1G on x86) set at configuration time.
Enforces deterministic behavior by being strict about what
fault scenarios are supported. For example, if a device is
configured with a 2M alignment, an attempt to fault at an
offset that is only 4K aligned will result in SIGBUS.
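The device-dax fault rule above reduces to a divisibility check. The sketch below is a simplified model of that rule, not kernel code; the function name devdax_fault is hypothetical, and both arguments are byte counts.

```shell
#!/bin/sh
# Simplified model (hypothetical helper, not kernel code) of the device-dax
# fault rule: a fault at OFFSET is serviced only if OFFSET is a multiple of
# the device ALIGN; otherwise the process receives SIGBUS.
# Usage: devdax_fault OFFSET ALIGN   (both in bytes)
devdax_fault() {
    if [ $(( $1 % $2 )) -eq 0 ]; then
        echo ok
    else
        echo SIGBUS
    fi
}

# 2M-aligned device: a 4M offset is fine, an offset that is only 4K aligned is not.
devdax_fault 4194304 2097152   # prints "ok"
devdax_fault 4096 2097152      # prints "SIGBUS"
```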
-s, --size=
For NVDIMM devices that support namespace labels, set the
namespace size in bytes. Otherwise it defaults to the maximum
size specified by platform firmware. This option supports the
suffixes "k" or "K" for KiB, "m" or "M" for MiB, "g" or "G" for
GiB and "t" or "T" for TiB.
For pmem namespaces the size must be a multiple of the
interleave-width and the namespace alignment (see --align
below).
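To make the size rules concrete, the following sketch converts a --size= value with the suffixes listed above (powers of 1024) into bytes, and checks a size against the alignment rule. Both function names are hypothetical, and size_ok encodes an assumption: it reads "a multiple of the interleave-width and the namespace alignment" as a multiple of WAYS × ALIGN bytes. ndctl's own validation may differ in detail.

```shell
#!/bin/sh
# Hypothetical helper: convert a size with an optional k/m/g/t suffix
# (KiB/MiB/GiB/TiB) into bytes. Returns non-zero on malformed input.
to_bytes() {
    num=${1%[kKmMgGtT]}              # drop a single trailing suffix, if any
    case $num in
        ''|*[!0-9]*) return 1 ;;    # reject empty or non-numeric input
    esac
    case $1 in
        *[kK]) echo $(( num << 10 )) ;;
        *[mM]) echo $(( num << 20 )) ;;
        *[gG]) echo $(( num << 30 )) ;;
        *[tT]) echo $(( num << 40 )) ;;
        *)     echo "$num" ;;
    esac
}

# Assumption: treat the rule as "SIZE is a multiple of WAYS * ALIGN".
# Usage: size_ok SIZE_BYTES WAYS ALIGN_BYTES
size_ok() {
    [ $(( $1 % ($2 * $3) )) -eq 0 ]
}

to_bytes 16G    # prints 17179869184
size_ok "$(to_bytes 24G)" 6 "$(to_bytes 2M)" && echo aligned   # prints "aligned"
```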
-a, --align=
Applications that want to establish dax memory mappings with
page table entries greater than the system base page size (4K on
x86) need a persistent memory namespace that is sufficiently
aligned. For "fsdax" and "devdax" mode this defaults to 2M.
Note that "devdax" mode enforces all mappings to be aligned to
this value, i.e. it fails unaligned mapping attempts. The
"fsdax" alignment setting determines the starting alignment of
filesystem extents and may limit the possible granularities.
If a large mapping is not possible, it will silently fall back
to a smaller page size.
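The contrast with devdax is that fsdax degrades instead of failing. The sketch below is a simplified model of that fallback under stated assumptions: it is not kernel code, the function name fsdax_page_size is hypothetical, and it only considers the x86 2M and 4K page sizes. The kernel's real decision also depends on the virtual address and the filesystem's extent layout.

```shell
#!/bin/sh
# Simplified model (an assumption, not kernel code) of the fsdax fallback:
# use the largest page size that does not exceed the namespace alignment
# and to which OFFSET is aligned; fall back to smaller pages otherwise.
# Usage: fsdax_page_size OFFSET ALIGN   (bytes)
fsdax_page_size() {
    for ps in 2097152 4096; do           # 2M, then 4K
        if [ "$ps" -le "$2" ] && [ $(( $1 % ps )) -eq 0 ]; then
            echo "$ps"
            return 0
        fi
    done
    return 1                             # offset not even 4K aligned
}

fsdax_page_size 4194304 2097152   # 2M-aligned offset: prints 2097152
fsdax_page_size 3145728 2097152   # only 4K-aligned: silently falls back, prints 4096
```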
-e, --reconfig=
Reconfigure an existing namespace (change the mode, sector size,
etc.). All namespace parameters, save uuid, default to the
current attributes of the specified namespace. The namespace is
then re-created with the specified modifications. The uuid is
refreshed to a new value by default whenever the data layout of
a namespace is changed; see --uuid= to set a specific uuid.
-u, --uuid=
This option is not recommended, as a new uuid should be generated
every time a namespace is (re-)created. For recovery scenarios,
however, the uuid may be specified.
-n, --name=
For NVDIMM devices that support namespace labels,
specify a human friendly name for a namespace. This name is
available as a device attribute for use in udev rules.
-l, --sector-size=
Specify the logical sector size (LBA size) of the
Linux block device associated with a namespace.
-M, --map=
A pmem namespace in "fsdax" or "devdax" mode requires allocation of
per-page metadata. The allocation can be drawn from either:
"mem": typical system memory
"dev": persistent memory reserved from the namespace
Given the relative capacities of "Persistent Memory" to "System
RAM", the allocation defaults to reserving space out of the
namespace directly ("--map=dev"). The overhead is 64 bytes per
4K (16GB per 1TB) on x86.
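The 16GB-per-1TB figure follows directly from 64 bytes per 4K page (a ratio of 1/64). The sketch below applies that x86 figure to an arbitrary namespace size; the function name pfn_overhead is hypothetical, and the estimate ignores any extra padding the kernel may add to satisfy alignment.

```shell
#!/bin/sh
# Hypothetical helper: estimate the per-page metadata for an fsdax/devdax
# namespace using the x86 figure quoted above (64 bytes per 4K page).
# Whether it comes from "System RAM" (--map=mem) or from the namespace
# itself (--map=dev), the size is the same. Alignment padding is ignored.
# Usage: pfn_overhead NAMESPACE_BYTES
pfn_overhead() {
    echo $(( $1 / 4096 * 64 ))
}

pfn_overhead $(( 1 << 40 ))   # 1TiB namespace: prints 17179869184 (16GiB)
```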
-c, --continue
Do not stop after creating one namespace. Instead, greedily create as
many namespaces as possible within the given --bus and --region filter
restrictions. This will abort if any creation attempt results in an
error unless --force is also supplied.
-f, --force
Unless this option is specified, the reconfigure namespace
operation will fail if the namespace is presently active.
Specifying --force causes the namespace to be disabled before
the operation is attempted. However, if the namespace is
mounted then the disable namespace and reconfigure
namespace operations will be aborted. The namespace must be
unmounted before being reconfigured. When used in conjunction
with --continue, continue the namespace creation loop even
if an error is encountered for intermediate namespaces.
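A sketch of combining --continue with --force to fill one region is shown below. It only prints the command unless DRY_RUN=0, so it is safe to try. The helper names run and fill_region and the region name region0 are placeholders; whether --size can be combined with --continue in a given ndctl release is an assumption worth checking before relying on it.

```shell
#!/bin/sh
# Print, rather than execute, a command unless DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: greedily create SIZE-sized namespaces in REGION,
# continuing past failures of intermediate namespaces (--continue with --force).
# Usage: fill_region REGION SIZE
fill_region() {
    run ndctl create-namespace --continue --force --region="$1" --size="$2"
}

fill_region region0 16G   # dry run: prints the ndctl command line
```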
-L, --autolabel, --no-autolabel
Legacy NVDIMM devices do not support namespace labels. In that
case the kernel creates region-sized namespaces that cannot
be deleted. Their mode can be changed, but they cannot be
resized smaller than their parent region. This is termed a
"label-less namespace". In contrast, NVDIMMs and hypervisors
that support the ACPI 6.2 label area definition (ACPI 6.2
Section 6.5.10 NVDIMM Label Methods) support "labelled
namespace" operation.
There are two cases where the kernel will default to
label-less operation:
The NVDIMM does not support labels.
The NVDIMM supports labels, but the Label Index Block (see
UEFI 2.7) is not present and there is no capacity aliasing
between blk and pmem regions.
In the latter case the configuration can be upgraded to
labelled operation by writing an index block on all DIMMs in a
region and re-enabling that region. The autolabel capability
of ndctl create-namespace --reconfig tries to do this by
default if it can determine that all DIMM capacity is
referenced by the namespace being reconfigured. Otherwise, if it
finds that a DIMM contributes capacity to more than one region, it
will not autolabel and the namespace remains in label-less mode.
This check prevents inadvertent data loss in case that other region
is in active use. The --autolabel option is implied by default;
the --no-autolabel option can be used to disable this behavior.
When automatic labeling fails and labelled operation is still
desired, the safety policy can be bypassed by the following
commands. Note that all data on all regions is forfeited by
running these commands:
ndctl disable-region all
ndctl init-labels all
ndctl enable-region all
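Because the three commands above forfeit all data on all regions, a script that wraps them might require an explicit confirmation and stop at the first failure. In this sketch the function name force_label_all and the CONFIRM guard value are hypothetical; the ndctl commands are exactly the ones listed above. With the default DRY_RUN=1 the commands are only printed.

```shell
#!/bin/sh
# Print, rather than execute, a command unless DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper around the destructive relabel sequence.
# Refuses to proceed unless CONFIRM=destroy-all-data is set, and stops
# at the first failing step (the && chain).
force_label_all() {
    if [ "${CONFIRM:-}" != "destroy-all-data" ]; then
        echo "refusing: set CONFIRM=destroy-all-data" >&2
        return 1
    fi
    run ndctl disable-region all &&
    run ndctl init-labels all &&
    run ndctl enable-region all
}
```

For example, `CONFIRM=destroy-all-data force_label_all` prints the three commands; adding `DRY_RUN=0` would actually run them.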
-v, --verbose
Emit debug messages for the namespace creation process.
-r, --region=
A regionX device name, or a region id number. Restrict the operation to
the specified region(s). The keyword all can be specified to indicate
the lack of any restriction; however, this is the same as not supplying a
--region option at all.
-b, --bus=
A bus id number, or a provider string (e.g. "ACPI.NFIT"). Restrict the
operation to the specified bus(es). The keyword all can be specified
to indicate the lack of any restriction; however, this is the same as
not supplying a --bus option at all.
COPYRIGHT
Copyright (c) 2016 - 2019, Intel Corporation. License GPLv2: GNU GPL
version 2 <URL: http://gnu.org/licenses/gpl.html >.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
SEE ALSO
UEFI NVDIMM Label Protocol <URL: http://www.uefi.org/sites/default/files/resources/UEFI_Spec_2_7.pdf >
Linux Persistent Memory Wiki <URL: https://nvdimm.wiki.kernel.org >