NVMe 1.4 Specification Published: Further Optimizing Performance and Reliability
by Billy Tallis on June 14, 2019 1:00 PM EST
NVM Sets and Endurance Groups
NVM Sets and Endurance Groups are two new high-level organizational constructs for managing pools of storage that are larger than an individual NVMe namespace. Since support for multiple namespaces is itself largely reserved for high-end enterprise SSDs, some of the features that depend on NVM Sets or Endurance Groups only make sense for multi-port drives, virtualized environments or NVMe over Fabrics arrays—situations where one NVMe controller is behaving like or providing access to multiple drives. However, some of these new features are still useful even on drives that only provide a single endurance group containing a single NVM set.
An Endurance Group is a collection of NVM Sets, which consist of namespaces and unallocated storage. Each endurance group is a separate pool of storage for wear leveling purposes: it has its own dedicated pool of spare blocks, and the drive reports separate wear statistics for each endurance group. On drives with more than one endurance group, it will be possible to completely wear out one endurance group and cause it to go read-only while other endurance groups remain usable.
Marvell's dual-chip NVMe controller architecture, a great candidate for multiple endurance groups
A drive can be designed to map specific NAND dies or channels to different NVM sets or endurance groups, essentially splitting it into multiple relatively independent drives. This not only separates wearout but also rigidly partitions performance. Cloud hosting providers can put VMs from separate customers on different NVM sets or endurance groups to ensure that a busy workload from one customer doesn't hurt the latency experienced by another.
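The hierarchy described above can be sketched as a small data model. This is only an illustration: the class names, the `retire_block` helper, and the rule that a group goes read-only once its spare blocks run out are assumptions made for the example, not structures or behavior taken from the NVMe specification.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NvmSet:
    """Hypothetical model of an NVM Set: namespaces plus unallocated space."""
    set_id: int
    namespace_ids: List[int] = field(default_factory=list)
    unallocated_gb: int = 0


@dataclass
class EnduranceGroup:
    """Hypothetical model of an Endurance Group with its own spare pool."""
    group_id: int
    nvm_sets: List[NvmSet]
    spare_blocks: int

    def retire_block(self) -> None:
        # Toy wear rule: each worn-out block consumes one spare from this
        # group's pool only; other groups are unaffected.
        if self.spare_blocks > 0:
            self.spare_blocks -= 1

    @property
    def read_only(self) -> bool:
        # Assumption for illustration: no spares left means read-only.
        return self.spare_blocks == 0


def writable_namespaces(groups: List[EnduranceGroup]) -> List[int]:
    """Return namespace IDs that live in groups which still accept writes."""
    return sorted(
        ns_id
        for group in groups
        if not group.read_only
        for nvm_set in group.nvm_sets
        for ns_id in nvm_set.namespace_ids
    )
```

Wearing out one group in this model leaves the namespaces in the other groups writable, which is the isolation property the article describes.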
Predictable Latency Mode
The new Predictable Latency Mode feature allows the host to temporarily pause any background work the SSD controller is performing, ensuring that there is nothing to get in the way of immediately processing new IO commands arriving from the host system. This allows drives to offer their best-case and most consistent performance. SSDs cannot operate in this mode indefinitely, and will eventually need to leave deterministic mode to catch up on background work. The drive provides running estimates of how long it can remain in a deterministic window before it will have to switch itself back to non-deterministic performance, in terms of time and in terms of how many more 4kB random reads and optimal-sized writes it can handle deterministically.
Predictable Latency Mode will typically be used in environments where the host software can load-balance across multiple drives. High-priority IO can be directed to drives that are currently in a deterministic window, and drives that are in a non-deterministic window can either be left alone to catch up on background work, or used to handle low-priority IO. Each individual drive in the pool will alternate between deterministic and non-deterministic operation, and the timing of these windows will depend on the workload. If the load balancer is working well, it will stop issuing latency-sensitive IO to a drive and manually take it out of deterministic mode before the drive reaches its limit. Drives can be configured to provide a warning at a customizable threshold before the limits are reached, so host systems won't need to constantly check the status indicators to see whether the drive is close to leaving its deterministic window. Predictable Latency Mode is configured on a per-NVM Set basis, so drives that provide multiple sets can have some in each mode at any given time and perform load balancing between NVM Sets.
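A host-side load balancer of the kind described above might look roughly like the following sketch. The `SetStatus` record, its field names, and the `low_water` threshold are hypothetical; a real host would fill them from the drive's status reporting, which is not modeled here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SetStatus:
    """Hypothetical snapshot of one drive or NVM Set as seen by the host."""
    name: str
    deterministic: bool      # currently in a deterministic window?
    reads_remaining: int     # drive's estimate of reads left in the window


def pick_target(sets: List[SetStatus], low_water: int) -> Optional[SetStatus]:
    """Choose where to send latency-sensitive IO.

    Only targets in a deterministic window with comfortable headroom
    qualify; among those, prefer the one with the most headroom.
    """
    candidates = [
        s for s in sets if s.deterministic and s.reads_remaining > low_water
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda s: s.reads_remaining)


def sets_to_release(sets: List[SetStatus], low_water: int) -> List[str]:
    """Names of targets the host should take out of deterministic mode
    itself, before the drive reaches its limit."""
    return [
        s.name for s in sets if s.deterministic and s.reads_remaining <= low_water
    ]
```

The threshold plays the role of the configurable warning mentioned above: the host stops routing high-priority IO to a target before its window actually expires.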
This isn't the first NVMe feature for controlling when the drive performs background work: NVMe 1.3 added the Non-Operational Power State Permissive Mode feature so that host systems could ask drives not to do background work when in a low-power idle state. The intent here is to defer background work while running on battery power or with the system's fans off, and this feature does not affect background work while the drive is in an active power state.
Submission Queue Associations
NVMe SSDs typically support multiple command submission and completion queues. Currently, the main use case for this is to give each CPU core its own queues so that no core-to-core synchronization is necessary at the driver level to perform ordinary IO. There's also been recent work on the Linux NVMe driver to add support for using queues for specialized purposes, such as having dedicated queues for high-priority commands that will be polled for completion instead of waiting for an interrupt, or having separate queues per core for read and write commands. NVMe 1.4 and Predictable Latency Mode add another potential use case for multiple queues: associating a queue with a specific NVM Set. NVMe 1.4 allows the host system to inform the SSD of which NVM Set it plans to use each queue with, which can give the SSD controller the opportunity to further reduce latency or improve QoS when using Predictable Latency Mode. This feature is optional even for drives that support Predictable Latency Mode.
Namespace Write Protection
The new Namespace Write Protection feature is pretty self-explanatory. An NVMe namespace can be put into one of three read-only modes: read-only until the next power cycle, read-only until the first power cycle after the write protect feature is disabled, or permanently read-only for the lifetime of the drive. This provides a range of options for protecting critical data in embedded or high-security systems.
A typical use case will be to put the OS or a minimal recovery system on a write-protected namespace, and keep user data and apps on a regular read+write namespace. This is one of the first NVMe features to start providing a compelling case for client SSDs to support multiple namespaces, since protected operating systems or recovery partitions are already commonplace for both mobile and desktop operating systems. Write-protected namespaces can get the drive itself involved in helping protect the OS against accidental or malicious tampering, and this feature is much simpler than the existing Replay Protected Memory Block feature.
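The three read-only modes can be pictured as a small state machine. The sketch below follows the wording of the description above rather than the specification's actual command encoding; the `WriteProtect` names, the `Namespace` class, and its methods are all invented for illustration.

```python
from enum import Enum, auto


class WriteProtect(Enum):
    NONE = auto()
    UNTIL_POWER_CYCLE = auto()   # cleared by the next power cycle
    UNTIL_DISABLED = auto()      # cleared by the first power cycle after disabling
    PERMANENT = auto()           # never cleared


class Namespace:
    """Toy namespace that enforces the write-protect modes described above."""

    def __init__(self) -> None:
        self.state = WriteProtect.NONE
        self.disable_pending = False
        self.blocks = {}

    def protect(self, state: WriteProtect) -> None:
        if self.state is WriteProtect.PERMANENT:
            raise PermissionError("permanent write protection cannot be changed")
        self.state = state
        self.disable_pending = False

    def request_disable(self) -> None:
        # Assumption: disabling takes effect only at the next power cycle.
        if self.state is WriteProtect.UNTIL_DISABLED:
            self.disable_pending = True

    def power_cycle(self) -> None:
        if self.state is WriteProtect.UNTIL_POWER_CYCLE:
            self.state = WriteProtect.NONE
        elif self.state is WriteProtect.UNTIL_DISABLED and self.disable_pending:
            self.state = WriteProtect.NONE
            self.disable_pending = False

    def write(self, lba: int, data: bytes) -> None:
        if self.state is not WriteProtect.NONE:
            raise PermissionError("namespace is write protected")
        self.blocks[lba] = data
```

In this model an OS or recovery namespace would be set to one of the protected states after provisioning, while the user-data namespace stays in `NONE`.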
What's Next
Open Channel SSDs have been around in some form or another for several years. These expose more of the inner workings of the SSD and move all or part of the flash translation layer out of the drive and onto the host CPU. Several NVMe features have already been inspired in part by open channel SSDs, including the above hints about optimal block sizes and data alignment and the NVM Sets and Endurance Groups features. Last year Microsoft formed Project Denali to explore the best balance between the low-level control afforded by open channel SSDs and the easy-to-use abstraction of traditional block storage, with the ultimate goal of producing new standards that can be more broadly adopted than existing open channel efforts. That and related work is likely to continue influencing NVMe, but the general approach for NVMe has been to prefer exchanging hints rather than imposing hard restrictions on IO patterns.
(source: Western Digital presentation at USENIX Vault 2019)
In a similar vein, there is a Technical Proposal in the NVMe working group to implement a Zoned Namespace feature that borrows ideas from how Shingled Magnetic Recording hard drives are handled. Host-managed SMR hard drives have similar constraints to NAND flash in that they have large regions that can support random reads but not overwriting of existing data; for NAND flash these regions are the erase blocks. Software and filesystems that have already been modified to support host-managed SMR drives can easily support zoned SSDs. It's not at all trivial to modify software to work without random write support, but a lot of that work has already been done. Making use of it on SSDs can reduce write amplification, improve latency and cut down on how much DRAM an SSD needs.
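The constraint that zoned devices impose on software can be shown with a toy model. This sketch borrows the write-pointer idea from host-managed SMR drives; the `Zone` class and its methods are hypothetical and do not reflect the proposal's actual command set, which was still under discussion when this was written.

```python
class Zone:
    """Toy zone: random reads are allowed, writes only at the write pointer."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.write_pointer = 0
        self.blocks = [None] * capacity

    def write(self, lba: int, data: bytes) -> None:
        # Sequential-only rule: no overwriting, no skipping ahead.
        if lba != self.write_pointer:
            raise ValueError("writes must land exactly at the write pointer")
        if self.write_pointer >= self.capacity:
            raise ValueError("zone is full; reset it before writing again")
        self.blocks[lba] = data
        self.write_pointer += 1

    def read(self, lba: int) -> bytes:
        # Any already-written block can be read in any order.
        if not 0 <= lba < self.write_pointer:
            raise ValueError("block has not been written")
        return self.blocks[lba]

    def reset(self) -> None:
        # Analogous to erasing the underlying region: all data is discarded.
        self.blocks = [None] * self.capacity
        self.write_pointer = 0
```

Software that already writes append-only and reclaims space a whole zone at a time fits this model directly, which is why work done for host-managed SMR carries over.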
(source: Samsung Key-Value SSD product brief)
Several vendors have been developing SSDs that present a key-value database interface instead of a traditional block device. Supporting this only adds a little bit of complexity to the SSD's flash translation layer but it cuts out a lot of redundant abstractions on the host software side, so performance can be surprisingly high compared to running a key-value database application on top of a filesystem on a traditional SSD. We will probably see a standard for key-value namespaces added to NVMe sooner rather than later.
Further down the road, there are hopes of producing standards for computational storage devices and accelerators/coprocessors that can build atop NVMe infrastructure. There are a lot of companies working on or already shipping devices that offload tasks like compression, encryption, searching and AI inferencing from CPUs and instead perform these computational tasks closer to where the data is stored. Work to standardize the interfaces to such devices is still in its infancy, but in two or three years when it's time for NVMe 1.5, some of these ideas may have matured enough to start yielding standards for the basic infrastructure around computational storage.
Source: NVM Express
14 Comments
Arbie - Friday, June 14, 2019
Does this have any implications for motherboard hardware in the near future (six months or so)? Or in fact for any hardware not part of the SSD? From the text it seems like it doesn't, unless I missed something. Excellent article BTW; thanks.
dooferorg - Friday, June 14, 2019
As with anything, technology always marches on. At least if a motherboard supports PCIe3, or soon PCIe4 then you'll be able to get appropriate cards to interface these newer drives with.
Billy Tallis - Friday, June 14, 2019
Motherboards are only minimally involved with NVMe; basically just for booting. Otherwise, they're just routing PCIe lanes from the CPU to the SSD and are blind to the higher-level protocols.
bobhumplick - Saturday, June 15, 2019
so does this mean that nvme 1.4 will be back compat with current boards if new drives are used? or will new hardware be needed? i know the other person basically asked that but i didnt fully understand if an answer had been given. seems like as long as the drive supports it as well as the ssd then it should work. unless the bios has to support it?
cygnus1 - Monday, June 17, 2019
As was mentioned, this is software change so I don't know that anything would be necessary depending on the changes in the protocol. But if anything is needed, I think a motherboard would just need to have a firmware update to enable NVMe 1.4 support. I haven't seen it said anywhere, but I would think as long as the device could boot and everything after that is just passed between the SSD and the OS, then the OS and the SSD could possibly just negotiate up to 1.4 even if the motherboard only supported 1.3 for boot purposes.
CheapSushi - Friday, June 14, 2019
I really like the idea of there being more variety for addon cards or storage drives that can act as accelerators or coprocessors.
willis936 - Saturday, June 15, 2019
Good stuff. I hope anyone who implements PMR also does periodic (like once a second) synchronization to storage. Otherwise it’s pointless to even bother sending the file across PCIe in the first place.
boeush - Saturday, June 15, 2019
You miss the point of PMR. It doesn't need to sync periodically, because it is automatically guaranteed persistence across power cuts. That's what makes it different from regular RAM, and provides the reason for sending the file across PCIe.
willis936 - Saturday, June 15, 2019
Backup power on drives is used to flush data that is cached in RAM to NAND. You can’t guarantee when power will be restored. If a design treats data that users care about as disposable then it’s a bad design. If it’s data the user doesn’t care about then why is it getting sent to storage at all? Main memory works fine. It costs very little to have the drive sync to NAND.
extide - Sunday, June 16, 2019
It syncs when the power goes out. It doesn't need to at any other time.