# arm

MEMSYS18

## Quantifying the Performance Overheads of PMDK

William Wang, Stephan Diestelhorst 2 October 2018

© 2018 Arm Limited

## **NVDIMMs Shipping in 2H'2018**



- 3DXP NVDIMM shipping 2H'2018
  - DDR4 -> \$10/GB, NAND -> \$1/4 per GB
  - 3DXP -> \$2/GB, up to 512GB
- Arm NVDIMM-ready timeframe around 2022

## **Multiple System Use-Cases for NVM**



arm

## **Spectrum of Application Performance Boost with NVM**

Persistent Memory

**Faster Storage** 

#### **Executive summary**

|                                     | NVMe Flash SSD                   | NVMe Flash SSD                    | PMEM + Flash                  | PMEM-only                       |
|-------------------------------------|----------------------------------|-----------------------------------|-------------------------------|---------------------------------|
| Config→                             | AOF every <u>second</u><br>+ RDB | AOF every <u>request</u><br>+ RDB | AOF on PMEM<br>+ RDB on Flash | SW persistence<br><u>no</u> RDB |
| Data loss potential                 | Up to 1 sec worth                | None                              | None                          | None                            |
| Crash restart                       | Slowest                          | Slowest                           | Fast                          | Instantaneous                   |
| Application rewrite                 | No                               | No                                | No                            | Yes                             |
| "Memory" cost^                      | lx                               | 1x                                | 1.25x <sup>§</sup>            | 0.27x <sup>+</sup>              |
| Iso-SLA performance( <sup>‡</sup> ) | 1x                               | 0.165x (0.149x)                   | 1.46x (1.31x)                 | 2.18x (1.96x)                   |

Baseline



- **13x** perf boost, **1/4** mem cost
- No data loss
- Instantaneous restart
- Rewrite application

^ Flash costs are marginal for all configurations, thus ignored.

§ Assumes 8GB/core DRAM + 1 GB/core SCM capacity – SCM capacity very generous for log use-case, previous studies used as low as 40MB/core.

+ Assumes 8GB/core DRAM capacity, SCM = 0.25x DRAM \$/GB, 1:8 capacity ratio for DRAM:SCM cache ratio.

‡ Measured performance de-rated by 0.9x to account for a slower-than-DRAM SCM media fronted by a DRAM cache.

\* Unable to achieve same SLA as baseline with p99 for the every-request AOF logging configuration – hence the unconstrained SLA comparison.

4 Confidential © 2018 Arm Limited

arm

Slide Courtesy of Kshitij Sudan

## **Why Application Rewrite**



## Recovery can inspect the data-structures in PM to restore system to a consistent state

- Crash consistency (failure atomicity) is needed to ensure recovery can restore system to a consistent state
  - Data move through volatile memories before they get written to PM
  - Using CPU cache flushes and fence instructions

### **Example: Add a Node to a Linked List**





## **Example: Add Concurrency to the Mix**







#### What can still go wrong?

Fences are missing after persists, they can get persisted out of order!!!

#### What about making this lock-free?





- Oracle Labs at PPoPP'18

Source: M. Friedman, A Persistent Lock-Free Queue for Non-Volatile Memory, PPoPP'18



## **Persistent Memory Programming Models**

#### **Native Persistence**

```
pt->x = 1;
pt->y = 1;
dccvap(&pt->x)
dccvap(&pt->y)
dsb
```

flag=1; dccvap(&flag) dsb

© 2018 Arm Limited

```
Library Persistence – Atomic
pt - x = I;
pt - y = I;
pmem persist(&pt,
sizeof(pt))
flag = I;
pmem persist(&flag,
sizeof(flag))
```

Programming simpler, overhead higher

**Library Persistence – Durable TXs** 

TX BEGIN{

pt->x = 1;

pt->y = 1;

TX END

## **Add Concurrency to the Mix**

#### Lib Persistence – Lock-free

| 1        | <pre>// ctor, dtor, copy</pre> | y and assign                      |      |
|----------|--------------------------------|-----------------------------------|------|
| 2        | struct Node                    |                                   |      |
| 3        | {                              |                                   |      |
| 4        | int x;                         |                                   |      |
| 5        | int y;                         |                                   |      |
| 6        | };                             |                                   |      |
| 7        |                                |                                   |      |
| 8        | <pre>//lock-free update</pre>  |                                   |      |
| 9        | <pre>int updateNode(Node</pre> | e* pt, <i>int</i> flag)           |      |
| 10       | {                              |                                   |      |
| 11       | // allocate a i                |                                   |      |
| 12       |                                | newpt = new Node();               |      |
| 13       | <pre>persist(newpt);</pre>     | ;                                 |      |
| 14       |                                |                                   |      |
| 15       | <pre>// update the </pre>      | node                              |      |
| 16       | newpt->x = 1;                  |                                   |      |
| 17       | <pre>newpt-&gt;y = 1;</pre>    |                                   |      |
| 18<br>19 | persist(&newpt-                |                                   |      |
| 20       | persist(&newpt-                | ->y);                             |      |
| 20       |                                | pinter to the new Node and update | f1 a |
| 22       | while(true) {                  | Sincer to the new Node and update | i ta |
| 22       | // copy on                     | write                             |      |
| 23       |                                | e*> cowpt = pt;                   |      |
| 25       |                                | ε*> cowpt = pt,                   |      |
| 26       | if(CAS(&nt                     | , cowpt, newpt)) {                |      |
| 27       | persis                         |                                   |      |
| 28       |                                | contention for flag               |      |
| 29       | flag =                         |                                   |      |
| 30       |                                | t(&flag);                         |      |
| 31       | return                         |                                   |      |
| 32       | }                              |                                   |      |
| 33       | }                              |                                   |      |
| 34       | }                              |                                   |      |
|          |                                |                                   |      |



Lib Persistence – Durable TXs

TX\_BEGIN{ pt->x = 1; pt->y = 1; } TX\_END

## **Durable Transactions**

- TXs provide clean failure semantics
- Accelerate w. Persistent HTM



- Durable transactions guarantee Tx failure atomicity
  - Ordered persists + logging



## **PMDK (Persistent Memory Development Kit)**

#### Formally NVML, 'pmem libraries'

- PMDK provides transactional APIs for persistent memory programming
  - libpmemobj transactional APIs
  - Use fine-grained logging and cache flushes
- Works on 64-bit Linux, Windows and 64-bit FreeBSD
- Supports x86 and Armv8 (experimental)



Ref: pmem.io

## Analyze Cost of Durable Transactions

Logging, Persisting and Ordering

## **Cost of Durable Transactions**

Logging, Persisting and Ordering

#### Persisting

- Latency too high if PoP is offchip
- Undo log needs to be persisted in TX
- Instruction bloat due to looping through addresses with DCCVAP

#### Logging

- Write amplification, i.e., write twice
- Copying (costly, and cache pollution)
- Barriers with undo logging
- Memory fragmentation
- Code bloat due to logging

#### Ordering

- Serialize memory accesses
- Excessive # of barriers via API calls
- Persisting barriers same as ordering barriers

### **Performance Measurement Setup**





## Flushing, Logging and Fencing Overhead



- Logging overhead 0~35%
- Flushing overhead 14~53%
- Fencing overhead 2~5%

#### Baseline: PMDK without flushing/fencing and logging on



## Software Opt Can Reduce Logging Overhead



- Software opt can reduce logging overhead
  - NVML v1.3 (left) reduces logging overhead over NVML v1.0 (right)
  - Redis-SET performance not affected by logging overhead

#### Baseline: PMDK without flushing/fencing and logging on

#### arm

## **Summary of Performance Overheads Analysis**

- Flushing overhead up to 53%.
- Logging overhead up to 35%.
- Fencing overhead up to 5%
- These overheads can be reduced by
  - Software optimizations
  - Hardware accelerations

## **Ease-of-Programming vs Performance**



Ease-of-programming

| Programming<br>Model | Ordering     | Logging                    | Persisting   | Translation  |
|----------------------|--------------|----------------------------|--------------|--------------|
| Atomics              | $\checkmark$ | Х                          | $\checkmark$ | $\checkmark$ |
| ТМ                   | $\checkmark$ | $\checkmark$               | $\checkmark$ | $\checkmark$ |
| Locking              | $\checkmark$ | $\checkmark$ (log elision) | $\checkmark$ | $\checkmark$ |

#### Ease-of-Programming

Persistent CAS as drop-in replacement for CAS Persistent HTM as drop-in replacement for HTM Compiler instrumented failure atomicity for locking

#### Performance

Relaxed memory persistency model Hardware accelerations Software optimizations



Thank You! Danke! Merci! 谢谢! ありがとう! **Gracias!** Kiitos! 감사합니다 धन्यवाद

