criu-amdgpu-plugin - Man Page
A plugin extension to CRIU to support checkpoint/restore in userspace for AMD GPUs.
Current Support
- Single and Multi GPU systems (Gfx9)
- Checkpoint / Restore on different system
- Checkpoint / Restore inside a docker container
- Pytorch
- Tensorflow
- Using CRIU Image Streamer
Description
Though criu is a great tool for checkpointing and restoring running applications, it has certain limitations; in particular, it cannot handle applications that have device files open. To support ROCm-based workloads, criu’s core functionality must be augmented through its plugin-based extension mechanism. criu-amdgpu-plugin provides the support criu needs to Checkpoint / Restore ROCm applications.
Dependencies
amdkfd support
To snapshot VRAM and other GPU device state, the plugin requires an updated version of the amdkfd (amdgpu) driver.
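A quick sanity check before attempting a dump is to confirm that the amdkfd device node is present. The sketch below assumes the usual /dev/kfd path and uses a hypothetical helper name; it only shows that the driver is loaded, not that the installed version includes checkpoint/restore support.

```shell
#!/bin/sh
# Hypothetical helper (not part of CRIU or the plugin): succeed if the given
# path exists as a character device. The default, /dev/kfd, is assumed to be
# the device node created by the amdkfd driver.
kfd_device_present() {
    [ -c "${1:-/dev/kfd}" ]
}

if kfd_device_present; then
    echo "amdkfd device node found"
else
    echo "amdkfd device node missing; is the amdgpu driver loaded?"
fi
```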
Options
Optional parameters can be passed as environment variables set before executing the criu command.
- KFD_FW_VER_CHECK
Enable or disable the firmware version check. If enabled, the firmware version on the restored GPU must be greater than or equal to the firmware version on the checkpointed GPU. Default: Enabled
E.g: KFD_FW_VER_CHECK=0
- KFD_SDMA_FW_VER_CHECK
Enable or disable the SDMA firmware version check. If enabled, the SDMA firmware version on the restored GPU must be greater than or equal to the SDMA firmware version on the checkpointed GPU. Default: Enabled
E.g: KFD_SDMA_FW_VER_CHECK=0
- KFD_CACHES_COUNT_CHECK
Enable or disable the caches count check. If enabled, the caches count on the restored GPU must be greater than or equal to the caches count on the checkpointed GPU. Default: Enabled
E.g: KFD_CACHES_COUNT_CHECK=0
- KFD_NUM_GWS_CHECK
Enable or disable the num_gws check. If enabled, num_gws on the restored GPU must be greater than or equal to num_gws on the checkpointed GPU. Default: Enabled
E.g: KFD_NUM_GWS_CHECK=0
- KFD_VRAM_SIZE_CHECK
Enable or disable the VRAM size check. If enabled, the VRAM size on the restored GPU must be greater than or equal to the VRAM size on the checkpointed GPU. Default: Enabled
E.g: KFD_VRAM_SIZE_CHECK=0
- KFD_NUMA_CHECK
Enable or disable the NUMA CPU region check. If enabled, the plugin restores GPUs that belong to one CPU NUMA region to the same CPU NUMA region. Default: Enabled
E.g: KFD_NUMA_CHECK=1
- KFD_CAPABILITY_CHECK
Enable or disable the capability check. If enabled, the capability on the restored GPU must be equal to the capability on the checkpointed GPU. Default: Enabled
E.g: KFD_CAPABILITY_CHECK=1
- KFD_MAX_BUFFER_SIZE
On some systems, VRAM size may exceed RAM size, so the buffers used for dumping and restoring VRAM may not fit in memory. Set to a nonzero value (in bytes) to limit the plugin’s memory usage. Default: 0 (Disabled)
E.g: KFD_MAX_BUFFER_SIZE="2G"
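As a minimal sketch of how these variables might be passed, the snippet below sets them for a single command only by prefixing it with env. The helper name run_with_kfd_opts and the image directory are hypothetical placeholders, and the choice of which checks to disable is arbitrary; the buffer limit is simply 2 GiB written out as a plain byte count.

```shell
#!/bin/sh
# Hypothetical helper (not part of CRIU or the plugin): run one command with
# both firmware version checks disabled and a 2 GiB cap on plugin buffers.
run_with_kfd_opts() {
    # 2 GiB as an explicit byte count: 2 * 1024^3 = 2147483648.
    max_bytes=$((2 * 1024 * 1024 * 1024))
    env KFD_FW_VER_CHECK=0 \
        KFD_SDMA_FW_VER_CHECK=0 \
        KFD_MAX_BUFFER_SIZE="$max_bytes" \
        "$@"
}

# Example invocation (placeholder image directory), left commented out
# because it needs root, an existing checkpoint image and a ROCm system:
# run_with_kfd_opts criu restore --images-dir /tmp/criu-images --shell-job
```

Because the variables are set through env, they apply only to the wrapped command and do not leak into the rest of the shell session.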
Author
The AMDKFD team.
Copyright
Copyright (C) 2020-2021, Advanced Micro Devices, Inc. (AMD)