Manage Partitions

Network partitions represent logical groupings of GPUs that reside within the same network domain. Partitions can be created based on one of two member types:

  • GPU-ID-based: A list of unique GPU identifiers.
  • Location-based: A set of objects describing each GPU’s physical placement, including attributes such as domain, chassis, slot, and host.

A partition’s member type is fixed at creation and cannot be changed or updated. All subsequent operations—such as updates or read requests—must use the same member type. For example, attempting to update a location-based partition using GPU IDs will result in a 409 Conflict error.

GPUs can only belong to one partition; all GPUs within a partition must belong to the same network domain.

Partition API Endpoints

Use the /v1/partitions endpoint to create, update, view, or delete a partition. Most API responses return an operation ID, which you can use to query the status of the request:

Endpoint Description
GET /nmx/v1/partitions Retrieve a list of partitions
POST /nmx/v1/partitions Create a partition. The request body must include a partition name and a members object, which is either GPU-ID-based or location-based
GET /nmx/v1/partitions/{id} Retrieve partition information, including health and metadata
PUT /nmx/v1/partitions/{id} Update a partition. Note that the partition name cannot be modified. However, you can update its member list. When performing a PUT operation, the members parameter must include all GPUs that will belong to the partition. The system compares the provided list with the current configuration and adds or removes members automatically.
DELETE /nmx/v1/support-packages/{id} Delete a partition

Monitor Partition Health

Depending on the resiliency mode for the partition, the partition can enter one of the following health states:

Resiliency Mode State Description
Full-Bandwidth Mode HEALTHY Operates at full bandwidth and full compute capacity. This is the optimal state.
DEGRADED Some GPUs may be parked with a NO_NVLINK health status. Remaining GPUs operate at full bandwidth; the partition remains operational.
UNHEALTHY Internal failures render the partition non-operational.
Adaptive-Bandwidth Mode HEALTHY Runs at full bandwidth and full compute capacity. This is the optimal state.
BANDWIDTH Some trunk links are unavailable, reducing bandwidth. All GPUs can still communicate; considered operational.
UNHEALTHY Internal failures render the partition non-operational.
User-Action-Required Mode HEALTHY Operates at full bandwidth and full compute capacity. This is the optimal state.
DEGRADED_BANDWIDTH Missing trunk links reduce communication bandwidth. All GPUs can still communicate; considered operational.
UNHEALTHY Internal failures render the partition non-operational.