support mthreads gpu monitoring #198

gingerXue · 2025-12-19T08:18:45Z

Issue Type

Improvement/feature implementation

Runtime Environment

Operating system and version: Ubuntu 22.04.4 LTS
Terminal emulator and version: xterm-256color
Python version: 3.10.12
NVML version (driver version): N/A
MTML version: 2.2.0
nvitop version or commit: 1.6.2.dev4+g31792dd
mthreads-ml-py version: 2.2.0
Locale: C.UTF-8

Description

This PR adds Mthreads GPU (mtml) support to nvitop, enabling basic GPU monitoring on platforms where mtml is available. We developed a wrapper layer around mthreads-ml-py that exposes NVML-style methods, which keeps the changes to this project small.

The implementation is designed to be non-intrusive and fully backward compatible with existing NVML-based workflows.
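
To illustrate the wrapper idea, here is a minimal sketch of an NVML-style function layered over a vendor device object. `FakeMtmlDevice` and its attributes are hypothetical stand-ins, not the real mthreads-ml-py API, and the wrapper in this PR may be structured differently. Only the function name `nvmlDeviceGetMemoryInfo` and its `total` / `free` / `used` fields follow NVML.

```python
from collections import namedtuple

# Mirrors the fields of NVML's memory info struct.
MemoryInfo = namedtuple('MemoryInfo', ['total', 'free', 'used'])


class FakeMtmlDevice:
    """Hypothetical stand-in for a vendor device handle (not the real mtml API)."""

    def __init__(self, total_bytes, used_bytes):
        self.total_bytes = total_bytes
        self.used_bytes = used_bytes


def nvmlDeviceGetMemoryInfo(handle):
    """Expose an NVML-style call on top of the vendor handle."""
    return MemoryInfo(
        total=handle.total_bytes,
        free=handle.total_bytes - handle.used_bytes,
        used=handle.used_bytes,
    )


device = FakeMtmlDevice(total_bytes=80 * 1024**3, used_bytes=72 * 1024**3)
info = nvmlDeviceGetMemoryInfo(device)
print(f'free: {info.free / 1024**3:.2f}GiB')
```

Callers written against the NVML-style name do not need to know which vendor library sits underneath, which is what keeps the change non-intrusive.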

Motivation and Context

nvitop currently relies on NVIDIA NVML, which makes it unusable on systems equipped with MTGPU devices.
In such environments, users lack a lightweight, top-like GPU monitoring tool.

This PR aims to:

Extend nvitop to support MTGPU-based platforms
Preserve existing behavior on NVIDIA GPUs
Minimize impact on the current code structure

Design & Implementation

Introduced a new backend based on mtml, parallel to the existing NVML backend
Runtime detection is used to select the appropriate backend:
- nvml → NVIDIA GPUs
- mtml → MTGPU devices
Implemented a compatibility layer to map MTGPU APIs to nvitop's internal data structures
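
The runtime detection described above could look roughly like the following sketch. The module names `pynvml` and `pymtml` are assumptions made for illustration, and the sketch only checks whether a Python module is importable. The detection in this PR may also try to load and initialize the vendor library.

```python
import importlib.util

# Backend name -> Python module providing it (module names are assumptions).
# Order matters: NVML is tried first so existing NVIDIA setups are unaffected.
BACKEND_MODULES = {
    'nvml': 'pynvml',
    'mtml': 'pymtml',
}


def is_importable(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def select_backend(probe=is_importable):
    """Return the first backend whose module passes the probe, else None."""
    for backend, module_name in BACKEND_MODULES.items():
        if probe(module_name):
            return backend
    return None


print(select_backend())
```

Passing the probe as a parameter makes the selection logic testable without either vendor library installed.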

Currently Supported Features (MTGPU)

Driver Version
GPU device enumeration
Total / used memory reporting
Basic utilization metrics
Power usage

Not Yet Supported

MIG-related features
Process enumeration and utilization
CUDA driver version information
Persistence mode
Bus ID information
Advanced performance counters (not available in mtml)
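
One way to keep unsupported queries from breaking the display is to fall back to a placeholder value. The sketch below is an illustration only: `UnsupportedQuery`, `query_or_na`, and the two getter functions are hypothetical names, and the PR may handle missing features differently.

```python
NA = 'N/A'  # placeholder for values the backend cannot provide (hypothetical)


class UnsupportedQuery(Exception):
    """Hypothetical error raised when a backend lacks a query."""


def query_or_na(func, *args, default=NA):
    """Call a backend query and return a placeholder if it is unsupported."""
    try:
        return func(*args)
    except (UnsupportedQuery, NotImplementedError):
        return default


def get_bus_id(handle):
    # Stands in for a query that the backend does not implement.
    raise UnsupportedQuery('bus ID is not available on this backend')


def get_power_usage(handle):
    # Stands in for a supported query; the value is a made-up reading.
    return 250


print(query_or_na(get_bus_id, None))
print(query_or_na(get_power_usage, None))
```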

Testing

Tested on:

MTGPU platform with mtml

Manual test cases include:

nvitop startup and refresh
MTGPU information display
Memory usage display
Mixed error handling when NVML is not present

Basic API Test

from nvitop import Device

count = Device.count()
print(f'There are {count} MUSA devices')
devices = Device.all()

for device in devices:
    processes = device.processes()
    sorted_pids = sorted(processes)
    
    print(device)
    print(f'  - Fan speed:       {device.fan_speed()}%')
    print(f'  - Temperature:     {device.temperature()}C')
    print(f'  - GPU utilization: {device.gpu_utilization()}%')
    print(f'  - Total memory:    {device.memory_total_human()}')
    print(f'  - Used memory:     {device.memory_used_human()}')
    print(f'  - Free memory:     {device.memory_free_human()}')
    print(f'  - Processes ({len(processes)}): {sorted_pids}')
    for pid in sorted_pids:
        print(f'    - {processes[pid]}')
    print('-' * 120)

There are 8 MUSA devices
PhysicalDevice(index=0, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     52C
  - GPU utilization: 0%
  - Total memory:    80.00GiB
  - Used memory:     78.88GiB
  - Free memory:     1148MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=1, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     67C
  - GPU utilization: 99%
  - Total memory:    80.00GiB
  - Used memory:     73.63GiB
  - Free memory:     6519MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=2, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     67C
  - GPU utilization: 99%
  - Total memory:    80.00GiB
  - Used memory:     71.03GiB
  - Free memory:     9187MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=3, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     59C
  - GPU utilization: 59%
  - Total memory:    80.00GiB
  - Used memory:     78.23GiB
  - Free memory:     1810MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=4, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     77C
  - GPU utilization: 99%
  - Total memory:    80.00GiB
  - Used memory:     73.39GiB
  - Free memory:     6765MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=5, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     69C
  - GPU utilization: 99%
  - Total memory:    80.00GiB
  - Used memory:     72.68GiB
  - Free memory:     7497MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=6, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     78C
  - GPU utilization: 99%
  - Total memory:    80.00GiB
  - Used memory:     75.62GiB
  - Free memory:     4480MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=7, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     63C
  - GPU utilization: 99%
  - Total memory:    80.00GiB
  - Used memory:     72.48GiB
  - Free memory:     7702MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------

Future Work

Extend MTGPU metrics as mtml evolves
Add automated tests for backend selection
Improve feature parity where possible

Images / Videos

support mthreads-ml-py

a1c3f09

gingerXue force-pushed the feat/mtgpu-support branch from 272c161 to a1c3f09 Compare December 19, 2025 09:08

gingerXue closed this Dec 22, 2025

gingerXue reopened this Dec 22, 2025
