@gingerXue

Issue Type

  • Improvement/feature implementation

Runtime Environment

  • Operating system and version: Ubuntu 22.04.4 LTS
  • Terminal emulator and version: xterm-256color
  • Python version: 3.10.12
  • NVML version (driver version): N/A
  • MTML version: 2.2.0
  • nvitop version or commit: 1.6.2.dev4+g31792dd
  • mthreads-ml-py version: 2.2.0
  • Locale: C.UTF-8

Description

This PR adds Mthreads GPU (mtml) support to nvitop, enabling basic GPU monitoring on platforms where mtml is available. We developed a wrapper layer around mthreads-ml-py that exposes NVML-style methods, which keeps the changes required in this project small.

The implementation is designed to be non-intrusive and fully backward compatible with existing NVML-based workflows.
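
The wrapper idea described above can be illustrated with a minimal sketch. This is not the code in this PR: `FakeMtml` and its method names are hypothetical stand-ins for the real binding, and only the NVML-style names (`nvmlDeviceGetCount`, `nvmlDeviceGetHandleByIndex`, `nvmlDeviceGetMemoryInfo`) follow the NVML naming that callers already expect.

```python
from collections import namedtuple

# Mirrors the total/free/used fields of NVML's memory-info result.
MemoryInfo = namedtuple('MemoryInfo', ['total', 'free', 'used'])


class FakeMtml:
    """Hypothetical stand-in for an MTML binding; these method names are invented."""

    def device_count(self):
        return 2

    def memory_total_bytes(self, index):
        return 80 * 1024**3

    def memory_used_bytes(self, index):
        return 16 * 1024**3


class NvmlStyleAdapter:
    """Exposes NVML-style function names on top of a differently named backend."""

    def __init__(self, backend):
        self._backend = backend

    def nvmlDeviceGetCount(self):
        return self._backend.device_count()

    def nvmlDeviceGetHandleByIndex(self, index):
        # A real binding returns an opaque handle; the index stands in for it here.
        return index

    def nvmlDeviceGetMemoryInfo(self, handle):
        total = self._backend.memory_total_bytes(handle)
        used = self._backend.memory_used_bytes(handle)
        return MemoryInfo(total=total, free=total - used, used=used)


adapter = NvmlStyleAdapter(FakeMtml())
handle = adapter.nvmlDeviceGetHandleByIndex(0)
info = adapter.nvmlDeviceGetMemoryInfo(handle)
print(adapter.nvmlDeviceGetCount(), info.total, info.used, info.free)
```

Because callers only see NVML-style names, code written against NVML would not need to change when the adapter is swapped in, which is the property the wrapper layer is meant to provide.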


Motivation and Context

nvitop currently relies on NVIDIA NVML, which makes it unusable on systems equipped with MTGPU devices.
In such environments, users lack a lightweight, top-like GPU monitoring tool.

This PR aims to:

  • Extend nvitop to support MTGPU-based platforms
  • Preserve existing behavior on NVIDIA GPUs
  • Minimize impact on the current code structure

Design & Implementation

  • Introduced a new backend based on mtml, parallel to the existing NVML backend
  • Runtime detection is used to select the appropriate backend:
    • nvml → NVIDIA GPUs
    • mtml → MTGPU devices
  • Implemented a compatibility layer to map MTGPU APIs to nvitop's internal data structures
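
The runtime detection described in the list above could look roughly like the following sketch. It is an illustration under stated assumptions, not the selection logic in this PR: `select_backend` and `CANDIDATES` are hypothetical names, `pynvml` is the usual NVML binding module, and `pymtml` is assumed to be the module installed by mthreads-ml-py.

```python
import importlib.util

# Candidate backends in priority order: (backend label, importable module name).
# 'pymtml' is an assumed module name for the mthreads-ml-py package.
CANDIDATES = (('nvml', 'pynvml'), ('mtml', 'pymtml'))


def module_installed(module_name):
    """Return True if the top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def select_backend(candidates=CANDIDATES, is_available=module_installed):
    """Return the label of the first available backend, or None if none is usable."""
    for label, module_name in candidates:
        if is_available(module_name):
            return label
    return None


# The availability check is injectable, so the selection order can be
# exercised on a machine that has neither library installed.
print(select_backend(is_available=lambda name: name == 'pymtml'))
print(select_backend(is_available=lambda name: False))
```

Checking that a module is installed is only a first approximation. A fuller version would presumably also initialize the library and confirm that at least one device is visible before committing to a backend.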

Currently Supported Features (MTGPU)

  • Driver version
  • GPU device enumeration
  • Total / used memory reporting
  • Basic utilization metrics
  • Power usage

Not Yet Supported

  • MIG-related features
  • Process enumeration and utilization
  • CUDA driver version information
  • Persistence mode
  • Bus ID information
  • Advanced performance counters (not available in mtml)

Testing

Tested on:

  • MTGPU platform with mtml

Manual test cases include:

  • nvitop startup and refresh
  • MTGPU information display
  • Memory usage display
  • Error handling when NVML is not present

Basic API test:

from nvitop import Device

count = Device.count()
print(f'There are {count} MUSA devices')
devices = Device.all()

for device in devices:
    # Process enumeration is not yet supported on MTGPU, so this mapping is empty for now.
    processes = device.processes()
    sorted_pids = sorted(processes)

    print(device)
    print(f'  - Fan speed:       {device.fan_speed()}%')
    print(f'  - Temperature:     {device.temperature()}C')
    print(f'  - GPU utilization: {device.gpu_utilization()}%')
    print(f'  - Total memory:    {device.memory_total_human()}')
    print(f'  - Used memory:     {device.memory_used_human()}')
    print(f'  - Free memory:     {device.memory_free_human()}')
    print(f'  - Processes ({len(processes)}): {sorted_pids}')
    for pid in sorted_pids:
        print(f'    - {processes[pid]}')
    print('-' * 120)

Output:

There are 8 MUSA devices
PhysicalDevice(index=0, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     52C
  - GPU utilization: 0%
  - Total memory:    80.00GiB
  - Used memory:     78.88GiB
  - Free memory:     1148MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=1, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     67C
  - GPU utilization: 99%
  - Total memory:    80.00GiB
  - Used memory:     73.63GiB
  - Free memory:     6519MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=2, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     67C
  - GPU utilization: 99%
  - Total memory:    80.00GiB
  - Used memory:     71.03GiB
  - Free memory:     9187MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=3, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     59C
  - GPU utilization: 59%
  - Total memory:    80.00GiB
  - Used memory:     78.23GiB
  - Free memory:     1810MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=4, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     77C
  - GPU utilization: 99%
  - Total memory:    80.00GiB
  - Used memory:     73.39GiB
  - Free memory:     6765MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=5, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     69C
  - GPU utilization: 99%
  - Total memory:    80.00GiB
  - Used memory:     72.68GiB
  - Free memory:     7497MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=6, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     78C
  - GPU utilization: 99%
  - Total memory:    80.00GiB
  - Used memory:     75.62GiB
  - Free memory:     4480MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------
PhysicalDevice(index=7, name='MTT S5000', total_memory=80.00GiB)
  - Fan speed:       0%
  - Temperature:     63C
  - GPU utilization: 99%
  - Total memory:    80.00GiB
  - Used memory:     72.48GiB
  - Free memory:     7702MiB
  - Processes (0): []
------------------------------------------------------------------------------------------------------------------------

Future Work

  • Extend MTGPU metrics as mtml evolves
  • Add automated tests for backend selection
  • Improve feature parity where possible

Images / Videos

image

@gingerXue gingerXue closed this Dec 22, 2025
@gingerXue gingerXue reopened this Dec 22, 2025