Skip to content

Conversation

Copy link

Copilot AI commented Jan 21, 2026

Summary

This PR implements per-model CPU execution capabilities limited to text encoder submodels only, providing VRAM savings while maintaining optimal performance.

Scope: When CPU-only is enabled for a main model, only the text encoder submodels (TextEncoder, TextEncoder2, TextEncoder3) run on CPU. The denoiser and other submodels always run on GPU for optimal performance. Conditioning tensors are automatically moved to CPU for storage and to GPU when needed by the denoiser.

Frontend Changes:

  • Simplified CPU Only toggle to a single switch control (removed dual toggle UI)
  • Fixed TypeScript compilation errors for cpu_only field in MainModelDefaultSettings
  • Added 'cpuOnly' to Feature type union
  • Updated popover translation to clarify that only text encoders run on CPU, not the entire model

Backend Changes:

  • Modified _get_execution_device() to check submodel_type parameter - CPU execution only applied to text encoder submodels
  • Added cpu_only field to standalone text encoder configurations (T5Encoder, Qwen3Encoder variants) to support future frontend UI
  • Updated model loader to respect cpu_only for both main model text encoder submodels and standalone encoders
  • All text encoder invocations now explicitly move conditioning tensors to CPU for storage (saves VRAM)

Critical Bug Fixes:

  • Fixed model cache device placement bug: Text encoders were loading onto CUDA even when CPU-only was enabled. Root cause: lock() method checked cache's default _execution_device instead of model-specific compute_device. Fixed by adding compute_device property to cached model classes and updating lock() to check model-specific device setting.
  • Fixed device mismatch error where text encoders running on CPU received inputs on CUDA by updating all text encoder invocations to use text_encoder.device instead of TorchDevice.choose_torch_device():
    • compel.py (SD1.5 and SDXL variants)
    • sd3_text_encoder.py (CLIP and T5 encoders)
    • flux_text_encoder.py (CLIP and T5 encoders)
    • cogview4_text_encoder.py (GLM encoder)
    • z_image_text_encoder.py (Qwen3 encoder)
  • Updated log messages to display actual model device instead of cache default
  • Ensures input tensors are created on the same device as the model, preventing RuntimeError
  • Conditioning tensors moved to CPU for storage, automatically transferred to GPU when used by denoiser

Related Issues / Discussions

Part of #44 - Per-model CPU execution feature

QA Instructions

  1. Frontend Linting: Run pnpm lint:tsc in invokeai/frontend/web - should pass without errors
  2. Main Models Tab: Navigate to Model Manager → Main Models tab - should load without errors
  3. CPU Only UI: Select a model → Default Settings tab → verify single "Run model on CPU only" toggle displays with updated help popover explaining text encoder scope
  4. CPU Execution: Enable CPU-only for a model and generate an image - should work without device mismatch errors
  5. VRAM Savings: Monitor VRAM usage - text encoders should run on CPU while denoiser runs on GPU
  6. Log Verification: Check logs during generation - text encoder models should show "onto cpu device" not "onto cuda device" when CPU-only is enabled
  7. Performance: Verify generation performance is maintained (only text encoding slightly slower, denoising at full GPU speed)
  8. Model Type Coverage: Test with SD1.5, SDXL, Z-Image, FLUX, and SD3 models to verify CPU-only works across all model types
  9. Standalone Encoders: Verify T5 and Qwen3 standalone encoders respect cpu_only configuration in their model configs (backend support ready for future frontend UI)

Merge Plan

N/A

Checklist

  • The PR has a short but descriptive title, suitable for a changelog
  • Tests added / updated (if applicable)
  • ❗Changes to a redux slice have a corresponding migration
  • Documentation added / updated (if applicable)
  • Updated What's New copy (if doing a release after this PR)

✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.

Co-authored-by: lstein <111189+lstein@users.noreply.github.com>
@lstein lstein marked this pull request as ready for review January 21, 2026 05:28
@lstein lstein self-requested a review as a code owner January 21, 2026 05:28
Copilot AI changed the title [WIP] Add per-model CPU execution configuration Fix TypeScript type definitions for cpu_only field Jan 21, 2026
Co-authored-by: lstein <111189+lstein@users.noreply.github.com>
Copy link
Author

Copilot AI commented Jan 21, 2026

@copilot When I try to go to the main models tab, I get this error:

...

Fixed the runtime error by adding the missing popover translation for the cpuOnly feature. The InformationalPopover component was trying to call .map() on undefined paragraphs because the translation entry was missing.

Added translation with heading "CPU Only" and two paragraph descriptions explaining the feature. Commit: 02766a0

Co-authored-by: lstein <111189+lstein@users.noreply.github.com>
Copilot AI changed the title Fix TypeScript type definitions for cpu_only field Fix frontend types and translations for cpu_only field Jan 21, 2026
@lstein
Copy link
Owner

lstein commented Jan 21, 2026

@copilot Great! A couple of things:

  1. Could you change the UI so that there is only a single slider button needed to turn on Run model on CPU only?
  2. Add the CPU only option to the text encoders "T5 Encoder" and "Qwen3 Encoder" and make the appropriate backend changes.

Co-authored-by: lstein <111189+lstein@users.noreply.github.com>
Copy link
Author

Copilot AI commented Jan 21, 2026

@copilot Great! A couple of things:

  1. Could you change the UI so that there is only a single slider button needed to turn on Run model on CPU only?
  2. Add the CPU only option to the text encoders "T5...

Completed both requests:

  1. Simplified UI: Changed to a single switch toggle (removed the separate enable/disable toggle). The switch now controls both enabling the setting and its value.

  2. Added encoder support: Added cpu_only field to all T5Encoder and Qwen3Encoder config classes, and updated the model loader to respect it.

  3. Fixed critical device mismatch bug: The main issue was that text encoders were using TorchDevice.choose_torch_device() which always returns CUDA, even when the model is on CPU. Fixed by using text_encoder.device instead in:

    • compel.py (2 instances)
    • sd3_text_encoder.py (2 instances)
    • cogview4_text_encoder.py
    • z_image_text_encoder.py

This ensures inputs are created on the same device as the model, preventing the RuntimeError you encountered. Commit: 918d21f

Copilot AI changed the title Fix frontend types and translations for cpu_only field [Feature] Per-model CPU execution with encoder support and device fixes Jan 21, 2026
@lstein
Copy link
Owner

lstein commented Jan 21, 2026

@copilot Thank you for your work on this. There are still tensor device mismatch errors, but I've also rethought the scope of the work. I think that the CPU-only option should be limited to standalone text encoders and to the text encoder sub-model of Main models. The core issue is that the conditioning tensors returned by a text encoder running on CPU needs to be moved onto the GPU in order to be used as the input to the next phase of the pipeline, usually the denoiser. So this is a special case that needs to be specifically handled. So could you change the code such that when a Main model is marked CPU-only, only the text encoder component is executed on CPU and its conditioning outputs are moved to GPU before passing to the denoiser? Similarly, the standalone encoders should have a CPU-only option in the backend and frontend, and their outputs should be handled similarly.

Let me know if you see any problems with this approach.

…moved to CPU for storage

Co-authored-by: lstein <111189+lstein@users.noreply.github.com>
Copy link
Author

Copilot AI commented Jan 21, 2026

@copilot Thank you for your work on this. There are still tensor device mismatch errors, but I've also rethought the scope of the work. I think that the CPU-only option should be limited to standalone...

Implemented the scope change to limit CPU-only execution to text encoders only:

Backend changes:

  1. Modified _get_execution_device() to check submodel_type - CPU execution only applies to TextEncoder, TextEncoder2, and TextEncoder3 submodels
  2. Denoisers, VAEs, and other submodels always use the default device (GPU)
  3. Added explicit .detach().to("cpu") to all text encoder invocations (SD3, FLUX, CogView4, Z-Image) to move conditioning tensors to CPU for storage
  4. Standalone encoders (T5, Qwen3) keep their direct cpu_only field support

Frontend changes:
Updated popover text to clarify that only text encoders run on CPU, not the entire model.

How it works:

  • Text encoders run on CPU when enabled (saves VRAM)
  • Conditioning tensors moved to CPU for storage
  • When the denoiser needs them, they're automatically moved to GPU
  • Denoiser always runs on GPU for optimal performance

This approach balances VRAM savings with performance. Commit: e850b78

Copilot AI changed the title [Feature] Per-model CPU execution with encoder support and device fixes [Feature] CPU execution for text encoders with automatic device management Jan 21, 2026
@lstein
Copy link
Owner

lstein commented Jan 21, 2026

@copilot I'm not sure this is working as expected. I selected the diffusers Z-Image Turbo Main model in the Model Manager tab, and selected "Run model on CPU only." I then started a generation using this model. But prior to denoising, the log shows:

[2026-01-21 10:32:54,838]::[ModelManagerService]::INFO --> [MODEL CACHE] Loaded model '9897a18f-56a8-4b83-a30c-573e8326799d:text_encoder' (Qwen3Model) onto cuda device in 1.06s. Total model size: 7672.25MB, VRAM: 7672.25MB (100.0%)
[2026-01-21 10:32:54,839]::[ModelManagerService]::INFO --> [MODEL CACHE] Loaded model '9897a18f-56a8-4b83-a30c-573e8326799d:tokenizer' (Qwen2Tokenizer) onto cuda device in 0.00s. Total model size: 0.00MB, VRAM: 0.00MB (0.0%)

I don't expect to see the message about the Qwen3Model being loaded into CUDA.

Co-authored-by: lstein <111189+lstein@users.noreply.github.com>
Copy link
Author

Copilot AI commented Jan 21, 2026

@copilot I'm not sure this is working as expected. I selected the diffusers Z-Image Turbo Main model in the Model Manager tab, and selected "Run model on CPU only." I then started a generation using t...

Fixed the issue. The problem was in the lock() method of the model cache - it was only checking the cache's default _execution_device (CUDA) instead of the model-specific compute_device that was set when adding the model to the cache.

Changes made:

  1. Added compute_device property to both cached model wrapper classes
  2. Modified lock() to check the model's specific compute_device before loading to VRAM
  3. Models configured for CPU now skip VRAM loading entirely
  4. Updated log message to display the model's actual device

Text encoders with cpu_only=True will now correctly load on CPU and log messages will show "onto cpu device" instead of "onto cuda device". This applies to both main model text encoders (SD1.5, SDXL, Z-Image, etc.) and standalone encoders. Commit: 18532b8

Copy link
Owner

@lstein lstein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Working as expected. Will correct linting errors after merge with lstein/per-model-cpu-execution branch.

@lstein lstein merged commit 7ca2716 into lstein/per-model-cpu-execution Jan 21, 2026
10 of 13 checks passed
@lstein lstein deleted the copilot/sub-pr-46 branch January 21, 2026 16:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants