Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation

Sigma marks a significant step forward in multi-modal semantic segmentation, enabling AI agents to understand complex scenes even under suboptimal conditions such as low light or overexposure. Sigma is built on the Selective Structured State Space Model, known as Mamba, which sets it apart from conventional architectures: it achieves a global receptive field with complexity that is linear in sequence length, rather than the quadratic complexity of self-attention in Vision Transformers (ViTs).
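
To make the complexity contrast concrete, here is a minimal, illustrative PyTorch sketch of a selective state space scan. It is not the paper's implementation: the module name, the sigmoid-based stabilization of the state transition, and all shapes are assumptions for demonstration only. The point it shows is that the recurrence visits each token once, so cost grows linearly with sequence length, unlike the pairwise attention matrix of a ViT.

```python
import torch
import torch.nn as nn

class SelectiveScan(nn.Module):
    """Toy selective state space scan (illustrative only; the real Mamba
    kernel uses ZOH discretization and a hardware-aware parallel scan)."""

    def __init__(self, dim: int, state: int):
        super().__init__()
        self.A = nn.Parameter(torch.randn(dim, state))  # state transition params
        self.B_proj = nn.Linear(dim, state)  # input-dependent B: the "selective" part
        self.C_proj = nn.Linear(dim, state)  # input-dependent readout C

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); one sequential pass -> O(seq_len) cost,
        # versus the O(seq_len^2) attention matrix of a ViT block.
        batch, seq_len, dim = x.shape
        decay = torch.sigmoid(self.A)                 # keep the recurrence stable (assumption)
        h = x.new_zeros(batch, dim, decay.shape[-1])  # per-channel hidden state
        outputs = []
        for t in range(seq_len):
            xt = x[:, t]                              # (batch, dim)
            Bt = self.B_proj(xt).unsqueeze(1)         # (batch, 1, state)
            Ct = self.C_proj(xt).unsqueeze(1)         # (batch, 1, state)
            h = decay * h + Bt * xt.unsqueeze(-1)     # recurrent state update
            outputs.append((h * Ct).sum(dim=-1))      # project state back to dim
        return torch.stack(outputs, dim=1)            # (batch, seq_len, dim)
```

Because the hidden state carries information across the entire sequence, every output token can depend on every input token, giving a global receptive field at linear cost; for images, the tokens come from flattening the feature map into a sequence.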

  • Sigma employs a Siamese (weight-shared) encoder to extract features from different modalities such as RGB, thermal, and depth (see the sketch after this list).
  • A novel Mamba-based fusion mechanism selectively integrates the essential information from each modality.
  • A specially designed channel-aware decoder improves channel-wise modeling.
  • In evaluations, Sigma outperforms prior models on RGB-Thermal and RGB-Depth segmentation benchmarks, marking the first successful application of State Space Models (SSMs) to multi-modal perception tasks.
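
The pipeline in the bullets above can be pictured with the following hypothetical PyTorch skeleton: one weight-shared (Siamese) backbone encodes each modality, a fusion step merges the two feature streams, and a lightweight decoder maps fused features to per-pixel classes. The module names, the simple concatenation-based fusion, and the convolutional decoder are placeholders, not Sigma's exact design, whose fusion and decoder are built from Mamba blocks.

```python
import torch
import torch.nn as nn

class SiameseSegmenter(nn.Module):
    """Skeleton of a Siamese two-stream segmenter (illustrative; the fusion
    rule and decoder here are stand-ins, not Sigma's actual modules)."""

    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone                       # shared weights = Siamese
        self.fuse = nn.Conv2d(2 * feat_dim, feat_dim, kernel_size=1)
        self.decoder = nn.Sequential(                  # placeholder for a
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1),  # channel-aware decoder
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_dim, num_classes, 1),
        )

    def forward(self, rgb: torch.Tensor, aux: torch.Tensor) -> torch.Tensor:
        # aux is the second modality, e.g. a thermal or depth map
        # replicated to 3 channels so the shared encoder accepts it.
        f_rgb = self.backbone(rgb)                     # same encoder for both
        f_aux = self.backbone(aux)                     # streams (weight sharing)
        fused = self.fuse(torch.cat([f_rgb, f_aux], dim=1))
        logits = self.decoder(fused)                   # (batch, classes, H', W')
        return nn.functional.interpolate(
            logits, size=rgb.shape[-2:], mode="bilinear", align_corners=False
        )

# Usage with a toy one-layer "backbone" standing in for a Mamba encoder:
backbone = nn.Conv2d(3, 64, 3, padding=1)
model = SiameseSegmenter(backbone, feat_dim=64, num_classes=9)
out = model(torch.randn(1, 3, 480, 640), torch.randn(1, 3, 480, 640))
```

Sharing one encoder across modalities keeps the parameter count down and maps both inputs into a common feature space, which is what makes a simple downstream fusion step meaningful.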

The project’s GitHub repository provides code and resources for further exploration.

Sigma’s approach addresses the computational-complexity bottleneck in multi-modal segmentation, offering a scalable and robust solution. It opens the door for future research into SSM applications across diverse AI and computer vision challenges.
