CVPR 2025 | MobileMamba: A new breakthrough in lightweight Mamba networks, combining multiple receptive fields, efficient inference, and high accuracy

1. Overview at a glance

MobileMamba is a lightweight, multi-receptive-field visual Mamba network. Through a three-stage network design and the MRFFI (Multi-Receptive Field Feature Interaction) module, it improves inference speed while achieving higher accuracy, surpassing existing CNN-, ViT-, and Mamba-based models.

2. Core Issues

Current lightweight vision models are mainly based on CNNs and Transformers:

A CNN's local receptive field limits its global modeling capability.

A Transformer has a global receptive field, but the cost of self-attention grows quadratically with the number of tokens N, i.e. O(N²), which is expensive at high resolution.

Existing lightweight Mamba models have low FLOPs but slow inference.
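
To make the O(N²) point concrete, the short sketch below counts tokens and pairwise attention interactions as the input resolution grows. The patch size of 16 and the function names are assumptions chosen for illustration; they are not taken from the paper.

```python
def num_tokens(height, width, patch=16):
    """Number of tokens when an image is cut into non-overlapping patches.

    The patch size of 16 is an illustrative assumption.
    """
    return (height // patch) * (width // patch)


def attention_pairs(n_tokens):
    """Self-attention compares every token with every token: N * N pairs."""
    return n_tokens * n_tokens


for side in (224, 512, 1024):
    n = num_tokens(side, side)
    # A linear-time sequence model scales with N per layer,
    # while self-attention scales with N squared.
    print(f"{side}x{side}: N = {n}, N^2 = {attention_pairs(n)}")
```

Going from 224×224 to 1024×1024 multiplies the token count by about 21, but multiplies the number of attention pairs by about 437, which is why linear-complexity models such as Mamba are attractive at high resolution.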

MobileMamba aims to:

Speed up Mamba inference, raising throughput while keeping FLOPs low.

Enhance multi-scale receptive field interaction, capturing both long- and short-range features as well as high-frequency detail.

Adapt to high-resolution tasks and improve performance in tasks such as classification, object detection, and semantic segmentation.

3. Technical highlights

(1) Three-stage network design

• After weighing four-stage against three-stage networks, the authors choose a three-stage architecture, which gives higher accuracy at the same throughput, or higher throughput at the same accuracy.

(2) MRFFI (Multi-Receptive Field Feature Interaction) module

WTE-Mamba (Long-Range Wavelet Transform-Enhanced Mamba): combines global modeling with extraction of high-frequency edge information.

MK-DeConv (Multi-Kernel Depthwise Convolution): extracts information at different scales and enhances the local receptive field.

Eliminate Redundant Identity: reduces channel redundancy and improves computational efficiency.
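
One way to picture how the three components fit together is a channel-wise split, where each group of channels goes through a different branch and the results are concatenated. The sketch below only shows that routing. It assumes a channel-wise split; the split ratios, the function name `mrffi_sketch`, and the two branch callables are hypothetical placeholders, not the paper's implementation.

```python
def mrffi_sketch(channels, global_fn, local_fn, global_ratio=0.5, local_ratio=0.25):
    """Route channel groups through three branches and concatenate the results.

    channels: list of per-channel feature maps (any objects).
    global_fn / local_fn: hypothetical stand-ins for WTE-Mamba and MK-DeConv.
    The ratios are illustrative assumptions, not values from the paper.
    """
    n_global = int(len(channels) * global_ratio)
    n_local = int(len(channels) * local_ratio)

    global_part = channels[:n_global]
    local_part = channels[n_global:n_global + n_local]
    identity_part = channels[n_global + n_local:]  # passed through untouched

    return global_fn(global_part) + local_fn(local_part) + identity_part


# Toy usage: each "channel" is just a label, so the routing is visible.
feats = [f"c{i}" for i in range(8)]
out = mrffi_sketch(
    feats,
    global_fn=lambda chs: [c + ":global" for c in chs],
    local_fn=lambda chs: [c + ":local" for c in chs],
)
print(out)
```

Channels left on the identity path cost nothing to compute, which is one plausible reading of how the third component improves efficiency.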

(3) Training & Testing Strategy Optimization

Knowledge Distillation improves the learning ability of lightweight models.

Extended Training Epochs further raise the accuracy ceiling.

Normalization Layer Fusion accelerates inference at test time.
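
Normalization layer fusion can be illustrated with the standard trick of folding BatchNorm statistics into the preceding layer. The sketch below assumes the normalization layer is BatchNorm in inference mode and the preceding layer is linear (a 1×1 convolution behaves the same way per pixel). The article does not say which layers MobileMamba fuses, so treat the helper names and the toy numbers as illustrative only.

```python
import math


def fuse_linear_bn(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm statistics into the preceding layer's weights and bias.

    weight: list of rows, one row per output channel.
    bias, gamma, beta, mean, var: one value per output channel.
    """
    fused_w, fused_b = [], []
    for row, b, g, bt, m, v in zip(weight, bias, gamma, beta, mean, var):
        scale = g / math.sqrt(v + eps)
        fused_w.append([w * scale for w in row])
        fused_b.append((b - m) * scale + bt)
    return fused_w, fused_b


def linear(x, weight, bias):
    """Plain matrix-vector product plus bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def batchnorm_eval(y, gamma, beta, mean, var, eps=1e-5):
    """BatchNorm at inference time, using fixed running statistics."""
    return [g * (yi - m) / math.sqrt(v + eps) + bt
            for yi, g, bt, m, v in zip(y, gamma, beta, mean, var)]


# Toy check: two separate layers versus the single fused layer.
W, b = [[1.0, 2.0], [0.5, -1.0]], [0.1, -0.2]
gamma, beta, mean, var = [1.5, 0.8], [0.0, 0.3], [0.2, -0.1], [4.0, 0.25]
x = [3.0, -2.0]

reference = batchnorm_eval(linear(x, W, b), gamma, beta, mean, var)
fw, fb = fuse_linear_bn(W, b, gamma, beta, mean, var)
fused = linear(x, fw, fb)
print(reference, fused)
```

The fused layer produces the same output with one operation instead of two, so the saving comes at no cost in accuracy and applies only at test time, when the running statistics are frozen.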

4. Methodological framework

MobileMamba optimizes inference and feature extraction through the following core steps:

(1) Multi-receptive field feature interaction (MRFFI)

WTE-Mamba extracts long-range information, and a wavelet transform enhances high-frequency features.

MK-DeConv uses convolution kernels of different sizes to mix local information and improve multi-scale perception.

The Eliminate Redundant Identity component reduces computational cost and improves inference speed by cutting channel redundancy.
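
To show why a wavelet transform helps with high-frequency detail, the sketch below applies a single-level 2D Haar transform to a tiny image containing a vertical edge. The choice of the Haar wavelet and the function name `haar2d` are assumptions for illustration; the article does not state which wavelet MobileMamba uses or how the sub-bands are processed afterwards.

```python
def haar2d(image):
    """Single-level 2D Haar transform of an image with even height and width.

    Returns four half-resolution sub-bands: a low-frequency approximation
    and three high-frequency detail bands.
    """
    approx, d_cols, d_rows, d_diag = [], [], [], []
    for r in range(0, len(image), 2):
        row_a, row_c, row_r, row_d = [], [], [], []
        for c in range(0, len(image[0]), 2):
            a, b = image[r][c], image[r][c + 1]
            e, f = image[r + 1][c], image[r + 1][c + 1]
            row_a.append((a + b + e + f) / 2)  # scaled local sum (low frequency)
            row_c.append((a - b + e - f) / 2)  # change between left and right columns
            row_r.append((a + b - e - f) / 2)  # change between top and bottom rows
            row_d.append((a - b - e + f) / 2)  # diagonal change
        approx.append(row_a)
        d_cols.append(row_c)
        d_rows.append(row_r)
        d_diag.append(row_d)
    return approx, d_cols, d_rows, d_diag


# Toy image: a dark first column next to a bright region (a vertical edge).
image = [[0, 9, 9, 9] for _ in range(4)]
approx, d_cols, d_rows, d_diag = haar2d(image)
print("approx:", approx)
print("column detail:", d_cols)
print("row detail:", d_rows)
print("diagonal detail:", d_diag)
```

Only the column-detail band is non-zero, and only where the edge sits. The transform therefore separates edge information from the smooth background, which is the kind of high-frequency signal the WTE-Mamba branch is described as enhancing.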

(2) Lightweight Mamba structure

• A three-stage design is used to reduce the amount of computation and improve throughput.

• Multi-directional scanning is combined with low-rank state-space mapping to improve computational efficiency.

(3) Optimizing training and inference

Knowledge distillation: the small model learns from a stronger teacher model to improve its performance.

Extended training: experiments show that the model has not fully converged after 300 epochs, and extending training to 1000 epochs improves accuracy.

Normalization layer fusion: removes redundant computation and improves efficiency during inference.
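
For readers unfamiliar with knowledge distillation, the sketch below shows a generic soft-label distillation loss: a weighted sum of the usual cross-entropy and the KL divergence between temperature-softened teacher and student predictions. This is the common textbook formulation, not necessarily the recipe used for MobileMamba. The values of `alpha` and `temperature` and the toy logits are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=2.0):
    """Weighted sum of hard-label cross-entropy and soft-label KL divergence.

    alpha and temperature are illustrative defaults, not values from the paper.
    """
    # Hard-label term: cross-entropy against the ground-truth class.
    ce = -math.log(softmax(student_logits)[label])

    # Soft-label term: KL(teacher || student) on temperature-softened outputs.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)

    # The T^2 factor keeps the soft-term gradients on a comparable scale.
    return (1 - alpha) * ce + alpha * temperature ** 2 * kl


teacher = [2.0, 1.0, 0.0]
loss_matching = distillation_loss([2.0, 1.0, 0.0], teacher, label=0)
loss_mismatch = distillation_loss([0.0, 1.0, 2.0], teacher, label=0)
print(loss_matching, loss_mismatch)
```

A student that agrees with the teacher pays no soft-label penalty, while one that disagrees is penalized on both terms, which is how the teacher's knowledge steers the smaller model.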

5. Quick Overview of Experimental Results

MobileMamba performs strongly across multiple benchmarks:

ImageNet-1K classification

MobileMamba-B4 reaches 83.6% Top-1 accuracy, +1.8% over EfficientVMamba, with 3.5× faster inference.

Object Detection (COCO)

Mask R-CNN: improves mAP by 1.3 and throughput by 57% compared with EMO.

RetinaNet: improves mAP by 2.1 and inference speed by 4.3× compared with EfficientVMamba.

Semantic Segmentation (ADE20K)

Semantic FPN: improves mIoU by 1.1 compared with EdgeViT, using only 20% of the FLOPs.

PSPNet: improves mIoU by 0.4 compared with MobileViTv2, using only 11% of the FLOPs.

6. Practical value and application

Edge device visual computing: suitable for resource-constrained scenarios such as smartphones, embedded devices, and the Internet of Things (IoT).

Autonomous driving and monitoring: provides efficient visual computing in high-resolution scenarios, suitable for object detection and segmentation tasks.

Medical image analysis: extracts key medical image features through its multi-receptive-field design to improve diagnostic efficiency.

7. Open Questions

Is MobileMamba’s multi-receptive field feature interaction strategy applicable to other tasks such as video understanding or 3D vision?

How can MobileMamba be further optimized to improve inference speed on CPUs and mobile devices?

Can LoRA or other parameter-efficient fine-tuning methods be combined with MobileMamba to improve its adaptability to specific tasks?
