GrootVL: Tree Topology is All You Need in State Space Model
Abstract
State space models, which employ recursively propagated features, demonstrate strong representation capabilities comparable to Transformer models and superior efficiency. However, limited by the inherent geometric constraints of sequences, they still fall short in modeling long-range dependencies. To address this issue, we propose the GrootVL network, which first dynamically generates a tree topology based on spatial relationships and input features. Then, feature propagation is performed based on this graph, thereby breaking the original sequence constraints to achieve stronger representation capabilities. Additionally, we introduce a linear complexity dynamic programming algorithm to enhance long-range interactions without increasing computational cost. GrootVL is a versatile multimodal framework that can be applied to both visual and textual tasks. Extensive experiments demonstrate that our method significantly outperforms existing structured state space models on image classification, object detection and segmentation. Besides, by fine-tuning large language models, our approach achieves consistent improvements in multiple textual tasks at minor training cost. Code is available at https://github.com/EasonXiao-888/GrootVL.
1 Introduction
Mainstream fundamental models are primarily based on CNN [27, 57, 41, 29, 13] and Transformer architectures [15, 40, 39, 54, 14], which dominate in visual and language tasks. However, the small receptive field of CNNs and the high complexity of Transformers make it challenging to strike a good balance between effectiveness and efficiency. The state space models (SSMs) [21, 23, 48] attempt to break this impasse by modeling sequences in a recurrent form. Different from the previous recurrent neural networks [28, 7], these approaches draw inspiration from control systems, leveraging structural parameter initialization to attain stable optimization and superior computing performance. Nevertheless, they remain susceptible to the intrinsic flaw shared by recurrent neural networks, i.e., a deficiency in capturing long-range dependencies.
Recently, an improved selection mechanism known as Mamba [18] is proposed to mitigate the challenges of SSMs. This approach introduces weight modulation during the propagation process, which substantially enlarges the effective receptive field and achieves impressive performance in NLP tasks. Besides, numerous studies aim to extend Mamba into computer vision, by employing various pre-defined strategies to map 2D image features into 1D sequences. ViM [70] and VMamba [38] utilize a multi-directional raster-scanning strategy, while LocalMamba [31] further confines its propagation range within a local window. They have successfully adapted Mamba to image inputs. Nevertheless, as shown in Fig. 1(a), both raster-scanning and local-scanning strategies introduce spatial discontinuities between adjacent pixels, and feature transformations in Mamba rely on the feature relationships, thereby impeding the effective information flow in a sequence. Additionally, PlainMamba [62] introduces a continuous scanning strategy, aiming to alleviate this issue by simply adjusting the propagation direction at discontinuous positions. However, all these methods rely on fixed propagation trajectories, which ignore the inherent spatial structure and cannot dynamically adjust the topology based on input. Therefore, this paper endeavors to explore a new perspective: introducing an input-aware topological network for feature propagation in state space models.
To achieve this, we develop a tree state space model and propose a new framework, termed GrootVL, which adaptively generates a tree topology based on the input feature and then performs feature propagation on it. Specifically, two sub-networks, GrootV and GrootL, are designed for visual and language tasks respectively, which are illustrated in Fig. 1(b) and Fig. 1(d). For visual tasks, motivated by [64, 50], we first utilize the dissimilarity between adjacent features to construct a minimum spanning tree on a four-connected planar graph. This process can adaptively encode the spatial and semantic information into a tree graph [64, 50]. Then, we iteratively traverse each pixel, considering it as the root vertex, and aggregate the features of other pixels using the state transition function of Mamba. Intuitively, this operation requires two levels of traversal across the entire pixel set, resulting in an unacceptable quadratic complexity relative to the number of pixels. However, given that the tree graph is acyclic, we propose a dynamic programming algorithm to achieve linear complexity propagation. With such an input-aware tree topology, our approach enables more effective long-range interactions while maintaining consistent linear complexity with Mamba. Furthermore, our method can also be applied to language tasks by constructing a tree topology based on the dissimilarity between token features, which overcomes the geometrical constraints of the text sequence. Using a similar aggregation process as GrootV, GrootL can significantly enhance the language representation of a pre-trained Large Language Model [18].
We conduct extensive experiments to validate the effectiveness of GrootV on multiple visual benchmarks, i.e., image classification on ImageNet [12], object detection and instance segmentation on MSCOCO [36] as well as semantic segmentation on ADE20K [68]. Results show that our method notably outperforms existing SSM-based methods for all benchmarks and achieves competitive performance with CNN and Transformer-based approaches. Moreover, with LoRA finetuning [30], GrootL demonstrates consistent improvements for a pre-trained large language model at minor training cost.
2 Related Work
2.1 Conventional Vision Foundation Models
The evolution of deep neural networks has been a significant catalyst in machine vision perception. CNN-based models [27, 47, 32, 24, 56, 65, 35, 51, 66] first emerged as pivotal landmarks, with ResNet [27] notably standing out for its inventive residual connection module, garnering widespread adoption across diverse domains of visual recognition. Furthermore, more efficient convolution operations are formulated, such as depth-wise convolutions introduced by MobileNet [29], paving the way for lightweight models. Additionally, deformable convolution [10] has been proposed to enhance the receptive field. Subsequently, ViT [15] has significantly improved the vision recognition paradigm. It reformulates the architecture design and training mechanism by adopting the transformer architecture from natural language processing, aiming to improve computational efficiency and broaden the scope of applications. Afterwards, research has centred on hierarchical ViTs [40, 39, 11, 58, 14, 52, 5], which design networks by gradually decreasing feature resolution across the backbone. Furthermore, recent research built on CNN serves to re-emphasize the capabilities of convolutional networks. For example, InternImage [57] presents a large model based on deformable CNN, while UniRepLKNet [13] exhibits significant performance through large kernel convolution.
2.2 Explorations about State Space Models
State space models (SSMs) have emerged as a novel class of models within the deep learning paradigm, showing significant potential for sequence transforming [22, 21, 48]. These methods have attracted significant attention due to their linear scalability with sequence length. The early method, LSSL [22], draws inspiration from continuous state space models in control systems and attempts to address the long-range dependency problem through a combination with HIPPO [19] initialization. S4 [21] proposes to normalize the parameters into a diagonal matrix, prompting a subsequent series of research on structured SSMs [23, 20, 25, 18]. Recently, the Selective State Space Model [18], known as Mamba, strikes a balance between effectiveness and efficiency through the design of an input-dependent parameter initialization strategy, which has emerged as a formidable competitor to both transformer and CNN structures. In addition to showcasing superior outcomes in sequence modeling, Mamba has been seamlessly incorporated into the visual domain [70, 38, 31, 62]. These studies often rely on handcrafted fixed scanning mechanisms to mitigate the execution bias of the selective state space model on 2D non-causal images. However, such simplistic approaches cannot effectively capture spatial relationships in an input-dependent paradigm. To address this limitation, we propose an effective framework GrootVL in this work to enhance long-range modeling for both vision and language tasks by introducing an input-aware tree-based topological structure.
3 Method
In this section, we first revisit the selective state space model [18] and then elaborate on our input-aware topology scanning algorithm for state space modeling. With this superior algorithm, we develop a tree SSM and propose a novel framework called GrootVL, which consists of two sub-networks: GrootV for visual tasks and GrootL for fine-tuning a pre-trained language model [18].
3.1 Revisiting Selective State Space Model
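State Space Models (SSMs) are commonly regarded as continuous linear time-invariant systems [59] that map input stimulation $x(t) \in \mathbb{R}^{1 \times D}$ to output signal $y(t) \in \mathbb{R}^{1 \times D}$ through a state vector $h(t) \in \mathbb{R}^{1 \times N}$, where $t$, $D$ and $N$ indicate the time step, channel number of the signal and state size, respectively. These models can be formulated as the following linear ordinary differential equations:
$$h'(t) = \mathbf{A} h(t) + \mathbf{B} x(t), \quad y(t) = \mathbf{C} h(t) + \mathbf{D} x(t), \tag{1}$$
where $\mathbf{A}$ is the state matrix, $\mathbf{B}$ the input matrix, $\mathbf{C}$ the output matrix and $\mathbf{D}$ the feedthrough coefficient.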
Discretization.
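Although SSM serves as a powerful tool in systems and control engineering, its time-continuous nature poses challenges for integration into deep learning architectures. To alleviate this issue, most methods utilize the zero-order hold rule [18] to discretize the continuous system described by Eq. 1 and convert continuous variables ($\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$, $\mathbf{D}$) into corresponding discrete parameters ($\bar{\mathbf{A}}$, $\bar{\mathbf{B}}$, $\bar{\mathbf{C}}$, $\bar{\mathbf{D}}$) over the specified sampling time-scale $\Delta$:
$$h_i = \bar{\mathbf{A}} h_{i-1} + \bar{\mathbf{B}} x_i, \quad y_i = \bar{\mathbf{C}} h_i + \bar{\mathbf{D}} x_i, \quad \bar{\mathbf{A}} = e^{\Delta \mathbf{A}}, \quad \bar{\mathbf{B}} = (e^{\Delta \mathbf{A}} - \mathbf{I}) \mathbf{A}^{-1} \mathbf{B}, \quad \bar{\mathbf{C}} = \mathbf{C}, \quad \bar{\mathbf{D}} = \mathbf{D}. \tag{2}$$
In addition, many improved methods [38, 18] use an approximation of $\bar{\mathbf{B}}$ based on the first-order Taylor Series:
$$\bar{\mathbf{B}} = (e^{\Delta \mathbf{A}} - \mathbf{I}) \mathbf{A}^{-1} \mathbf{B} \approx (\Delta \mathbf{A}) (\Delta \mathbf{A})^{-1} \Delta \mathbf{B} = \Delta \mathbf{B}. \tag{3}$$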
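As a minimal sketch of the discretization step, the snippet below applies the zero-order hold rule to a single diagonal entry and compares the resulting discrete input coefficient with its first-order approximation. It assumes the standard zero-order hold form $\bar{A} = e^{\Delta A}$ and $\bar{B} = (e^{\Delta A} - 1) A^{-1} B$ for a scalar $A < 0$; the function names and sample values are illustrative and not taken from the paper's code.

```python
import math


def zoh_discretize(a, b, delta):
    """Zero-order hold for one scalar (diagonal) SSM entry; returns (a_bar, b_bar)."""
    a_bar = math.exp(delta * a)
    # (exp(delta * a) - 1) / a * b; expm1 keeps precision when delta * a is small.
    b_bar = math.expm1(delta * a) / a * b
    return a_bar, b_bar


def taylor_b_bar(b, delta):
    """First-order approximation of b_bar: simply delta * b."""
    return delta * b


a, b = -1.0, 1.0  # illustrative values, assumed for this sketch
for delta in (0.001, 0.01, 0.1):
    _, b_zoh = zoh_discretize(a, b, delta)
    b_approx = taylor_b_bar(b, delta)
    rel_err = abs(b_zoh - b_approx) / b_zoh
    print(f"delta={delta}: zoh={b_zoh:.6f} approx={b_approx:.6f} rel_err={rel_err:.4f}")
```

The relative error grows with the product of the step size and the magnitude of the state entry, so the approximation is tightest for small time-scales.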
Selective Mechanism.
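Previous SSMs store information through finite states and inherent time-invariance, which limits their effectiveness. Therefore, Mamba [18] introduces a dynamic mechanism to selectively filter out input into a sequential state. Specifically, it utilizes Linear Projection to calculate the parameters $\{\mathbf{B}_i\}_{i=1}^{L}$, $\{\mathbf{C}_i\}_{i=1}^{L}$ and $\{\Delta_i\}_{i=1}^{L}$ from the input sequence $\{x_i\}_{i=1}^{L}$ with $x_i \in \mathbb{R}^{1 \times D}$ directly to improve the context-aware ability. Then the output sequence $\{y_i\}_{i=1}^{L}$ can be computed with those input-adaptive discretized parameters as follows:
$$h_i = \bar{\mathbf{A}}_i h_{i-1} + \bar{\mathbf{B}}_i x_i, \quad y_i = \mathbf{C}_i h_i + \mathbf{D} x_i. \tag{4}$$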
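The selective recurrence can be sketched for a single channel as follows. This is an illustrative sequential loop, not Mamba's hardware-aware parallel scan: it assumes a diagonal state matrix, takes the input-dependent `deltas`, `Bs` and `Cs` as given instead of computing them by linear projection, and uses the first-order approximation of the discrete input coefficient. All names are hypothetical.

```python
import math


def selective_scan(xs, deltas, Bs, Cs, A, D):
    """Sequential selective recurrence for one channel.

    h_i = a_bar_i * h_{i-1} + b_bar_i * x_i,   y_i = <C_i, h_i> + D * x_i

    xs, deltas: length-L lists of floats.
    Bs, Cs:     length-L lists of length-N lists (input-dependent parameters).
    A:          length-N list holding the diagonal of the state matrix.
    D:          float skip (feedthrough) coefficient.
    """
    n = len(A)
    h = [0.0] * n
    ys = []
    for x, delta, B, C in zip(xs, deltas, Bs, Cs):
        for k in range(n):
            a_bar = math.exp(delta * A[k])  # zero-order hold on the diagonal entry
            b_bar = delta * B[k]            # first-order approximation of b_bar
            h[k] = a_bar * h[k] + b_bar * x
        ys.append(sum(C[k] * h[k] for k in range(n)) + D * x)
    return ys


# Two time steps, state size 2 (values assumed for illustration).
outputs = selective_scan(
    xs=[2.0, 0.0],
    deltas=[0.5, 0.5],
    Bs=[[1.0, 2.0], [1.0, 2.0]],
    Cs=[[1.0, 1.0], [1.0, 1.0]],
    A=[-1.0, -2.0],
    D=0.5,
)
print(outputs)
```

Because the step size, input and output coefficients change at every step, the state update is input-dependent, which is the property the selection mechanism relies on.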
3.2 Tree State Space Model
Mamba [18] has showcased remarkable performance in modeling the dependencies of consecutive words in a sequence. However, its applicability in long-context tasks, especially visual modeling, still poses certain challenges. For visual tasks, many methods attempt to address this problem by employing fixed scanning strategies, such as multi-directional raster scan [38, 70], local scan [31], and continuous scan [62]. However, these handcrafted scanning methods fail to effectively preserve the 2D structural information of images.
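To make the limitation of fixed scanning concrete, the sketch below compares a raster scan with a simple serpentine scan on a small grid. The serpentine order is only a stand-in for a continuous scanning strategy and is not claimed to match any cited implementation; function names are hypothetical. It measures two quantities: the largest spatial jump between consecutive sequence elements, and the largest sequence distance between two spatially adjacent pixels.

```python
def raster_order(height, width):
    """Row-major scan: every row is read left to right."""
    return [(r, c) for r in range(height) for c in range(width)]


def snake_order(height, width):
    """Serpentine scan: the direction is reversed on every other row."""
    order = []
    for r in range(height):
        cols = range(width) if r % 2 == 0 else range(width - 1, -1, -1)
        order.extend((r, c) for c in cols)
    return order


def max_spatial_jump(order):
    """Largest Manhattan distance between consecutive sequence elements."""
    return max(
        abs(r1 - r0) + abs(c1 - c0)
        for (r0, c0), (r1, c1) in zip(order, order[1:])
    )


def max_sequence_gap(order):
    """Largest sequence distance between two four-connected neighbours."""
    pos = {p: i for i, p in enumerate(order)}
    gaps = []
    for (r, c), i in pos.items():
        for nb in ((r + 1, c), (r, c + 1)):
            if nb in pos:
                gaps.append(abs(pos[nb] - i))
    return max(gaps)


for name, fn in (("raster", raster_order), ("snake", snake_order)):
    order = fn(4, 4)
    print(name, max_spatial_jump(order), max_sequence_gap(order))
```

On a 4×4 grid the serpentine order removes the spatial jump at row ends, but vertically adjacent pixels can still be far apart in the sequence, which illustrates why a fixed trajectory cannot keep all spatial neighbours close.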
Following the design in Mamba [18], we construct a transform block as a tree state space model, which is presented in Fig. 2. The only difference between our block and Mamba lies in the replacement of the structured state space block with the proposed tree scanning algorithm. In the tree scanning algorithm, we generate a tree topology and then propagate the state of each vertex along the topological path to obtain strong feature representations. In addition, our algorithm can effectively enhance language representations by incorporating such a tree topology during text processing, which overcomes the geometrical constraints of text sequences. In the following, we elaborate on the proposed tree scanning algorithm and its applications for multi-modal tasks.
Tree Scanning Algorithm.
Given an input feature $X = \{x_i\}_{i=1}^{L}$ where $L$ is the sequence length (or the number of input pixels), we construct an undirected $m$-connected graph $G = (V, E)$ for the feature. $m$ is a hyper-parameter that indicates the number of adjacent tokens. Following [64, 50], we set $m = 4$ for visual tasks, meaning each pixel is connected to its four neighboring pixels. For language tasks, we set $m = 3$ by default, meaning each token is connected to the previous three tokens. In addition, the vertices $V$ represent the pixel (or token) embeddings, and $E$ indicates the edges of the graph. The edge weight is calculated by the feature dissimilarity between adjacent vertices. Besides, the metric of dissimilarity uses cosine distance by default, and a comparison with other metrics is given in Table 6.
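The graph construction above, followed by a minimum spanning tree over its edges, can be sketched in plain Python. Note the substitution: this sketch uses Kruskal's algorithm with union-find in place of the Borůvka variant, since any minimum spanning tree algorithm yields a tree of the same total weight (the tree itself may differ when weights tie). The helper names and the 2×2 example are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Dissimilarity between two feature vectors: 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv + 1e-12)


def grid_edges(feats, height, width):
    """Edges (weight, i, j) of the four-connected grid, row-major pixel indices."""
    edges = []
    for r in range(height):
        for c in range(width):
            i = r * width + c
            if c + 1 < width:  # right neighbour
                edges.append((cosine_distance(feats[i], feats[i + 1]), i, i + 1))
            if r + 1 < height:  # bottom neighbour
                j = i + width
                edges.append((cosine_distance(feats[i], feats[j]), i, j))
    return edges


def minimum_spanning_tree(num_vertices, edges):
    """Kruskal with union-find (stand-in for Boruvka); returns kept (i, j) edges."""
    parent = list(range(num_vertices))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    tree = []
    for _, i, j in sorted(edges):
        ri, rj = find(i), find(j)
        if ri != rj:  # keep the edge only if it joins two components
            parent[ri] = rj
            tree.append((i, j))
    return tree


# 2x2 image: the top row and the bottom row hold two different "objects".
feats = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
tree = minimum_spanning_tree(4, grid_edges(feats, 2, 2))
print(sorted(tree))
```

In this toy example the most dissimilar edge, between pixels 0 and 2, is pruned, so similar pixels stay directly connected while the two regions are joined through their least dissimilar boundary edge.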
We use the Contractive Borůvka algorithm [2] to prune the edges with significant dissimilarity, which generates a minimum spanning tree (MST) $\mathcal{G}_T$ whose sum of dissimilarity weights is minimum out of all spanning trees. In the propagation process, we iteratively traverse each vertex, treating it as the root, and aggregate the features of the remaining vertices. Intuitively, applying state propagation within such a geometric configuration favors interactions among vertices with small spatial and feature distances. Following Mamba, we employ the data-dependent transition matrix for state propagation. For a vertex $k$, we denote the transition matrix with its parent as $\bar{\mathbf{A}}_k$. Furthermore, following Eq. 4, the state aggregation process for the $i$-th vertex can be formulated as:
$$h_i = \sum_{\forall j \in \Omega} S(E_{ij}) \bar{\mathbf{B}}_j x_j, \quad S(E_{ij}) = \prod_{k \in N_{ij}} \bar{\mathbf{A}}_k, \tag{5}$$
where $\Omega$ denotes the index set of all vertices in the tree. $S(E_{ij})$ represents the path weight of hyperedge $E_{ij}$ traced from the $j$-th vertex to the $i$-th vertex in the tree $\mathcal{G}_T$, and $N_{ij}$ indicates the index set of all vertices on this hyperedge. For visual tasks, we iterate over each vertex, treating it as the root of the spanning tree $\mathcal{G}_T$, and aggregate the states from the other vertices, thereby obtaining the transformed states $\{h_i\}_{i=1}^{L}$. For textual tasks, because of the causal prediction manner in large language models, we only take the last token as root and aggregate from other tokens. To achieve end-to-end training, we derive the derivative of the output hidden state $h_i$ to the input variables $\bar{\mathbf{A}}_k$, $\bar{\mathbf{B}}_j$ and $x_j$ as follows:
(6) |
(7) |
where indicates the children of vertex in tree whose root is the vertex , and denotes the parent of vertex in Eq. 7. Finally, the output feature can be formulated as:
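$$Y = \mathbf{C} \odot \mathrm{Norm}(H) + \mathbf{D} \odot X, \tag{8}$$
where $Y$, $H$ and $X$ indicate the stack of $\{y_i\}_{i=1}^{L}$, $\{h_i\}_{i=1}^{L}$ and $\{x_i\}_{i=1}^{L}$ respectively. $\odot$ denotes the element-wise multiplication.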
Efficient Implementation for Multi-Modality.
For visual tasks, the tree scanning algorithm requires two levels of traversal across the entire pixel set, resulting in an unacceptable quadratic complexity $\mathcal{O}(L^2)$ relative to the number of pixels $L$. To alleviate this issue, we utilize a dynamic programming procedure to accelerate the inference and training processes as elaborated in Algorithm 1, which results in linear complexity $\mathcal{O}(L)$. For textual tasks, we perform a unidirectional aggregation approach (shown in Algorithm 2 of Appendix B) in adherence to the causal nature of language. Moreover, we provide the back-propagation process for both Vision Tree Scanning and Language Tree Scanning processes, whose detailed proofs are given in Appendix C.
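The idea behind the linear-time procedure can be illustrated with a two-pass dynamic program on a tree. The sketch below is not a reproduction of Algorithm 1: it makes the simplifying assumptions that each tree edge carries one symmetric scalar transition weight, that the path weight is the product of edge weights (and 1 for a vertex with itself), and that `values[j]` stands for the already-computed input term of vertex `j`. A quadratic reference is included so the two results can be compared. All names are hypothetical.

```python
from collections import deque


def build_adjacency(n, edges):
    adj = [[] for _ in range(n)]
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    return adj


def tree_aggregate_bruteforce(n, edges, values):
    """O(n^2) reference: every vertex is taken as root and the tree is traversed."""
    adj = build_adjacency(n, edges)
    out = []
    for root in range(n):
        total, stack = 0.0, [(root, -1, 1.0)]
        while stack:
            u, parent, path_w = stack.pop()
            total += path_w * values[u]
            for v, w in adj[u]:
                if v != parent:
                    stack.append((v, u, path_w * w))
        out.append(total)
    return out


def tree_aggregate_linear(n, edges, values):
    """O(n) two-pass version: leaves-to-root, then root-to-leaves."""
    adj = build_adjacency(n, edges)
    parent, parent_w, order = [-1] * n, [1.0] * n, []
    seen = [False] * n
    seen[0] = True
    queue = deque([0])
    while queue:  # breadth-first order from an arbitrary root (vertex 0)
        u = queue.popleft()
        order.append(u)
        for v, w in adj[u]:
            if not seen[v]:
                seen[v], parent[v], parent_w[v] = True, u, w
                queue.append(v)

    # Pass 1 (bottom-up): up[u] aggregates the subtree hanging below u.
    up = list(values)
    for u in reversed(order):
        if parent[u] >= 0:
            up[parent[u]] += parent_w[u] * up[u]

    # Pass 2 (top-down): add what reaches u through its parent, removing
    # the part of the parent's total that came from u's own subtree.
    h = [0.0] * n
    for u in order:
        if parent[u] < 0:
            h[u] = up[u]
        else:
            w = parent_w[u]
            h[u] = up[u] + w * (h[parent[u]] - w * up[u])
    return h


edges = [(0, 1, 0.5), (1, 2, 0.25), (1, 3, 0.8), (3, 4, 0.1)]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
print(tree_aggregate_bruteforce(5, edges, values))
print(tree_aggregate_linear(5, edges, values))
```

Each vertex is visited a constant number of times in the two passes, which is what removes the per-root traversal of the quadratic version; this works only because the graph is acyclic, so every pair of vertices is joined by exactly one path.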
3.3 Application for Vision and Language
GrootV
Given an image with a shape of $H \times W \times 3$, our goal is to obtain high-quality visual features for downstream tasks. To this end, we propose an effective vision architecture GrootV which consists of a stem module, several basic blocks and downsampling layers to generate hierarchical representations, as illustrated in Fig. 3. Overall, our GrootV comprises four stages similar to previous general vision backbones [41, 40, 57, 38]. We integrate the stem module before the first stage to decrease the resolution of the input image signal by a factor of $4$, resulting in a feature map with a shape of $\frac{H}{4} \times \frac{W}{4} \times C$. It includes two convolutions, two Layer Normalization (LN) layers and one GELU activation function. The kernel size for both convolutions is $3 \times 3$ with a stride of $2$ and padding of $1$. Similarly, a downsampling layer consists of a $3 \times 3$ convolution with a stride of $2$ and padding of $1$ and an LN layer. Positioned between two stages, it serves to downsample the input feature map by a factor of $2$. Motivated by [57, 38], we devise a residual block with skip connections to integrate our fundamental Tree State Space Model in Sec. 3.2. In detail, we first normalize the input features with an LN layer. Spatial priors and long-range dependencies are then obtained through our tree scanning algorithm with residual connections established alongside the input features. Finally, a feedforward neural network is utilized to project the normalized features to output signals as shown in Fig. 3. Based on the above components, we develop our GrootV in three scales, i.e., GrootV-Tiny, GrootV-Small and GrootV-Base.
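The resolution schedule implied by the stem and the downsampling layers can be checked with a few lines of arithmetic. The sketch assumes every convolution uses kernel size 3, stride 2 and padding 1, and a 224×224 input chosen only as an example; the function names are hypothetical.

```python
def conv_out_size(size, kernel=3, stride=2, padding=1):
    """Spatial output size of a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def stage_resolutions(height, width, num_stages=4):
    """Feature-map size entering each stage of a four-stage backbone."""
    h, w = height, width
    for _ in range(2):  # stem: two strided convolutions, overall factor 4
        h, w = conv_out_size(h), conv_out_size(w)
    sizes = [(h, w)]
    for _ in range(num_stages - 1):  # one downsampling layer between stages
        h, w = conv_out_size(h), conv_out_size(w)
        sizes.append((h, w))
    return sizes


print(stage_resolutions(224, 224))
```

Under these assumptions the four stages operate at 1/4, 1/8, 1/16 and 1/32 of the input resolution, the usual hierarchy for backbones of this kind.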
GrootL
Recurrent neural networks rely on fixed memory to preserve past information, which poses limitations when handling long contexts where relevant words are distant from the current moment. While Mamba [18] employs a selection mechanism to enhance context awareness, its fixed memory size cannot expand over time, resulting in restricted state space. Therefore, the ability to extrapolate decreases during scrolling as the prompt extends. To mitigate this issue, we propose an effective fine-tuning paradigm. Specifically, the tree-based topology branch is built upon one-way scrolling with a scaling factor, enabling state transitions within such a structure. This arrangement facilitates the preferential interaction of semantically related tokens. It is noteworthy that this paradigm does not introduce any additional training parameters. Instead, it utilizes pretrained state transformation parameters to conduct semantic aggregation by incorporating topological structures. Experimental results demonstrate the effectiveness of our approach.
Method | Type | #Param. | #FLOPs | Top-1 Acc. (%) |
Deit-S [54] | T | 22M | 4.6G | 79.9 |
Swin-T [40] | T | 28M | 4.6G | 81.3 |
CoAtNet-0 [11] | T | 25M | 4.0G | 81.6 |
SG-Former-S [46] | T | 23M | 4.8G | 83.2 |
ConvNeXt-T [41] | C | 29M | 4.5G | 82.1 |
SLaK-T [37] | C | 30M | 5.0G | 82.5 |
UniRepLKNet-T [13] | C | 31M | 4.9G | 83.2 |
InternImage-T [57] | C | 30M | 5.0G | 83.5 |
ViM-S [70] | S | 26M | 5.1G | 80.5 |
LocalViM-S [31] | S | 28M | 4.8G | 81.2 |
PlainMamba-L2 [62] | S | 25M | 8.1G | 81.6 |
Mamba-2D-S [34] | S | 24M | - | 81.7 |
S4ND-ConvNeXt-T [44] | S | 30M | - | 82.2 |
VMamba-T [38] | S | 31M | 4.9G | 82.5 |
LocalVMamba-T [31] | S | 26M | 5.7G | 82.7 |
GrootV-T (Ours) | S | 30M | 4.8G | 83.4 |
Swin-S [40] | T | 50M | 8.7G | 83.0 |
CoAtNet-1 [11] | T | 42M | 8.0G | 83.3 |
ConvNeXt-S [41] | C | 50M | 8.7G | 83.1 |
SLaK-S [37] | C | 55M | 9.8G | 83.8 |
UniRepLKNet-S [13] | C | 56M | 9.1G | 83.9 |
InternImage-S [57] | C | 50M | 8.0G | 84.2 |
HyenaViT-B [16] | S | 88M | - | 78.5 |
S4ND-ViT-B [44] | S | 89M | - | 80.4 |
PlainMamba-L3 [62] | S | 50M | 14.4G | 82.3 |
VMamba-S [38] | S | 50M | 8.7G | 83.6 |
LocalVMamba-S [31] | S | 50M | 11.4G | 83.7 |
GrootV-S (Ours) | S | 51M | 8.5G | 84.2 |
Deit-B [54] | T | 86M | 55.4G | 83.1 |
Swin-B [40] | T | 88M | 15.4G | 83.5 |
CoAtNet-2 [11] | T | 75M | 16.0G | 84.1 |
ConvNeXt-B [41] | C | 89M | 15.4G | 83.8 |
SLaK-B [37] | C | 95M | 17.0G | 84.0 |
Mamba-2D-B [34] | S | 92M | - | 83.0 |
VMamba-B [38] | S | 89M | 15.4G | 83.9 |
GrootV-B (Ours) | S | 91M | 15.1G | 84.8 |
4 Experiments
We conduct extensive experiments to evaluate the effectiveness of GrootV and compare it with advanced CNN-based, Transformer-based, and SSM-based models covering various downstream tasks, including image classification, object detection and semantic segmentation. Furthermore, we validate the capability of GrootL in the field of natural language understanding.
4.1 Image Classification
Settings.
We assess the classification performance of GrootV on the ImageNet-1k dataset [12]. Following previous practices [40, 41, 57, 38], all GrootV models are trained for epochs from scratch using AdamW optimizer with a warm-up strategy of epochs. During training, we utilize a Cosine Scheduler with an initial learning rate of and weight decay of . In addition, the exponential moving average (EMA) is also applied.
Results.
The comparison results summarized in Table 1 show GrootV leading all SSM-based models and competitive with advanced CNNs and Transformers across tiny, small, and base scales. Specifically, GrootV-T achieves 83.4% Top-1 Acc., boosting ViM-S by 2.9%, LocalViM-S by 2.2%, PlainMamba-L2 by 1.8% and VMamba-T by 0.9% with similar FLOPs. Additionally, it surpasses ConvNeXt-T by 1.3% and Swin-T by 2.1%, demonstrating the effectiveness of our method.
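The margins over the tiny-scale baselines follow directly from the Top-1 column of Table 1. The short snippet below recomputes them; the accuracy values are copied from the table and the variable names are illustrative.

```python
# Top-1 accuracy (%) of tiny-scale models, copied from Table 1.
grootv_t = 83.4
baselines = {
    "ViM-S": 80.5,
    "LocalViM-S": 81.2,
    "PlainMamba-L2": 81.6,
    "VMamba-T": 82.5,
    "ConvNeXt-T": 82.1,
    "Swin-T": 81.3,
}

# Absolute gain of GrootV-T over each baseline, rounded to one decimal.
gains = {name: round(grootv_t - acc, 1) for name, acc in baselines.items()}

for name, gain in gains.items():
    print(f"GrootV-T vs {name}: +{gain:.1f}")
```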
4.2 Object Detection
Settings.
We verify the detection performance of GrootV on the MSCOCO 2017 dataset [36] with MMDetection library [3]. We follow previous works [38, 57, 40, 31, 49, 51, 67, 63, 6] to validate object detection and instance segmentation tasks with Mask-RCNN [26]. Specifically, we adopt the AdamW optimizer with a learning rate of and batch size of to optimize the model built upon our pre-trained classification backbones on ImageNet-1K. The training schedules include ( epochs) and ( epochs) with multi-scale data augmentation.
Results.
As depicted in Table 7 (in Appendix A.), our method outperforms existing methods on most evaluation metrics, especially for instance segmentation. Under schedule, GrootV-T achieves in box mAP (APb), which is points higher than ViM-S and points higher than VMamba-T. It is worth noting that GrootV-T outperforms ViM-S by points with schedule and LocalVMamba-T by points with schedule in mask mAP (APm). Moreover, the best APb and APm are obtained by GrootV-S in schedule with multi-scale training.
Method | Type | #FLOPs | mIoU (SS) | mIoU (MS) |
Swin-T [40] | T | 945G | 44.5 | 45.8 |
ConvNeXt-T [41] | C | 939G | 46.0 | 46.7 |
SLaK-T [37] | C | 936G | 47.6 | - |
InternImage-T [57] | C | 944G | 47.9 | 48.1 |
UniRepLKNet-T [13] | C | 946G | 48.6 | 49.1 |
ViM-S [70] | S | - | 44.9 | - |
LocalViM-S [31] | S | 297G | 46.4 | 47.5 |
PlainMamba-L2 [62] | S | 285G | 46.8 | - |
VMamba-T [38] | S | 964G | 47.3 | 48.3 |
LocalVMamba-T [31] | S | 970G | 47.9 | 49.1 |
GrootV-T (Ours) | S | 941G | 48.5 | 49.4 |
Swin-S [40] | T | 1038G | 47.6 | 49.5 |
ConvNeXt-S [41] | C | 1027G | 48.7 | 49.6 |
SLaK-S [37] | C | 1028G | 49.4 | - |
InternImage-S [57] | C | 1017G | 50.1 | 50.9 |
UniRepLKNet-S [13] | C | 1036G | 50.5 | 51.0 |
PlainMamba-L3 [62] | S | 419G | 49.1 | - |
VMamba-S [38] | S | 1081G | 49.5 | 50.5 |
LocalVMamba-S [31] | S | 1095G | 50.0 | 51.0 |
GrootV-S (Ours) | S | 1019G | 50.7 | 51.7 |
4.3 Semantic Segmentation
Settings.
Results.
Our method performs exceptionally well on segmentation tasks, as shown in Table 2. GrootV-T yields a clear improvement of 3.6% in single-scale mIoU compared to ViM-S and 1.9% in multi-scale mIoU compared to LocalViM-S. Furthermore, GrootV-S boosts InternImage-S by 0.6% and 0.8% in single-scale and multi-scale mIoU respectively. We consider the preservation of intricate structural details through tree topology scanning to be particularly advantageous for segmentation tasks that require pixel-level perception.
Method | PIQA | Arc-E | SST | WG | L-ppl | Race | BQA | Average Acc. |
Mamba [18] | 64.5 | 48.0 | 65.6 | 51.8 | 16.1 | 27.4 | 16.8 | 45.7 |
+ LoRA [30] | 64.7 | 48.3 | 65.1 | 52.2 | 17.7 | 28.6 | 17.8 | 46.1 |
+ GrootL (Ours) | 65.0 | 49.8 | 69.5 | 51.1 | 15.9 | 28.9 | 19.2 | 47.2 |
4.4 Language Understanding
We regard Mamba [18] with 130M parameters as the base model. To verify the effectiveness of our GrootL in natural language understanding, we first fine-tune pre-trained Mamba via LoRA [30] and GrootL under the same setting with the Alpaca data [53], which contains 52K instruction tuning samples for supervised fine-tuning. Then we utilize popular language benchmarks provided in the open-sourced lm-evaluation-harness project [17] for evaluation, including PIQA [1], AI2-ARC [8], SST [55], WinoGrande, LAMBADA [45], Race [33] and Openbookqa [43]. The results in Table 3 demonstrate that our GrootL provides a benefit of 1.1% in average Acc. compared to LoRA. Owing to the short prompt length of the WinoGrande dataset, performance on it degrades by a marginal gap.
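The table does not state how the Average Acc. column is computed. A plausible reading, assumed here, is the mean of the six accuracy columns with the LAMBADA perplexity (L-ppl) left out, since perplexity is not an accuracy. The snippet below checks that this assumption reproduces the reported averages up to rounding; the scores are copied from Table 3 and the names are illustrative.

```python
# Accuracy columns of Table 3 (L-ppl is a perplexity and is left out).
scores = {
    "Mamba": {"PIQA": 64.5, "Arc-E": 48.0, "SST": 65.6, "WG": 51.8, "Race": 27.4, "BQA": 16.8},
    "+ LoRA": {"PIQA": 64.7, "Arc-E": 48.3, "SST": 65.1, "WG": 52.2, "Race": 28.6, "BQA": 17.8},
    "+ GrootL": {"PIQA": 65.0, "Arc-E": 49.8, "SST": 69.5, "WG": 51.1, "Race": 28.9, "BQA": 19.2},
}
reported_average = {"Mamba": 45.7, "+ LoRA": 46.1, "+ GrootL": 47.2}


def mean_accuracy(task_scores):
    """Unweighted mean over the accuracy-based benchmarks."""
    return sum(task_scores.values()) / len(task_scores)


for method, task_scores in scores.items():
    mean = mean_accuracy(task_scores)
    print(f"{method}: mean={mean:.2f} reported={reported_average[method]}")
```

Under this reading, the gain of GrootL over LoRA in the average is driven mainly by SST, Arc-E and BQA, while WinoGrande is the only benchmark where it trails.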
Scanning Strategy | Acc. |
Raster Scan | 82.6 |
Cross Scan | 83.1 |
Tree Topology Scan | 83.4 |
Distance Metric | Acc. |
82.9 | |
83.2 | |
83.4 |
Root Setting | Acc. |
First vertex | 82.9 |
Last vertex | 83.0 |
All vertices | 83.4 |
4.5 Ablation Study & Qualitative Results
In this section, we conduct analysis experiments on ImageNet-1K dataset and present some visual results to illustrate the effectiveness of our algorithm.
Scanning Strategy.
We conduct a head-to-head comparison of different scanning strategies, as shown in Table 6. The tree topology scanning outperforms raster scan and cross scan by 0.8% and 0.3% respectively, highlighting the superiority of our algorithm in vision recognition.
Distance Metric.
Before generating a minimum spanning tree from a connected graph, it is important to measure the edge weights between vertices. Therefore, we validate several distance metrics as illustrated in Table 6. The results indicate that cosine distance most effectively represents the relationship between vertices, performing 0.5% and 0.2% better than the two alternative metrics.
Root Setting.
We traverse all vertices, treating each as a root, and perform state transitions along the topological path from the other vertices toward the root. This traversal ensures that each vertex captures long-range dependencies. To verify the effectiveness of this operation, we consider only the first or the last vertex as the root in Table 6. The results show reductions of 0.5% and 0.4%, respectively.
Qualitative Results.
To better illustrate the superiority of our scanning strategy, we visualize the affinity maps of different positions marked by the red cross in each input image. For example, we set the anchor point in the upper left corner of the sky as shown in the second row of Fig. 4(a). Our method can easily identify white houses, flagpoles, and the sky, which raster scanning fails to achieve. This demonstrates the capability of our algorithm to preserve detailed structural information. More comparisons can be seen in Fig. 6 (in Appendix D).
5 Conclusion & Limitations
In this paper, we propose a tree state space model to perform feature propagation on an input-aware topology. Besides, we introduce a linear complexity dynamic programming algorithm to enhance long-range interactions without increasing computational cost. With the proposed techniques, we establish the general multi-modal networks to break the original sequence constraints and achieve stronger representation capabilities. Extensive experiments demonstrate the effectiveness of our method in both visual and language tasks. The limitation of our method is that the tree structure is not a common paradigm, and it needs to be specifically optimized according to the hardware device.
References
- [1] Bisk, Y., Zellers, R., Gao, J., Choi, Y., et al.: Piqa: Reasoning about physical commonsense in natural language. In: AAAI. pp. 7432–7439 (2020)
- [2] Borůvka, O.: O jistém problému minimálním (1926)
- [3] Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Xu, J., et al.: Mmdetection: Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155 (2019)
- [4] Chen, Z., Duan, Y., Wang, W., He, J., Lu, T., Dai, J., Qiao, Y.: Vision transformer adapter for dense predictions. arXiv preprint arXiv:2205.08534 (2022)
- [5] Cheng, C., Song, L., Xue, R., Wang, H., Sun, H., Ge, Y., Shan, Y.: Meta-adapter: An online few-shot learner for vision-language model. arXiv preprint arXiv:2311.03774 (2023)
- [6] Cheng, T., Song, L., Ge, Y., Liu, W., Wang, X., Shan, Y.: Yolo-world: Real-time open-vocabulary object detection. arXiv preprint arXiv:2401.17270 (2024)
- [7] Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014)
- [8] Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., Tafjord, O.: Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457 (2018)
- [9] Cubuk, E.D., Zoph, B., Mane, D., Vasudevan, V., Le, Q.V.: Autoaugment: Learning augmentation strategies from data. In: CVPR. pp. 113–123 (2019)
- [10] Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., Wei, Y.: Deformable convolutional networks. In: ICCV. pp. 764–773 (2017)
- [11] Dai, Z., Liu, H., Le, Q.V., Tan, M.: Coatnet: Marrying convolution and attention for all data sizes. NeurIPS 34, 3965–3977 (2021)
- [12] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: CVPR. pp. 248–255. Ieee (2009)
- [13] Ding, X., Zhang, Y., Ge, Y., Zhao, S., Song, L., Yue, X., Shan, Y.: Unireplknet: A universal perception large-kernel convnet for audio, video, point cloud, time-series and image recognition. CVPR (2023)
- [14] Dong, X., Bao, J., Chen, D., Zhang, W., Yu, N., Yuan, L., Chen, D., Guo, B.: Cswin transformer: A general vision transformer backbone with cross-shaped windows. In: CVPR. pp. 12124–12134 (2022)
- [15] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021)
- [16] Fu, D., Arora, S., Grogan, J., Johnson, I., Eyuboglu, E.S., Thomas, A., Spector, B., Poli, M., Rudra, A., Ré, C.: Monarch mixer: A simple sub-quadratic gemm-based architecture. NeurIPS 36 (2023)
- [17] Gao, L., Tow, J., Abbasi, B., Biderman, S., Black, S., DiPofi, A., Foster, C., Golding, L., Hsu, J., Le Noac’h, A., Li, H., McDonell, K., Muennighoff, N., Ociepa, C., Phang, J., Reynolds, L., Schoelkopf, H., Skowron, A., Sutawika, L., Tang, E., Thite, A., Wang, B., Wang, K., Zou, A.: A framework for few-shot language model evaluation (12 2023)
- [18] Gu, A., Dao, T.: Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752 (2023)
- [19] Gu, A., Dao, T., Ermon, S., Rudra, A., Ré, C.: Hippo: Recurrent memory with optimal polynomial projections. NeurIPS 33, 1474–1487 (2020)
- [20] Gu, A., Goel, K., Gupta, A., Ré, C.: On the parameterization and initialization of diagonal state space models. NeurIPS 35, 35971–35983 (2022)
- [21] Gu, A., Goel, K., Ré, C.: Efficiently modeling long sequences with structured state spaces. In: ICLR (2022)
- [22] Gu, A., Johnson, I., Goel, K., Saab, K., Dao, T., Rudra, A., Ré, C.: Combining recurrent, convolutional, and continuous-time models with linear state space layers. NeurIPS 34, 572–585 (2021)
- [23] Gupta, A., Gu, A., Berant, J.: Diagonal state spaces are as effective as structured state spaces. NeurIPS 35, 22982–22994 (2022)
- [24] Han, K., Wang, Y., Xu, C., Guo, J., Xu, C., Wu, E., Tian, Q.: Ghostnets on heterogeneous devices via cheap operations. IJCV 130(4), 1050–1069 (2022)
- [25] Hasani, R., Lechner, M., Wang, T.H., Chahine, M., Amini, A., Rus, D.: Liquid structural state-space models. arXiv preprint arXiv:2209.12951 (2022)
- [26] He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask r-cnn. In: ICCV. pp. 2961–2969 (2017)
- [27] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR. pp. 770–778 (2016)
- [28] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural computation 9(8), 1735–1780 (1997)
- [29] Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)
- [30] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: Lora: Low-rank adaptation of large language models. In: ICLR (2022)
- [31] Huang, T., Pei, X., You, S., Wang, F., Qian, C., Xu, C.: Localmamba: Visual state space model with windowed selective scan. arXiv preprint arXiv:2403.09338 (2024)
- [32] Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. NeurIPS 25 (2012)
- [33] Lai, G., Xie, Q., Liu, H., Yang, Y., Hovy, E.H.: RACE: large-scale reading comprehension dataset from examinations. In: EMNLP. pp. 785–794. Association for Computational Linguistics (2017)
- [34] Li, S., Singh, H., Grover, A.: Mamba-nd: Selective state space modeling for multi-dimensional data. arXiv preprint arXiv:2402.05892 (2024)
- [35] Li, Y., Song, L., Chen, Y., Li, Z., Zhang, X., Wang, X., Sun, J.: Learning dynamic routing for semantic segmentation. In: CVPR (2020)
- [36] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: ECCV. pp. 740–755. Springer (2014)
- [37] Liu, S., Chen, T., Chen, X., Chen, X., Xiao, Q., Wu, B., Kärkkäinen, T., Pechenizkiy, M., Mocanu, D., Wang, Z.: More convnets in the 2020s: Scaling up kernels beyond 51x51 using sparsity. arXiv preprint arXiv:2207.03620 (2022)
- [38] Liu, Y., Tian, Y., Zhao, Y., Yu, H., Xie, L., Wang, Y., Ye, Q., Liu, Y.: Vmamba: Visual state space model. arXiv preprint arXiv:2401.10166 (2024)
- [39] Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., Ning, J., Cao, Y., Zhang, Z., Dong, L., et al.: Swin transformer v2: Scaling up capacity and resolution. In: CVPR. pp. 12009–12019 (2022)
- [40] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV. pp. 10012–10022 (2021)
- [41] Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR. pp. 11976–11986 (2022)
- [42] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
- [43] Mihaylov, T., Clark, P., Khot, T., Sabharwal, A.: Can a suit of armor conduct electricity? A new dataset for open book question answering. In: EMNLP. pp. 2381–2391. Association for Computational Linguistics (2018)
- [44] Nguyen, E., Goel, K., Gu, A., Downs, G.W., Shah, P., Dao, T., Baccus, S.A., Ré, C.: S4nd: Modeling images and videos as multidimensional signals using state spaces. arXiv preprint arXiv:2210.06583 (2022)
- [45] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019)
- [46] Ren, S., Yang, X., Liu, S., Wang, X.: Sg-former: Self-guided transformer with evolving token reallocation. In: ICCV. pp. 6003–6014 (2023)
- [47] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: Bengio, Y., LeCun, Y. (eds.) ICLR (2015)
- [48] Smith, J.T., Warrington, A., Linderman, S.W.: Simplified state space layers for sequence modeling. arXiv preprint arXiv:2208.04933 (2022)
- [49] Song, L., Li, Y., Jiang, Z., Li, Z., Sun, H., Sun, J., Zheng, N.: Fine-grained dynamic head for object detection. NeurIPS (2020)
- [50] Song, L., Li, Y., Li, Z., Yu, G., Sun, H., Sun, J., Zheng, N.: Learnable tree filter for structure-preserving feature transform. NeurIPS 32 (2019)
- [51] Song, L., Zhang, S., Yu, G., Sun, H.: Tacnet: Transition-aware context network for spatio-temporal action detection. In: CVPR (2019)
- [52] Song, L., Zhang, S., Liu, S., Li, Z., He, X., Sun, H., Sun, J., Zheng, N.: Dynamic grained encoder for vision transformers. NeurIPS (2021)
- [53] Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., Hashimoto, T.B.: Stanford alpaca: An instruction-following llama model (2023)
- [54] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML. pp. 10347–10357. PMLR (2021)
- [55] Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., Bowman, S.R.: GLUE: A multi-task benchmark and analysis platform for natural language understanding. In: ICLR (2019)
- [56] Wang, J., Song, L., Li, Z., Sun, H., Sun, J., Zheng, N.: End-to-end object detection with fully convolutional network. In: CVPR (2021)
- [57] Wang, W., Dai, J., Chen, Z., Huang, Z., Li, Z., Zhu, X., Hu, X., Lu, T., Lu, L., Li, H., et al.: Internimage: Exploring large-scale vision foundation models with deformable convolutions. In: CVPR. pp. 14408–14419 (2023)
- [58] Wang, W., Xie, E., Li, X., Fan, D.P., Song, K., Liang, D., Lu, T., Luo, P., Shao, L.: Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In: ICCV. pp. 568–578 (2021)
- [59] Williams, R.L., Lawrence, D.A., et al.: Linear state-space control systems. John Wiley & Sons (2007)
- [60] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV. pp. 418–434 (2018)
- [61] Xiao, Y., Luo, Z., Liu, Y., Ma, Y., Bian, H., Ji, Y., Yang, Y., Li, X.: Bridging the gap: A unified video comprehension framework for moment retrieval and highlight detection. CVPR (2024)
- [62] Yang, C., Chen, Z., Espinosa, M., Ericsson, L., Wang, Z., Liu, J., Crowley, E.J.: Plainmamba: Improving non-hierarchical mamba in visual recognition. arXiv preprint arXiv:2403.17695 (2024)
- [63] Yang, J., Song, L., Liu, S., Li, Z., Li, X., Sun, H., Sun, J., Zheng, N.: Dbq-ssd: Dynamic ball query for efficient 3d object detection. arXiv preprint arXiv:2207.10909 (2022)
- [64] Yang, Q.: Stereo matching using tree filtering. IEEE TPAMI 37(4), 834–846 (2014)
- [65] Yang, R., Song, L., Ge, Y., Li, X.: Boxsnake: Polygonal instance segmentation with box supervision. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2023)
- [66] Zhang, S., Song, L., Gao, C., Sang, N.: Glnet: Global local network for weakly supervised action localization. IEEE Transactions on Multimedia 22(10), 2610–2622 (2019)
- [67] Zhang, S., Song, L., Liu, S., Ge, Z., Li, Z., He, X., Sun, J.: Workshop on autonomous driving at cvpr 2021: Technical report for streaming perception challenge. arXiv preprint arXiv:2108.04230 (2021)
- [68] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ade20k dataset. In: CVPR. pp. 633–641 (2017)
- [69] Zhou, H., Yang, R., Zhang, Y., Duan, H., Huang, Y., Hu, R., Li, X., Zheng, Y.: Unihead: unifying multi-perception for detection heads. TNNLS (2023)
- [70] Zhu, L., Liao, B., Zhang, Q., Wang, X., Liu, W., Wang, X.: Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv preprint arXiv:2401.09417 (2024)
Appendix
Appendix A Detailed Training Settings and Results
A.1 Image Classification.
We follow previous works [57, 38, 40] to conduct the experiments. The models are trained with thirty-two 32GB V100 GPUs by default. We set betas and momentum of the AdamW [42, 69, 61] optimizer with and , respectively. During training, we utilize a cosine scheduler with an initial learning rate of and weight decay of . We adopt the common training data augmentation strategies following [31, 57], including AutoAugment [9] with . A MixUp strategy with a ratio of is also adopted in each batch. Horizontal flipping and random resized cropping are also applied during training.
Performance Comparison.
We compare various SSM-based visual foundation models in Fig. 5, where different colors represent different models and different shapes indicate different model scales. The size of each shape indicates the number of model parameters. The horizontal axis denotes FLOPs and the vertical axis denotes the Top-1 accuracy of the corresponding method on the ImageNet-1K validation set. Fig. 5 shows that GrootV offers the most favorable trade-off between accuracy and computational cost among the compared models.
A.2 Object Detection.
For a fair comparison, we conduct the evaluation following common practice [57, 38, 40]. The models are trained with eight 32GB V100 GPUs by default. For the 1× schedule, the input image is resized so that the shorter side is pixels, while the longer side does not exceed pixels. The number of warmup steps is set to in the 1× schedule. For the 3× MS schedule, the shorter side is resized to pixels and the longer side does not exceed pixels. The number of warmup steps is set to in the 3× MS schedule. The results in Table 7 demonstrate the effectiveness of GrootV in object detection and instance segmentation on COCO val2017.
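The resizing rule described above (a target shorter side together with a cap on the longer side) can be sketched as follows. This is a minimal illustration rather than the authors' data pipeline: the function name `resize_keep_ratio` and the sizes used in the example calls are hypothetical and are not the values used in the experiments.

```python
def resize_keep_ratio(width, height, short_side, max_long_side):
    """Return (width, height) after an aspect-preserving resize.

    The shorter side is scaled to `short_side` unless that would push the
    longer side beyond `max_long_side`; in that case the longer side is
    scaled to `max_long_side` instead.
    """
    scale = min(short_side / min(width, height),
                max_long_side / max(width, height))
    return round(width * scale), round(height * scale)


# Hypothetical sizes, chosen only for illustration.
normal = resize_keep_ratio(640, 480, short_side=600, max_long_side=1000)
elongated = resize_keep_ratio(1920, 480, short_side=600, max_long_side=1000)
print(normal, elongated)
```

For an elongated image the cap on the longer side becomes the binding constraint, so the shorter side ends up below the target.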
 | | Mask R-CNN 1× Schedule | | | | | | Mask R-CNN 3× MS Schedule | | | | | |
Method | #FLOPs | AP$^b$ | AP$^b_{50}$ | AP$^b_{75}$ | AP$^m$ | AP$^m_{50}$ | AP$^m_{75}$ | AP$^b$ | AP$^b_{50}$ | AP$^b_{75}$ | AP$^m$ | AP$^m_{50}$ | AP$^m_{75}$ |
Swin-T [40] | 267G | 42.7 | 65.2 | 46.8 | 39.3 | 62.2 | 42.2 | 46.0 | 68.1 | 50.3 | 41.6 | 65.1 | 44.9 |
ConvNeXt-T [41] | 262G | 44.2 | 66.6 | 48.3 | 40.1 | 63.3 | 42.8 | 46.2 | 67.9 | 50.8 | 41.7 | 65.0 | 44.9 |
CSWin-T [14] | 279G | 46.7 | 68.6 | 51.3 | 42.2 | 65.6 | 45.4 | 49.0 | 70.7 | 53.7 | 43.6 | 67.9 | 46.6 |
ViM-S [70] | 218G | 44.9 | 67.1 | 49.3 | 41.0 | 64.2 | 44.1 | - | - | - | - | - | - |
VMamba-T [38] | 286G | 46.5 | 68.5 | 50.7 | 42.1 | 65.5 | 45.3 | 48.5 | 69.9 | 52.9 | 43.2 | 66.8 | 46.3 |
L-Vmamba-T [31] | 291G | 46.7 | 68.7 | 50.8 | 42.2 | 65.7 | 45.5 | 48.7 | 70.1 | 53.0 | 43.4 | 67.0 | 46.4 |
GrootV-T (Ours) | 265G | 47.0 | 69.4 | 51.5 | 42.7 | 66.4 | 46.0 | 49.0 | 70.8 | 54.0 | 43.8 | 67.6 | 47.1 |
ViT-Adapter-S [4] | 403G | 44.7 | 65.8 | 48.3 | 39.9 | 62.5 | 42.8 | 48.2 | 69.7 | 52.5 | 42.8 | 66.4 | 45.9 |
Swin-S [40] | 354G | 44.8 | 66.6 | 48.9 | 40.9 | 63.4 | 44.2 | 48.2 | 69.8 | 52.8 | 43.2 | 67.0 | 46.1 |
ConvNeXt-S [41] | 348G | 45.4 | 67.9 | 50.0 | 41.8 | 65.2 | 45.1 | 47.9 | 70.0 | 52.7 | 42.9 | 66.9 | 46.2 |
InternImage-S [57] | 340G | 47.8 | 69.8 | 52.8 | 43.3 | 67.1 | 46.7 | 49.7 | 71.1 | 54.5 | 44.5 | 68.5 | 47.8 |
VMamba-S [38] | 400G | 48.2 | 69.7 | 52.5 | 43.0 | 66.6 | 46.4 | 49.7 | 70.4 | 54.2 | 44.0 | 67.6 | 47.3 |
L-Vmamba-S [31] | 414G | 48.4 | 69.9 | 52.7 | 43.2 | 66.7 | 46.5 | 49.9 | 70.5 | 54.4 | 44.1 | 67.8 | 47.4 |
GrootV-S (Ours) | 341G | 48.6 | 70.3 | 53.5 | 43.6 | 67.5 | 47.1 | 50.1 | 71.2 | 54.9 | 44.6 | 68.7 | 47.8 |
A.3 Semantic Segmentation.
We optimize GrootV-T/S using the AdamW optimizer with an initial learning rate of which is decayed by a rate of with the polynomial decay schedule following [57]. The number of warmup iterations is set to with an initial learning rate of . The default input resolution is , and FLOPs are calculated with an input size of . The models are trained with eight 32GB V100 GPUs by default.
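A learning-rate schedule of the kind described above (linear warmup followed by polynomial decay) can be sketched as follows. This shows one common form of such a schedule and is not taken from the authors' code: the function name `poly_lr`, its parameters, and the reading of the decay rate as the exponent `power` are assumptions, and the numbers in the example are hypothetical.

```python
def poly_lr(step, total_steps, base_lr, power, warmup_steps, warmup_start_lr):
    """Learning rate at `step`: linear warmup, then polynomial decay to zero."""
    if step < warmup_steps:
        # Linear ramp from warmup_start_lr up to base_lr.
        frac = step / warmup_steps
        return warmup_start_lr + frac * (base_lr - warmup_start_lr)
    # Polynomial decay over the remaining steps (power is assumed here to
    # play the role of the decay rate mentioned in the text).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * (1.0 - progress) ** power


# Hypothetical settings, for illustration only.
schedule = [poly_lr(s, 100, 1.0, 1.0, 10, 0.1) for s in (0, 10, 55, 100)]
print(schedule)
```

With `power = 1.0` the decay is linear; larger exponents lower the rate faster early in training.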
Appendix B Language Tree Topology Scanning Operator
Appendix C Algorithm Proof
In this section, we present detailed proofs for our tree scanning algorithm. The definitions of symbols are consistent with those in the main paper.
C.1 Proof for Algorithm 1.
We take an arbitrary vertex in the MST as the root. According to the definition of the tree scanning algorithm introduced in Sec. 3.2, the hidden state of the root can be derived as follows:
(9) |
which describes an aggregation from all leaf vertices to the root. In this stage, each vertex therefore depends only on its children. Taking one vertex as an example, its aggregation value can be derived as:
(10) |
We assume that one of the children of is , and can be derived as follows:
(11) |
where indicates the aggregation value from the vertices to vertex . Therefore, we can obtain the propagation relationship between the hidden state of parent and child :
(12) | ||||
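Through the above derivation, we can calculate the hidden states of all vertices with only two traversals (i.e., the aggregation from the leaves to the root and the propagation from the root to the leaves) in the forward process, as shown in Algorithm 1, thereby reducing the computational complexity from quadratic to linear in the number of vertices.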
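The two-traversal idea can be illustrated with a small sketch that compares it against a direct pairwise evaluation. It is written under simplifying assumptions and is not the authors' implementation: the names `tree_scan` and `brute_force` are hypothetical, each edge carries one scalar weight `w[i]` standing in for the state transition, `u[i]` stands in for the per-vertex input term, and vertices are assumed to be numbered so that every parent precedes its children (for example in BFS order).

```python
def tree_scan(parent, w, u):
    """Two-pass evaluation of h[i] = sum_j S(i, j) * u[j] on a rooted tree.

    parent[i] is the parent of vertex i (-1 for the root, vertex 0), with
    parent[i] < i. w[i] is the weight of the edge between i and parent[i].
    S(i, j) is the product of edge weights on the path between i and j.
    """
    n = len(parent)
    # Pass 1 (leaves -> root): xi[i] aggregates the subtree rooted at i.
    xi = list(u)
    for i in range(n - 1, 0, -1):
        xi[parent[i]] += w[i] * xi[i]
    # Pass 2 (root -> leaves): add what arrives through the parent, after
    # removing the share of h[parent] that came from i's own subtree.
    h = list(xi)
    for i in range(1, n):
        p = parent[i]
        h[i] = xi[i] + w[i] * (h[p] - w[i] * xi[i])
    return h


def brute_force(parent, w, u):
    """Reference evaluation that walks the tree path for every vertex pair."""
    n = len(parent)

    def path_weight(i, j):
        prod = 1.0
        # The larger index is never the common ancestor, so climb from it.
        while i != j:
            if i > j:
                prod *= w[i]
                i = parent[i]
            else:
                prod *= w[j]
                j = parent[j]
        return prod

    return [sum(path_weight(i, j) * u[j] for j in range(n)) for i in range(n)]


# Hypothetical 5-vertex tree: 0 is the root, 1 and 2 its children,
# 3 and 4 the children of 1.
parent = [-1, 0, 0, 1, 1]
w = [0.0, 0.5, 0.8, 0.9, 0.3]
u = [1.0, 2.0, 3.0, 4.0, 5.0]
fast, slow = tree_scan(parent, w, u), brute_force(parent, w, u)
print(max(abs(a - b) for a, b in zip(fast, slow)))
```

The reference version visits every pair of vertices, while the two-pass version touches each edge once per traversal, which is where the linear cost comes from.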
Next, we analyze the backpropagation process in Algorithm 1. According to the chain rule, we can easily calculate the derivative of with respect to :
(13) | ||||
Similarly, the derivative of with respect to is:
(14) | ||||
The above formulas are equivalent to replacing the input with during the forward process.
Subsequently, we assume that vertex is the child of vertex and define to denote the children of vertex in the tree rooted at vertex . is formulated as follows:
(15) | ||||
This completes the proof of the forward and backward passes of Algorithm 1.
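The observation that back-propagation reuses the forward scan can be illustrated with a short worked derivation. The notation is an assumption introduced for illustration and may differ from the symbols used in the main paper: $\Omega$ is the vertex set, $u_j$ the per-vertex input term, $S_{ij}$ the product of transition weights along the tree path between vertices $i$ and $j$ (with $S_{ii} = 1$), and $g_i = \partial \mathrm{Loss} / \partial h_i$. It further assumes $S_{ij} = S_{ji}$, which holds when each edge carries a single weight used in both directions.

```latex
\begin{aligned}
h_i &= \sum_{j \in \Omega} S_{ij}\, u_j, \\
\frac{\partial \mathrm{Loss}}{\partial u_j}
  &= \sum_{i \in \Omega} \frac{\partial \mathrm{Loss}}{\partial h_i}\,
     \frac{\partial h_i}{\partial u_j}
   = \sum_{i \in \Omega} S_{ij}\, g_i
   = \sum_{i \in \Omega} S_{ji}\, g_i .
\end{aligned}
```

The last expression has the same form as the forward sum with $g$ in place of $u$, so under these assumptions the same two traversals would evaluate it in linear time.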
C.2 Proof for Algorithm 2.
We only take the last token as root and replace the transition source from to in sequence modeling tasks like natural language understanding to ensure causality. Therefore, only one traversal (from to ) is required for the forward process, and another traversal (from to ) is needed for the backpropagation process. The proof is similar to that of Algorithm 1.
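One possible reading of this causal variant is sketched below. It is an illustration under stated assumptions and not the authors' Algorithm 2: the name `causal_tree_scan` is hypothetical, each token is assumed to attach to a later token (so `parent[i] > i`, with the last token as root), the hidden state of a token is assumed to sum only over itself and its descendants, and scalar weights again stand in for the state transitions.

```python
def causal_tree_scan(parent, w, u):
    """One-pass scan in which h[i] sums over token i and its descendants.

    Assumes parent[i] > i for every non-root token, so every descendant of
    token i appears earlier in the sequence. The last token is the root
    and has parent -1. w[i] weights the edge between i and parent[i].
    """
    n = len(parent)
    h = list(u)
    # Sequence order visits children before parents, so h[i] is final
    # by the time it is pushed to its parent.
    for i in range(n - 1):
        h[parent[i]] += w[i] * h[i]
    return h


# Hypothetical 4-token example: tokens 0 and 1 attach to 2, token 2 to 3.
parent = [2, 2, 3, -1]
w = [0.5, 0.25, 0.5, 0.0]
base = causal_tree_scan(parent, w, [1.0, 1.0, 1.0, 1.0])
changed = causal_tree_scan(parent, w, [1.0, 1.0, 1.0, 9.0])
print(base, changed)
```

Changing the input of the last token leaves every earlier hidden state untouched, which is the causality property the single traversal is meant to preserve.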
Appendix D More Qualitative Results
Fig. 6 displays additional qualitative comparisons between our algorithm and previous scanning strategies (cross-scanning and raster-scanning), illustrating its stronger capability to perceive detailed structural information and capture long-range dependencies.
Appendix E Statistical Significance
Method | PIQA | Arc-Easy | SST | WinoGrande | LAM-ppl | Race | Openbookqa |
GrootL (Ours) | 0.011 | 0.010 | 0.016 | 0.014 | 0.553 | 0.014 | 0.018 |
Table 8 reports the standard deviation of GrootL on the language model benchmarks, computed with the open-source lm-evaluation-harness project [17].
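As a rough guide to the magnitude of such values, the sketch below computes the standard error of a mean over per-example scores, a dispersion estimate that evaluation harnesses commonly report next to each metric. Whether the values in Table 8 were obtained in exactly this way is an assumption; the function name `mean_stderr` and the example data are hypothetical.

```python
import math


def mean_stderr(scores):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return math.sqrt(var) / math.sqrt(n)


# Hypothetical benchmark: 500 questions, 100 of them answered correctly.
scores = [1.0] * 100 + [0.0] * 400
stderr = mean_stderr(scores)
print(stderr)
```

Because the estimate shrinks with the square root of the number of examples, smaller benchmarks yield larger values at a comparable accuracy.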