The ability to accurately interpret complex visual information is a key focus of multimodal large language models (MLLMs). Recent studies have shown that enhanced visual perception significantly reduces hallucinations and improves performance on resolution-sensitive tasks such as optical character recognition and document analysis. Several recent MLLMs achieve this by using a mixture of vision encoders. Despite their success, systematic comparisons and detailed ablation studies addressing key aspects such as expert selection and the integration of multiple vision experts are still lacking. This article presents Eagle, a framework that extensively explores the design space of MLLMs built with a mixture of vision encoders and resolutions. The findings reveal several fundamental principles shared by various existing strategies, leading to a streamlined yet effective design approach. Eagle finds that simply concatenating visual tokens from a set of complementary vision encoders is as effective as more complex mixture architectures and strategies. Additionally, Eagle introduces Pre-Alignment, which bridges the gap between vision-focused encoders and language tokens and improves model coherence. The resulting family of MLLMs, Eagle, outperforms other leading open-source models on major MLLM benchmarks.
Eagle’s research relates to the general architectural design of multimodal large language models (MLLMs). In addition to the representative open-source research lines mentioned above, other notable MLLM families include, but are not limited to, MiniGPT-4, Lynx, Otter, QwenVL, CogVLM, VILA, GPT-4V, Gemini, and Llama 3.1. Depending on how visual signals are integrated into the language model, MLLMs can be roughly categorized into “cross-modal attention” models and “prefix tuning” models. The former inject visual information into different layers of the LLM through cross-modal attention, while the latter treat visual tokens as part of the language token sequence and prepend them directly to the text embeddings. Eagle’s model belongs to the prefix-tuning family, following an LLaVA-style multimodal architecture. Since MLLMs are a rapidly growing field, Eagle encourages readers to refer to more in-depth studies and surveys for further insights.
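As a concrete illustration of the prefix-tuning design, the minimal sketch below projects visual tokens into the LLM embedding space and prepends them to the text embeddings. The module and dimension names are illustrative, not Eagle's actual code.

```python
import torch
import torch.nn as nn

class PrefixTuningFusion(nn.Module):
    """Minimal sketch: visual tokens are projected and prepended to text embeddings."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        # A simple MLP projector, as in LLaVA-style architectures
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_tokens, text_embeddings):
        # visual_tokens:   (batch, num_patches, vision_dim)
        # text_embeddings: (batch, seq_len, llm_dim)
        visual_prefix = self.projector(visual_tokens)
        # The LLM then attends over the combined token sequence
        return torch.cat([visual_prefix, text_embeddings], dim=1)
```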
Eagle’s work is closely related to research on improving vision encoder design for MLLMs. Early work typically employed vision encoders pre-trained on vision-language alignment tasks, such as CLIP and EVA-CLIP. More powerful vision encoders such as SigLIP and InternVL have since been proposed to enhance vision-language tasks through better designs, larger model sizes, and more effective training recipes. High-resolution adaptation is frequently performed to increase the MLLM input resolution, since models are often pre-trained on low-resolution images and may lack the ability to encode fine details. In addition to high-resolution adaptation, models such as LLaVA-NeXT, LLaVA-UHD, Monkey, InternLM-XComposer, and InternVL use tiling or adaptive tiling to process high-resolution inputs, splitting the image into lower-resolution patches that are processed separately. Introducing additional vision experts also allows higher resolutions to be processed; this approach differs from tiling, but the two are compatible and can be combined.
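For intuition, basic tiling can be as simple as the sketch below, which splits an image tensor into fixed-size, non-overlapping tiles that a low-resolution encoder processes separately. Adaptive tiling schemes (as in LLaVA-NeXT) are more involved; this is only an illustration under the stated assumptions.

```python
import torch

def tile_image(image, tile_size=336):
    """Minimal sketch of fixed-grid tiling: split an image tensor (C, H, W) into
    non-overlapping tiles of tile_size x tile_size. Assumes H and W are multiples
    of tile_size (real systems pad or resize the image first)."""
    c, h, w = image.shape
    tiles = (
        image.unfold(1, tile_size, tile_size)   # (C, H//t, W, t)
             .unfold(2, tile_size, tile_size)   # (C, H//t, W//t, t, t)
             .permute(1, 2, 0, 3, 4)            # (H//t, W//t, C, t, t)
             .reshape(-1, c, tile_size, tile_size)
    )
    return tiles  # each tile is encoded separately by the low-resolution encoder
```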
The success of large language models (LLMs) has sparked interest in extending them with visual perception so they can see, understand, and reason about the real world. At the core of these multimodal large language models (MLLMs) is a typical design in which an image is converted into a set of visual tokens by a vision encoder and prepended to the text embeddings. CLIP is often chosen as the vision encoder because its visual representations are aligned with the text space through pre-training on image-text pairs. Depending on the architecture, training recipe, and how the visual tokens are injected into the language model, prominent families of MLLMs include Flamingo, BLIP, PaLI, PaLM-E, and LLaVA. Most of these models maintain a relatively low input resolution due to the limits of the pre-trained vision encoder and the sequence-length limit of the LLM. Eagle’s work is closely related to models that use multiple vision encoders to improve perception. Mini-Gemini and LLaVA-HR propose fusing high-resolution visual features into low-resolution visual tokens. Beyond the resolution issue, these pre-trained vision encoders may lack certain capabilities such as text reading and object localization. To address this, various models integrate vision encoders pre-trained on different vision tasks to enhance the vision encoder’s capabilities.
For example, models such as Mousi and Brave fuse visual tokens from different vision encoders by concatenating along the channel or token dimension. RADIO introduces multi-teacher distillation, which combines the features of different vision encoders into one model. MoAI, IVE, and Prismer further use the outputs of vision experts such as OCR, detection, and depth estimation to supply additional information for the MLLM when generating answers. MoVA devises a routing network that assigns the best vision model based on the given image and instructions.
Recent studies have shown that designing stronger vision encoders is crucial for reducing MLLM hallucinations and improving performance on resolution-sensitive tasks such as optical character recognition (OCR). Several studies have focused on enhancing vision encoders by scaling up pre-training data and parameters or by splitting images into lower-resolution patches. However, these approaches often impose large demands on training resources. An efficient and powerful alternative is to mix vision encoders pre-trained on different tasks and input resolutions: a high-resolution encoder can be fused with a CLIP encoder, features from different encoders can be appended along the sequence dimension, or more complex fusion and routing strategies can be employed to maximize the benefits of different encoders. Although this “mixture of vision experts” approach has proven effective, a detailed study of the design space with rigorous ablation is still lacking, which motivated Eagle to revisit this area. Key questions are which combination of vision encoders to choose, how to fuse the different experts, and how to tailor the training strategy as more vision encoders are added.
To answer these questions, Eagle systematically explores the design space of mixed vision encoders to improve MLLM perception. The exploration involves the following steps: 1) benchmark different vision encoders and explore recipes for high-resolution adaptation; 2) compare vision encoder fusion strategies in an apples-to-apples setting; 3) progressively identify the optimal combination of multiple vision encoders; 4) improve vision-expert pre-training and the data mixture. The exploration steps are shown in the following figure.
Eagle’s studies cover the performance of pre-trained vision encoders across a range of tasks and resolutions, including vision-language alignment, self-supervised learning, detection, segmentation, and OCR. Using a round-robin approach, Eagle starts with the basic CLIP encoder and adds one expert at a time, selecting in each round the expert that provides the best improvement.
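This round-robin procedure amounts to a greedy search over encoder combinations. The sketch below is a schematic, assuming a hypothetical `train_and_evaluate` helper that trains a model with a given list of encoders and returns a benchmark score; it is not Eagle's actual selection code.

```python
def greedy_encoder_search(base_encoder, candidates, train_and_evaluate, max_experts=5):
    """Schematic of the round-robin search: starting from the base CLIP encoder,
    each round evaluates every remaining candidate added to the current mix and
    keeps the one that improves the benchmark score the most."""
    selected = [base_encoder]
    best_score = train_and_evaluate(selected)
    while candidates and len(selected) < max_experts:
        best_candidate, best_round_score = None, best_score
        for candidate in candidates:
            score = train_and_evaluate(selected + [candidate])
            if score > best_round_score:
                best_candidate, best_round_score = candidate, score
        if best_candidate is None:
            break  # no remaining expert improves the current combination
        selected.append(best_candidate)
        candidates.remove(best_candidate)
        best_score = best_round_score
    return selected
```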
Although Eagle’s work is not the first to leverage multiple vision encoders in MLLMs, its systematic study of this setting yields several important findings.
- Unlocking the vision encoders during MLLM training is important. This contrasts with models such as LLaVA, and with other work that considers multiple vision encoders or teachers, where freezing the vision encoders is common practice.
- Some recently proposed fusion strategies do not show significant benefits. Instead, channel concatenation emerges as a simple yet competitive fusion strategy, offering the best efficiency and performance.
- Incorporating additional vision experts provides consistent gains, making it a promising way to systematically enhance MLLM perception beyond scaling up a single encoder. The improvements are especially noticeable once the vision encoders are unlocked.
- The pre-alignment stage is important. Eagle introduces a pre-training stage in which text-unaligned vision experts are individually fine-tuned with a frozen LLM before being trained together, which significantly improves MLLM performance under the mixed-vision-encoder design.
Eagle: Methodology and Architecture
Unlike previous methods that focus on novel fusion strategies and architectures among vision encoders, Eagle’s goal is to identify a minimal design for fusing different vision encoders, supported by detailed ablations that strip away unnecessary components. As shown in the following figure, Eagle starts by extending the basic CLIP encoder to a set of vision experts with different architectures, pre-training tasks, and resolutions. Using these experts, Eagle then compares different fusion architectures and methods and explores how to optimize pre-training strategies across multiple encoders.
Finally, Eagle integrates all findings and extends the approach to multiple expert vision encoders with different resolutions and domain knowledge. Using the same pre-training data as LLaVA-1.5, consisting of 595,000 image-text pairs, Eagle then moves to a supervised fine-tuning stage by collecting data from a set of tasks and converting them into multimodal conversations, including LLaVA-1.5, Laion-GPT4V, ShareGPT-4V, DocVQA, synDog-EN, ChartQA, DVQA, and AI2D, resulting in 934,000 examples.
The model is first pre-trained on the image-text pairs for one epoch with a batch size of 256, where the entire model is frozen and only the projector layer is updated. In the second stage, the model is fine-tuned on the supervised fine-tuning data for one epoch with a batch size of 128. In this study, Eagle employs Vicuna-7B as the underlying language model. The learning rate is set to 1e-3 in the first stage and 2e-5 in the second stage.
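For reference, the two-stage recipe above can be summarized in a configuration like the following. The field names are illustrative, and the set of components listed as trainable in the second stage follows this article's observation that the vision encoders are unlocked during training; it is not taken from Eagle's released configuration files.

```python
TRAINING_STAGES = {
    "pretrain": {                      # stage 1: align the projector on image-text pairs
        "data": "LLaVA-1.5 pre-training set (595k image-text pairs)",
        "trainable": ["projector"],    # vision encoders and LLM stay frozen
        "epochs": 1,
        "batch_size": 256,
        "learning_rate": 1e-3,
    },
    "sft": {                           # stage 2: supervised fine-tuning on 934k conversations
        "data": "multimodal SFT mixture (934k examples)",
        "trainable": ["vision_encoders", "projector", "llm"],  # assumed: encoders unlocked
        "epochs": 1,
        "batch_size": 128,
        "learning_rate": 2e-5,
    },
}
```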
More powerful CLIP encoder
Eagle begins its investigation with CLIP models, which have become the go-to choice for many MLLMs. CLIP models are known to enhance multimodal tasks, but their limitations are also well documented. For example, many existing MLLMs use the pre-trained CLIP resolutions (e.g., 224 × 224 or 336 × 336) as their input resolution. At these resolutions, the encoder often struggles to capture fine details that are important for resolution-sensitive tasks such as OCR and document understanding.
A common approach to increasing the input resolution is tiling, where the input image is split into tiles that are encoded separately. Another, more straightforward approach is to scale up the input resolution directly and interpolate the position embeddings of the vision transformer as needed. Eagle compares these two approaches with frozen and unfrozen vision encoders across a range of resolutions and presents the results in the table above. The results can be summarized as follows:
- Unfreezing the CLIP encoder provides significant improvements when interpolating to a higher MLLM input resolution that differs from the CLIP pre-training resolution, with no performance degradation when the resolutions match.
- Freezing the CLIP encoder and directly adapting it to the higher MLLM input resolution results in a significant performance degradation.
- Among the compared strategies, direct interpolation to 448 × 448 using the unfrozen CLIP encoder proves to be effective and efficient in terms of performance and cost.
- The best CLIP encoder achieves performance close to InternVL, despite having less pre-training data and a much smaller model (300M vs. 6B).
It is worth noting that CLIP-448 allows Eagle to match the setting of LLaVA-HR and InternVL, where the CLIP encoders are similarly adapted to take 448 × 448 input and output 1024 patch tokens. For the rest of the investigation, Eagle follows this simple strategy of scaling up the input resolution and unlocking the vision encoder during training.
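Direct interpolation here means resampling the ViT's position embeddings to the larger patch grid. The sketch below shows one common way to do this with bicubic interpolation (e.g., from the 24 × 24 grid of a 336-pixel CLIP ViT-L/14 to the 32 × 32 grid of 448-pixel input, giving 1024 patch tokens); it is an illustration rather than Eagle's exact code.

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed, old_grid=24, new_grid=32):
    """Resample ViT position embeddings from an old patch grid (e.g. 336/14 = 24)
    to a new one (e.g. 448/14 = 32). Assumes shape (1, 1 + old_grid**2, dim),
    with the first token being the CLS embedding."""
    cls_embed, patch_embed = pos_embed[:, :1], pos_embed[:, 1:]
    dim = patch_embed.shape[-1]
    # (1, old_grid*old_grid, dim) -> (1, dim, old_grid, old_grid)
    patch_embed = patch_embed.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_embed = F.interpolate(patch_embed, size=(new_grid, new_grid),
                                mode="bicubic", align_corners=False)
    # back to (1, new_grid*new_grid, dim)
    patch_embed = patch_embed.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_embed, patch_embed], dim=1)
```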
Eagle observes that existing fusion strategies, despite their diversity in design, can be broadly categorized as follows:
- Sequence append: Directly appending visual tokens from different backbones as a longer sequence.
- Channel concatenation: Concatenating visual tokens along the channel dimension without increasing the sequence length (see the sketch after this list).
- LLaVA-HR: Injecting high-resolution features into a low-resolution vision encoder using a mixture-of-resolution adapter.
- Mini-Gemini: Using CLIP tokens as low-resolution queries to cross-attend to another high-resolution vision encoder within co-located local windows.
- Deformable attention: A new baseline introduced on top of Mini-Gemini, in which the vanilla window attention is replaced with deformable attention.
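To make the first two options concrete, the sketch below contrasts sequence append with channel concatenation, assuming each expert's tokens have already been brought to compatible shapes; it is a schematic rather than Eagle's implementation.

```python
import torch

def sequence_append(token_lists):
    """Option 1: append visual tokens from different encoders as a longer sequence.
    Assumes each tensor is (batch, num_tokens_i, dim) with a shared channel dim,
    e.g. after per-encoder projection."""
    return torch.cat(token_lists, dim=1)

def channel_concat(token_lists):
    """Option 2: concatenate along the channel dimension, keeping the sequence
    length fixed. Assumes every encoder has been resampled to the same number of
    tokens (e.g. 1024); the channel width grows instead of the sequence length."""
    return torch.cat(token_lists, dim=-1)

# Example: three experts, each producing 1024 tokens
# experts = [clip_tokens, convnext_tokens, pix2struct_tokens]  # each (B, 1024, D_i)
# fused = channel_concat(experts)   # (B, 1024, D_1 + D_2 + D_3)
# llm_inputs = projector(fused)     # a single projector maps into the LLM space
```

Channel concatenation keeps the LLM input sequence as long as a single encoder's output regardless of how many experts are added, which is why it offers the best efficiency among the compared strategies.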
Instead of training the projectors to align multiple vision experts simultaneously, as in LLaVA’s original pre-training strategy, Eagle first aligns the representation of each individual expert with a smaller language model (in practice, Vicuna-7B) using next-token-prediction supervision. As shown in the figure below, with pre-alignment the whole training process consists of three steps: 1) train each pre-trained vision expert with its own projector on the SFT data while keeping the language model frozen; 2) combine all the vision experts from the first step and train only the projector on the image-text pair data; 3) train the whole model on the SFT data.
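In terms of which parameters are trainable, the three steps can be sketched as follows. The attribute names (`experts`, `projector`, `llm`) and the `train` helper are placeholders for illustration, not Eagle's actual code.

```python
def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

def pre_alignment_schedule(model, sft_data, image_text_pairs, train):
    """Schematic of the three pre-alignment steps (not Eagle's exact code).
    `model` is assumed to expose `experts` (each with its own projector),
    a fused `projector`, and an `llm`; `train(...)` stands for a training loop."""
    # Step 1: fine-tune each text-unaligned expert with its own projector
    # on the SFT data, keeping the language model frozen.
    set_trainable(model.llm, False)
    for expert in model.experts:
        set_trainable(expert, True)   # includes the expert's own projector
        train(expert, sft_data)

    # Step 2: combine all experts and train only the fused projector
    # on the image-text pair data.
    for expert in model.experts:
        set_trainable(expert, False)
    set_trainable(model.projector, True)
    train(model, image_text_pairs)

    # Step 3: unfreeze everything and train the whole model on the SFT data.
    set_trainable(model.llm, True)
    for expert in model.experts:
        set_trainable(expert, True)
    train(model, sft_data)
```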
Eagle: Experiments and Results
After meticulously developing the strategy, Eagle established the following principles for the model: (1) integrate more vision experts using an optimized training recipe; (2) combine multiple vision experts by direct channel concatenation; (3) pre-train the vision experts individually through pre-alignment. In this section, to further demonstrate the advantages of the Eagle model, additional training data is incorporated and Eagle is compared with current state-of-the-art MLLMs across various tasks. Eagle uses Vicuna-v1.5-7B, Llama3-8B, and Vicuna-v1.5-13B as language models. For the vision encoders, based on the results in Section 2.6, the resulting models are denoted Eagle-X4, which includes four vision encoders (CLIP, ConvNeXt, Pix2Struct, and EVA-02), and Eagle-X5, which includes an additional SAM vision encoder.
Visual question answering task
Eagle compares the model series on three Visual Question Answering (VQA) benchmarks: GQA, VQAv2, and VizWiz. As shown in the following table, Eagle-X5 achieves state-of-the-art performance on GQA and VQAv2, highlighting the benefit of incorporating additional vision experts.
OCR and chart comprehension tasks
To evaluate Eagle’s OCR, document, and chart understanding capabilities, the model is benchmarked on OCRBench, TextVQA, and ChartQA. As shown in the table above, Eagle benefits from its high-resolution adaptation and the integration of various vision encoders, significantly outperforming the competition on TextVQA. Notably, Eagle maintains a simple design, supporting up to 1,024 tokens without requiring complex tile decomposition of the image.
The figure below shows an OCR and document understanding case: with higher-resolution adaptation and more vision experts, Eagle can identify small text in images and accurately extract information according to user instructions.
To better understand the benefit of introducing experts pre-trained on other vision tasks, the following figure compares a model using only the ConvNeXt and CLIP vision encoders with Eagle-X5. With the full set of vision encoders, Eagle-X5 corrects the errors made by the reduced model. This shows that even when equipped with a high-resolution vision encoder pre-trained with vision-language alignment, Eagle’s capabilities are further enhanced by integrating additional vision experts pre-trained on different vision tasks.
Multimodal Benchmark Evaluation
Eagle is evaluated on seven MLLM benchmarks to demonstrate its capabilities from various perspectives: MME, MMBench, SEED, MathVista, MMMU, ScienceQA, and POPE. Specifically, MME, MMBench, and SEED evaluate overall performance on various real-world tasks involving reasoning, recognition, knowledge, and OCR. MMMU focuses on challenging problems from various domains that require college-level knowledge. POPE evaluates visual hallucination in MLLMs. The metrics follow the default settings of these benchmarks. Eagle reports the perception score for MME, the en_dev split for MMBench, the image split of SEED, the test-mini split of MathVista, the val split of MMMU, the F1 score for POPE, and the image score for ScienceQA, to ensure consistency with the numbers reported by other models.
Conclusion
In this article, we presented Eagle, a deep analysis of the design space for integrating vision encoders into multimodal large language models. Unlike previous work that focuses on designing new fusion paradigms, Eagle shows that systematic design choices matter and uncovers a set of useful techniques. Step by step, Eagle optimizes the training recipes of individual vision encoders, identifies a scalable and efficient fusion method, and progressively combines vision encoders with different domain knowledge. The results highlight the importance of these fundamental design-space considerations.