Visual object counting—the task of estimating the number of objects in an image—has seen remarkable advancements in recent years, driven by transformer architectures and multimodal learning. This blog post explores three groundbreaking approaches: CounTR, CounTX, and CountGD, each pushing the boundaries of what’s possible in open-world counting.
CounTR: Transformer-Based Generalized Visual Counting
Introduced in 2022, CounTR (Counting Transformer) revolutionized class-agnostic counting by leveraging transformer architectures to count objects from arbitrary categories using minimal exemplars (zero-shot or few-shot).
Contributions
- Transformer Architecture: CounTR uses a ViT (Vision Transformer) backbone to capture patch-wise similarities via attention mechanisms, enabling it to generalize across unseen object categories.
- Two-Stage Training:
  - Self-supervised pre-training: masked image modeling (MAE-style) learns robust visual representations.
  - Supervised fine-tuning: the model is tuned to predict density maps, where the object count is obtained by summing the map's values.
- Mosaic Data Augmentation: To address data imbalance (e.g., few images with high object counts), CounTR synthesizes training images by blending patches from multiple images, enhancing diversity and scale robustness.
- Exemplar Flexibility: Unlike class-specific counters, CounTR adapts to user-provided exemplars (bounding boxes), making it versatile for real-world applications.
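The density-map formulation above is worth making concrete: the model predicts a per-pixel "count mass," and summing the map gives the total count. A minimal sketch (the blobs below are crude stand-ins for the Gaussian-smoothed annotations used in practice; CounTR's actual encoder-decoder is omitted):

```python
import numpy as np

def count_from_density_map(density_map: np.ndarray) -> float:
    """A density map assigns each pixel a fractional object count;
    summing all values yields the estimated total count."""
    return float(density_map.sum())

# Toy example: two "objects", each represented by a unit-mass blob.
h, w = 64, 64
density = np.zeros((h, w), dtype=np.float32)
for cy, cx in [(16, 16), (48, 40)]:
    # Spread unit mass over a 3x3 neighborhood (a crude Gaussian stand-in).
    density[cy - 1:cy + 2, cx - 1:cx + 2] += 1.0 / 9.0

print(round(count_from_density_map(density)))  # each blob integrates to 1, so: 2
```

Because each annotated object contributes total mass 1, the sum is invariant to how spread out the blobs are, which is what makes density regression robust to overlap and partial occlusion.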
The architecture of CounTR.
Performance
CounTR achieved state-of-the-art results on benchmarks like FSC-147, reducing mean absolute error by 18.3% over prior methods. Its ability to count objects like cars, animals, or even abstract shapes without retraining marked a leap toward general-purpose vision systems.
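The mosaic augmentation mentioned above can be sketched roughly as follows. This is an illustrative simplification, not the paper's exact blending procedure, and the nearest-neighbor resize is only there to keep the sketch dependency-free:

```python
import numpy as np

def mosaic(images: list[np.ndarray], out_size: int = 384) -> np.ndarray:
    """Tile crops from four source images into a 2x2 mosaic.

    CounTR-style mosaicing synthesizes scenes with more objects (and
    more scale variety) than any single training image contains.
    """
    assert len(images) == 4
    half = out_size // 2
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    slots = [(0, 0), (0, half), (half, 0), (half, half)]
    for img, (y, x) in zip(images, slots):
        # Resize each source by nearest-neighbor index sampling.
        ys = np.linspace(0, img.shape[0] - 1, half).astype(int)
        xs = np.linspace(0, img.shape[1] - 1, half).astype(int)
        canvas[y:y + half, x:x + half] = img[np.ix_(ys, xs)]
    return canvas

# Usage: four random "images" of different sizes -> one 384x384 mosaic.
rng = np.random.default_rng(0)
srcs = [rng.integers(0, 255, (s, s, 3), dtype=np.uint8) for s in (100, 200, 300, 400)]
print(mosaic(srcs).shape)  # (384, 384, 3)
```

In the real pipeline the ground-truth density maps are tiled the same way, so the target count of the mosaic is simply the sum of the counts of its tiles.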
CounTX: Open-World Counting via Text Prompts
Building on CounTR’s success, CounTX (2023) tackled a harder problem: counting objects specified purely by text descriptions (e.g., "count the oranges on the table") without visual exemplars.
Advancements Over CounTR
- Text-to-Count Pipeline: CounTX replaces exemplars with text queries, using a transformer decoder atop pretrained joint text-image embeddings (e.g., CLIP). This eliminates the need for bounding boxes, enabling purely language-driven counting.
- Enhanced Dataset (FSC-147-D): The authors augmented FSC-147 with detailed text descriptions, allowing richer language inputs (e.g., "ripe strawberries" vs. "strawberries").
- Single-Stage Training: Unlike CounTR’s two-stage approach, CounTX is trained end-to-end, simplifying the workflow while maintaining accuracy.
Working principle of CounTX.
Results
CounTX outperformed text-based counting methods on FSC-147-D, proving that language can effectively guide attention in counting tasks. For example, it could distinguish between "sunglasses" and "eyeglass lenses" in the same image based on text prompts.
CountGD: Multi-Modal Open-World Counting
CountGD (2024) represents the next leap in visual counting by unifying text prompts and visual exemplars in a single, flexible framework. Built on the foundation of GroundingDINO, an open-vocabulary detection model, CountGD achieves state-of-the-art performance by fusing multi-modal inputs to count objects with unprecedented accuracy and generality.
Key Innovations
- Multi-Modal Prompt Fusion:
  - CountGD accepts text-only queries (e.g., "count the red apples"), visual exemplars (bounding boxes of example objects), or both simultaneously.
  - The model treats visual exemplars as "tokens" alongside text, enabling dynamic interactions between modalities via self-attention and cross-attention layers.
  - Example: a prompt like "count the birds on the left" combines text (semantic filtering) with exemplars (appearance grounding) to refine counts.
- Single-Stage Architecture:
  - Unlike two-stage systems such as DAVE, CountGD counts in a single forward pass, reducing complexity and improving speed.
  - It repurposes GroundingDINO's detection capabilities for counting by adding modules that aggregate detections into counts while preserving spatial accuracy.
- Open-World Flexibility:
  - CountGD outperforms class-specific and text-only models (e.g., CounTX, CLIP-Count) by leveraging both visual and textual cues. On the FSC-147 test set, it achieves an MAE of 5.74, surpassing prior state-of-the-art methods by 20%.
  - In zero-shot settings, it maintains robust performance (MAE of 12.98 on the FSC-147 test set) while generalizing to unseen categories.
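Two of the ideas above lend themselves to a short sketch: exemplars fused into the prompt as extra tokens, and detections aggregated into a count. All names, shapes, and the confidence threshold below are hypothetical, chosen only to mirror the structure described (DINO-style detectors typically use a few hundred detection queries):

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_prompts(text_tokens: np.ndarray,
                 exemplar_tokens: np.ndarray) -> np.ndarray:
    """Treat visual exemplars as extra prompt tokens alongside the text,
    so a single attention stack can mix both modalities."""
    return np.concatenate([text_tokens, exemplar_tokens], axis=0)

def count_detections(confidences: np.ndarray, threshold: float = 0.23) -> int:
    """Aggregate detections into a count by thresholding per-query
    confidence scores (the threshold value here is illustrative)."""
    return int((confidences > threshold).sum())

# Hypothetical inputs: 4 text tokens plus 3 exemplar tokens, 256-d each,
# and 900 detection-query confidences as a stand-in for model output.
prompt = fuse_prompts(rng.standard_normal((4, 256)), rng.standard_normal((3, 256)))
scores = rng.uniform(0, 1, 900)
print(prompt.shape)  # (7, 256)
print(count_detections(scores))
```

Because the fused prompt is just a longer token sequence, text-only, exemplar-only, and combined queries all flow through the same attention layers, which is what gives the single-stage design its flexibility.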
Conclusion
The progression from CounTR to CounTX and then CountGD shows a significant shift in how we interact with object counting models. Each model has expanded our ability to specify which objects we want to count, moving from visual cues to text descriptions and finally to a multi-modal approach. This evolution not only enhances the accuracy and versatility of object counting but also opens up new possibilities for applications where precise object specification is essential.
References
- Chang Liu et al., "CounTR: Transformer-based Generalised Visual Counting," BMVC, 2022. [Online]. Available: https://arxiv.org/abs/2208.13721
- Niki Amini-Naieni et al., "Open-world Text-specified Object Counting," BMVC, 2023. [Online]. Available: https://arxiv.org/abs/2306.01851
- CounTX project page, Visual Geometry Group, University of Oxford. [Online]. Available: https://www.robots.ox.ac.uk/~vgg/research/countx/
- Niki Amini-Naieni et al., "CountGD: Multi-Modal Open-World Counting," NeurIPS, 2024. [Online]. Available: https://arxiv.org/abs/2407.04619
- CountGD project page, Visual Geometry Group, University of Oxford. [Online]. Available: https://www.robots.ox.ac.uk/~vgg/research/countgd/