
CVPR 2023 Conference Highlights

Published Aug 11, 2023

Four members of Amii's staff recently attended the Conference on Computer Vision and Pattern Recognition (CVPR), sponsored by the Computer Vision Foundation (CVF) and the Institute of Electrical and Electronics Engineers (IEEE). In part two, below, we present the main results and papers that caught our eye. Previously, in part one, we discussed the team's takeaways from the major keynote presentations.

Here are brief highlights of the new and improved technologies and trends that stood out to us.

Vision Transformers (ViTs)

Inspired by the success of transformers in Natural Language Processing (NLP), ViTs were initially introduced by Google Research in 2020 as an alternative to convolutional neural networks (CNNs). Although ViTs have already achieved top performance in many computer vision tasks, challenges remain, some of which were addressed at CVPR, such as generalization, efficiency, and robustness.
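
For readers new to the architecture, here is a minimal sketch of the core idea in PyTorch: the image is split into fixed-size patches, each patch is embedded as a token, and the token sequence is processed by a standard transformer encoder. The layer sizes below are arbitrary and for illustration only, not those of any published model.

```python
# Minimal ViT-style patch embedding + encoder sketch (PyTorch).
# Sizes are illustrative only, not those of any published model.
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=192, depth=4, heads=3, n_classes=1000):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        # A strided convolution is a common way to split the image into patch tokens.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, x):                              # x: (B, 3, H, W)
        tokens = self.patch_embed(x)                   # (B, dim, H/patch, W/patch)
        tokens = tokens.flatten(2).transpose(1, 2)     # (B, n_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)
        return self.head(tokens[:, 0])                 # classify from the [CLS] token

logits = TinyViT()(torch.randn(2, 3, 224, 224))        # -> shape (2, 1000)
```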

Regarding generalization, the Analysis-by-Synthesis ViT (AbSViT) work enables ViTs to extract task-adaptive representations that generalize more readily to various tasks. When it comes to efficiency, Rep Identity Former (RIFormer) proposes removing token mixers from the base ViT blocks to make the vision backbone more efficient. EfficientViT by Microsoft Research was another work on efficiency, proposing a new memory-efficient building block that mitigates redundancy in attention computation to produce a high-speed ViT.

Finally, regarding robustness, several techniques were presented to make ViTs robust to adversarial attacks (such as architectural backdoors) and to corruptions such as noise and blur.

Generative models

Image generation and editing tasks have never stopped growing in popularity since the inception of Generative Adversarial Networks (GANs). In 2020, diffusion models opened the door to generating more diverse and high-quality images while preserving their semantic structure. Diffusion models dominated the field at CVPR 2023, and there is still a lot of potential for further exploration in this area. We saw exciting work on increasing controllability over the generation process, such as DreamBooth and Guided Diffusion Models by Google Research. Moreover, extending diffusion models to video generation, as in VideoFusion, is another exciting research direction.
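
For background, the core of these models is a simple forward noising process that a network is trained to invert. Below is a minimal sketch of the standard DDPM-style forward step and training target; it is illustrative only, not the method of any specific paper above, and the schedule values and the `model` network are assumptions.

```python
# Sketch of the DDPM-style forward (noising) process and training target,
# independent of any specific paper mentioned above.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # illustrative noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative product \bar{alpha}_t

def add_noise(x0, t):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I)."""
    noise = torch.randn_like(x0)
    abar = alphas_bar[t].view(-1, 1, 1, 1)
    xt = abar.sqrt() * x0 + (1.0 - abar).sqrt() * noise
    return xt, noise

# Training step, given some noise-prediction network `model` (assumed here):
# t = torch.randint(0, T, (batch_size,))
# xt, eps = add_noise(x0, t)
# loss = ((model(xt, t) - eps) ** 2).mean()
```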

Foundation Models & Visual Prompting

Just as large language models that decode semantic meaning from textual input have become foundation models, pre-trained vision models now serve as fundamental building blocks for cutting-edge technologies like autonomous vehicles, robotics, healthcare, and more.

In the healthcare domain, Google DeepMind was one of the pioneers, presenting REMEDIS, a unified self-supervised learning framework for building foundation medical AI that addresses key translational challenges in medical AI, such as generalization, reliability, and interactivity. 3D scene understanding was also seen as a promising domain for incorporating foundation models for vision, graphics, and robotics, despite fundamental challenges: collecting large-scale 3D datasets, limited annotation resources, and the limited scale of 3D interaction and reasoning data.

Autonomous driving also benefited from foundational models, a fact highlighted in Phil Duan’s presentation. Alongside these highlights, several other fascinating papers captured the attention of attendees. MELTR (Meta Loss Transformer for Learning to Fine-tune Video Foundation Models) demonstrated an innovative approach to fine-tuning video foundation models using meta-loss transformers. Additionally, the Integrally Pre-Trained Transformer Pyramid Networks paper introduced an intriguing methodology that leverages pre-training to enhance the performance of transformer pyramid networks.

Visual prompting, in particular, enables foundation models to generate predictions without any fine-tuning or weight updates. At inference time, relevant prompts (e.g., bounding boxes) are fed to the model to guide it toward the desired output. However, building promptable models and training them to be adaptable and responsive to control inputs is still an open challenge.

Recently, Meta AI released the Segment Anything Model (SAM) as the first promptable foundation model for image segmentation. At this conference, GLIGEN was introduced as a promptable image generation model. We also saw that companies such as Landing AI have started using visual prompting as an interactive data-labelling tool for tasks such as detection and segmentation, since it is much faster and easier than manual labelling.
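
To give a flavour of visual prompting in practice, here is roughly what prompting SAM with a bounding box looks like, based on Meta AI's segment-anything repository; the checkpoint name is a placeholder and the exact signatures should be verified against the repository.

```python
# Approximate usage of Meta AI's Segment Anything Model (SAM) with a box prompt.
# Checkpoint file name and exact argument names should be verified against
# https://github.com/facebookresearch/segment-anything
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # assumed local checkpoint
predictor = SamPredictor(sam)

image = np.zeros((480, 640, 3), dtype=np.uint8)   # placeholder for an RGB image
predictor.set_image(image)                        # compute the image embedding once

box = np.array([100, 100, 300, 300])              # prompt: x1, y1, x2, y2
masks, scores, _ = predictor.predict(box=box, multimask_output=True)
# `masks` holds candidate segmentation masks for the prompted region.
```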

Multimodal models

Since the invention of the ViT, multimodal models have become increasingly popular in the field. Finding a joint embedding space across different modalities such as image, text, and audio not only results in powerful models but also enables a variety of novel cross-modal applications. ImageBind by Meta AI was introduced as the first multimodal model that binds six modalities in one embedding space. Moreover, we saw an impressive series of works building on OpenAI's Contrastive Language-Image Pre-Training (CLIP).
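
As an example of what a joint image-text embedding space enables, here is a common zero-shot classification recipe built on CLIP via the Hugging Face transformers library; the model identifier, placeholder image, and labels are assumptions for illustration, not taken from any of the papers above.

```python
# Zero-shot classification with CLIP via Hugging Face transformers
# (a common recipe, not a method from any CVPR paper above).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))               # placeholder image
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)   # image-text similarity -> probabilities
print(dict(zip(labels, probs[0].tolist())))
```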

Neural Radiance Field (NeRF)

Generating highly realistic 3D scenes is a key step toward creating fully virtual worlds (e.g., the Metaverse). Hence, synthesizing 3D scenes or objects from a set of 2D images using Neural Radiance Fields (NeRF) has been popular since 2020. There have been many improvements to NeRF that make it a powerful tool for content generation, with efficiency, scalability, and fidelity being common threads. For example, a highlight paper, MobileNeRF, introduced a new NeRF representation based on textured polygons to optimize NeRF for mobile devices. F2-NeRF was another interesting work that not only reduced computational complexity but also added flexibility in how the input images are captured. Other research in this area includes generating 3D content from a sparse set of images and handling dynamic scenes.
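
At the heart of NeRF is a simple volume rendering rule: each sample along a camera ray contributes its colour weighted by its opacity and by the transmittance of everything in front of it. A minimal sketch of that compositing step (generic, not from any specific paper above):

```python
# Sketch of NeRF's volume rendering rule along one camera ray:
#   colour = sum_i T_i * (1 - exp(-sigma_i * delta_i)) * c_i,
# where T_i is the transmittance accumulated before sample i.
import torch

def render_ray(sigmas, colors, deltas):
    """sigmas: (N,) densities, colors: (N, 3), deltas: (N,) sample spacings."""
    alphas = 1.0 - torch.exp(-sigmas * deltas)            # per-sample opacity
    trans = torch.cumprod(1.0 - alphas + 1e-10, dim=0)    # transmittance after each sample
    trans = torch.cat([torch.ones(1), trans[:-1]])        # T_i depends only on earlier samples
    weights = trans * alphas
    return (weights.unsqueeze(-1) * colors).sum(dim=0)    # composited RGB for this ray

rgb = render_ray(torch.rand(64), torch.rand(64, 3), torch.full((64,), 0.03))
```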

Responsible AI

As AI technologies advance and become more pervasive, the focus on responsible AI practices becomes paramount. Responsible AI encompasses principles like transparency, fairness, and privacy to ensure the ethical deployment of AI systems. While CVPR 2023 showcased remarkable advancements, it also shed light on the ongoing need to address privacy concerns in emerging technologies. Examples such as Meta's Project Aria, capable of recording activities within one's home, or autonomous cars (with companies like Tesla, Zoox, etc.) capturing data on individuals in public spaces, raised important questions regarding the responsible use and protection of personal information. At CVPR 2023, researchers and companies demonstrated a strong awareness of these privacy concerns, engaging with experts from various disciplines to find solutions. This collaborative effort aims to foster a comprehensive understanding of technological advancements and their impact on individuals, ultimately safeguarding privacy and safety for all.

Reinforcement Learning and Vision

As one of the world leaders in Reinforcement Learning (RL), Amii was particularly interested and excited to explore the synergy between RL and vision-based applications. One standout example of this combination was GAIA-1, an autonomous driving model developed by Wayve. By integrating RL techniques with video, text, and action inputs, GAIA-1 generated realistic driving videos while demonstrating fine-grained control over ego-vehicle behaviour and scene features. Its ability to predict multiple plausible futures and extrapolate to novel scenarios showcased the model's groundbreaking capabilities in safety evaluation and corner case handling.

Moreover, vision-based RL emerged as a captivating field, seamlessly integrating RL with visual observations to derive state representations directly from images or video frames. Self-Supervised Paired Similarity Representation Learning (PSRL), a notable contribution, addressed the challenge of capturing both global and local spatial structures in vision-based RL. The conference also included exciting research in RL applied to computer vision. A black-box model inversion attack demonstrated how RL agents could reconstruct private data used to train ML models, achieving state-of-the-art attack performance. Furthermore, ESPER (Extending Sensory PErception with Reinforcement learning) empowered text-only pre-trained models to tackle multimodal tasks like visual commonsense reasoning, outperforming previous approaches and providing a benchmark dataset (ESPDataset) for comprehensive evaluation.
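
As a rough illustration of the vision-based RL setup (a generic pattern, not PSRL or any specific paper above), a convolutional encoder turns raw frames into a compact state representation that a small policy head maps to actions:

```python
# Generic sketch of a vision-based RL policy: a CNN encoder converts image
# observations into features, and a small head produces action logits.
# This is a common pattern, not the method of any paper mentioned above.
import torch
import torch.nn as nn

class PixelPolicy(nn.Module):
    def __init__(self, n_actions=4):
        super().__init__()
        self.encoder = nn.Sequential(              # stack of 4 84x84 frames -> features
            nn.Conv2d(4, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.policy = nn.Sequential(nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
                                    nn.Linear(512, n_actions))

    def forward(self, obs):                        # obs: (B, 4, 84, 84), values in [0, 1]
        return self.policy(self.encoder(obs))      # action logits

logits = PixelPolicy()(torch.rand(1, 4, 84, 84))
```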

The synergy between RL and computer vision drives breakthroughs in diverse applications, revolutionizing industries and inspiring future advancements. This collaboration unlocks exciting possibilities and novel solutions to complex challenges in the real world.
