Introduction
Over the past few years, AI has grown more and more popular, but no period matches the pace of the last year or two. 2024 was an especially important year for Computer Vision, with new models and breakthroughs arriving rapidly. Today, I’ll be talking about the 3 biggest advancements in the Computer Vision field during 2024.
3. Vision Language Modeling
Simply put, Vision Language Models, or VLMs for short, are Large Language Models (LLMs) that take an image and/or text as input. Think of ChatGPT, where you upload an image and then type out a question you have about the image. The model can understand both and output an answer. VLMs are also what you would call a “multimodal model,” meaning a model that can understand multiple different types of inputs.
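To make that flow concrete, here is a minimal sketch of the image-plus-question workflow using the Hugging Face transformers library. The visual-question-answering pipeline and the small dandelin/vilt-b32-finetuned-vqa checkpoint are just one illustrative choice I’m assuming here, and are far smaller than the chat-style VLMs behind something like ChatGPT:

```python
# Minimal visual question answering sketch: image + text question in, text answer out.
# Assumes the `transformers` package is installed; the checkpoint is one small example.
from transformers import pipeline

# Load a lightweight VQA pipeline (the model downloads on first run).
vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

# Ask a question about a local image file (a URL or PIL image also works).
answers = vqa(image="street_scene.jpg", question="How many cars are in the picture?")

# The pipeline returns candidate answers ranked by confidence.
print(answers[0])  # e.g. {'answer': '2', 'score': 0.87}
```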
Although Vision Language Models have existed for a while, they have recently become notably better and more accurate. However, this doesn’t mean it’s all sunshine and rainbows: there are still several barriers (such as the high cost of training) that keep researchers from pursuing these models further. Thus, I put this at the third spot in my ranking.
An example of a VLM. Credit: Noyan and Beeching on Hugging Face
Read a paper from Bordes et al. 2024 on the state of VLMs (how to train VLMs, positive advancements, and issues in the models).
2. SAM 2
SAM 2 (Segment Anything Model 2) is Meta’s segmentation model. Segmentation is the ability of a model to pick out specific objects within an image or video. For example, if I wanted to pick out a dandelion from a lush green scenery, SAM 2 would be the model to do that. Even if it wasn’t trained on a particular object beforehand, it can still accurately segment it.
Types of Image Segmentation. Credit: Mindy Support
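To give a feel for how segmenting that dandelion might look in code, here is a rough sketch based on the image-predictor interface in Meta’s sam2 repository. The package layout, the facebook/sam2-hiera-large checkpoint name, the click coordinates, and the file name are all assumptions about one typical setup:

```python
# Sketch: prompt SAM 2 with a single click point and keep the best mask.
# Assumes Meta's `sam2` package is installed and the checkpoint can be downloaded.
import numpy as np
import torch
from PIL import Image
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Load a pretrained SAM 2 image predictor.
predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")

image = np.array(Image.open("meadow.jpg").convert("RGB"))

with torch.inference_mode():
    predictor.set_image(image)                   # embed the image once
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[420, 310]]),     # (x, y) click on the dandelion
        point_labels=np.array([1]),              # 1 = this point is foreground
        multimask_output=True,                   # return a few candidate masks
    )

# The best mask is a boolean array with the same height/width as the image.
best_mask = masks[np.argmax(scores)]
print(best_mask.shape, float(scores.max()))
```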
Breakthroughs: better accuracy, fewer user interactions (clicks/prompts) needed, and faster output times
Try the demo!
1. YOLO v11
YOLO models have been around for a long time, but the most recent version improves on the previous ones. YOLO, short for “You Only Look Once,” is primarily an object detection model that can identify and classify the objects in an image after one look (one pass through the network). Object Detection models are sort of like segmentation models, except Object Detection isn’t about tracing out a specific object exactly; it’s about getting the general gist of where it is inside a shape (rectangles and squares are the most common ones).
Difference between Image Recognition and Object Detection. Credit: SmartTek Solutions
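As a rough code illustration of that one-pass idea, here is what running a small detector typically looks like with the Ultralytics Python package. The package, the yolo11n.pt weights name, and the image path are assumptions about one common setup, not the only way to run YOLO:

```python
# Sketch: single-pass object detection with a small pretrained YOLO model.
# Assumes the `ultralytics` package is installed (pip install ultralytics).
from ultralytics import YOLO

# Load a small pretrained detection model (weights download on first use).
model = YOLO("yolo11n.pt")

# One forward pass over the image yields every detection at once.
results = model("street_scene.jpg")

# Each detection is a bounding box plus a class label and a confidence score.
for box in results[0].boxes:
    label = model.names[int(box.cls)]
    x1, y1, x2, y2 = box.xyxy[0].tolist()
    print(f"{label}: conf={float(box.conf):.2f}, box=({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")
```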
Breakthroughs: better accuracy/precision, expanded applications for Computer Vision, improved architecture, improvements in speed and performance, and advanced feature extraction.
Fun Fact: YOLO v11 can now also do instance segmentation, classification, pose estimation (understanding the pose of a human), Oriented Object Detection (detecting objects in rotated boxes that fit them more tightly), and Object Tracking.
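If the usual Ultralytics naming holds, switching between those tasks is mostly a matter of loading different task-specific weights; the checkpoint names below follow the published convention but should be treated as assumptions:

```python
# Sketch: the same YOLO interface covers other tasks via task-specific weights.
from ultralytics import YOLO

seg_model = YOLO("yolo11n-seg.pt")    # instance segmentation
pose_model = YOLO("yolo11n-pose.pt")  # human pose estimation
obb_model = YOLO("yolo11n-obb.pt")    # oriented (rotated) bounding boxes
cls_model = YOLO("yolo11n-cls.pt")    # whole-image classification

# Object tracking links detections across video frames and assigns each object an ID.
tracked = seg_model.track("clip.mp4")
```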
Read the paper on arXiv!
Works Referenced
These were the works I referenced while researching for this article that were not mentioned above.
“What are vision language models (VLMs)?” by Caballar and Stryker (IBM).
“How does YOLO work for object detection?” by GeeksForGeeks (last updated 01 Jul, 2024).
“SAM 2: Segment Anything in Images and Videos” by Ravi et al. (arXiv).