Top 3 Breakthroughs in Computer Vision in 2024


Introduction

Over the past few years, AI has surged in popularity, and the pace of progress keeps accelerating. 2024 was a crucial year for Computer Vision, with new models and breakthroughs arriving rapidly. Today, I’ll be talking about the 3 biggest advancements in the Computer Vision field during 2024.

3. Vision Language Modeling

Simply put, Vision Language Models, or VLMs for short, are Large Language Models (LLMs) that take an image and/or text as input. Think of ChatGPT, where you upload an image and then type out a question you have about it. The model understands both and outputs an answer. VLMs are also what you would call a “multimodal model,” which is a model that can understand multiple different types of inputs.
Although Vision Language Models have existed for a while, they have recently become better and more accurate. However, this doesn’t mean it’s all sunshine and rainbows: there are still several barriers (such as the high cost of training) that prevent researchers from pursuing these models further. Thus, I put VLMs at the third spot in my ranking.
An example of a VLM. Credit: Noyan and Beeching on Hugging Face
Read the paper by Bordes et al. (2024) on the state of VLMs (how to train VLMs, positive advancements, and issues in the models).

2. SAM 2

SAM 2 (Segment Anything Model 2) is Meta’s segmentation model. Segmentation is a model’s ability to pick out the exact pixels that belong to a certain object within an image or video. For example, if I wanted to pick out a dandelion from a lush green scenery, SAM 2 would be the model to do that. Even if it wasn’t previously trained on that kind of object, it can still segment the object accurately.
Types of Image Segmentation. Credit: Mindy Support
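To make the idea of a segmentation mask more concrete, here is a tiny Python sketch. It is not how SAM 2 works internally: I swapped in a simple flood fill as a toy stand-in, and the grid, the letters "G" (grass) and "D" (dandelion), and the function name segment_from_click are all made up for illustration. The point is only to show the shape of the task: you click on an object, and you get back a mask marking which pixels belong to it.

```python
from collections import deque

def segment_from_click(image, click):
    """Toy stand-in for click-prompted segmentation (NOT SAM 2's method).

    Flood-fills outward from the clicked pixel and returns a grid of
    booleans: True where the pixel belongs to the clicked object.
    """
    rows, cols = len(image), len(image[0])
    r0, c0 = click
    target = image[r0][c0]  # the "color" of the clicked pixel
    mask = [[False] * cols for _ in range(rows)]
    mask[r0][c0] = True
    queue = deque([(r0, c0)])
    while queue:
        r, c = queue.popleft()
        # Visit the four direct neighbors (down, up, right, left).
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and not mask[nr][nc] and image[nr][nc] == target:
                mask[nr][nc] = True
                queue.append((nr, nc))
    return mask

# Hypothetical 4x5 "image": G = grass, D = dandelion.
IMAGE = [
    "GGGGG",
    "GGDDG",
    "GDDDG",
    "GGGGG",
]

mask = segment_from_click(IMAGE, click=(1, 2))  # click on the dandelion
for row in mask:
    print("".join("#" if pixel else "." for pixel in row))
```

The printed grid marks the five dandelion pixels with "#" and everything else with ".". A real model like SAM 2 produces the same kind of output (a per-pixel mask), but it uses a neural network, so it can handle real photos where an object is not a single flat color.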

Breakthroughs: better accuracy, less need for interactions, and faster output times

Try the demo!

1. YOLO v11

YOLO models have been around for a long time, but the most recent version improves on the previous ones. YOLO, short for “You Only Look Once,” is primarily an object detection model that can locate and classify the objects in an image after one look (one pass through the network). Object detection models are sort of like segmentation models, except object detection isn’t about tracing the exact outline of an object; it’s about marking roughly where the object is by enclosing it in a shape (rectangles and squares are the most common ones).
Difference between Image Recognition and Object Detection. Credit: SmartTek Solutions
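The difference between a segmentation mask and a detection box can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the small grid of 0s and 1s and the helper names mask_to_box and box_area are hypothetical, and no YOLO code is involved. It shows that a box is a coarser description of an object than a mask, because the box also covers pixels that are not part of the object.

```python
def mask_to_box(mask):
    """Return the tightest (top, left, bottom, right) box around the 1-pixels.

    Coordinates are inclusive. Returns None if the mask is empty.
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return (rows[0], cols[0], rows[-1], cols[-1])

def box_area(box):
    """Number of pixels covered by an inclusive (top, left, bottom, right) box."""
    top, left, bottom, right = box
    return (bottom - top + 1) * (right - left + 1)

# Hypothetical mask: 1 = pixel belongs to the object, 0 = background.
MASK = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]

box = mask_to_box(MASK)
object_pixels = sum(sum(row) for row in MASK)

print("box:", box)                      # box: (1, 1, 2, 3)
print("box area:", box_area(box))       # box area: 6
print("object pixels:", object_pixels)  # object pixels: 5
```

Here the box covers 6 pixels while the object itself is only 5, so one background pixel is included in the box. That small gap is the "general gist" trade-off: boxes are less precise than masks, but they are simpler to describe and compare.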

Breakthroughs: better accuracy/precision, expanded application for Computer Vision, improved architecture, improvement in speed and performance, and advanced feature extraction.

Fun Fact: YOLO v11 can now also do instance segmentation, classification, pose estimation (understanding the pose of a human), oriented object detection (boxes rotated to fit the object more tightly), and object tracking.
Read the paper on arXiv!

Works Referenced

These are the works I referenced while researching this article that were not mentioned above.
“What are vision language models (VLMs)?” by Caballar and Stryker (IBM).
“How does YOLO work for object detection?” by GeeksForGeeks (last updated 01 Jul, 2024).
“SAM 2: Segment Anything in Images and Videos” by Ravi et al. (arXiv).