Alexander Martin

I am a first-year Ph.D. student at Johns Hopkins University, advised by Dr. Ben Van Durme. I am broadly interested in natural language processing and computer vision, especially in advancing end-to-end AI-assisted article writing and reasoning over multimodal content. My current research focuses on generating text (articles) grounded in both text and videos. I have published on:

Retrieving information from videos in multilingual, real-world settings with efficient retrieval models.
Grounding information in cross-document and video-text settings.
Writing articles from multiple videos, documents, and cross-document sources.

Before Johns Hopkins, I received my B.S. from the University of Rochester, where I was advised by Dr. Jiebo Luo and Dr. Aaron Steven White.

[Resume]

news

Feb 26, 2025	2/2 for papers at CVPR 2025!
Aug 26, 2024	Starting Ph.D. at JHU

selected publications

WikiVideo: Article Generation from Multiple Videos

Alexander Martin, Reno Kriz, William Gantt Walden, Kate Sanders, Hannah Recknor, Eugene Yang, Francis Ferraro, and Benjamin Van Durme

2025

Video-ColBERT: Contextualized Late Interaction for Text-to-Video Retrieval

Arun Reddy^*, Alexander Martin^*, Eugene Yang, Andrew Yates, Kate Sanders, Kenton Murray, Reno Kriz, Celso M Melo, Benjamin Van Durme, and Rama Chellappa

In IEEE Conference on Computer Vision and Pattern Recognition, Jun 2025

In this work, we tackle the problem of text-to-video retrieval (T2VR). Inspired by the success of late interaction techniques in text-document, text-image, and text-video retrieval, our approach, Video-ColBERT, introduces a simple and efficient mechanism for fine-grained similarity assessment between queries and videos. Video-ColBERT is built upon three main components: a fine-grained spatial and temporal token-wise interaction, query and visual expansions, and a dual sigmoid loss during training. We find that this interaction and training paradigm leads to strong individual, yet compatible representations for encoding video content. These representations lead to increases in performance on common text-to-video retrieval benchmarks compared to other bi-encoder methods.

MultiVENT 2.0: A Massive Multilingual Benchmark for Event-Centric Video Retrieval

Reno Kriz, Kate Sanders, David Etter, Kenton Murray, Cameron Carpenter, Kelly Van Ochten, Hannah Recknor, Jimena Guallar-Blasco, Alexander Martin, Ronald Colaianni, Nolan King, Eugene Yang, and Benjamin Van Durme

In IEEE Conference on Computer Vision and Pattern Recognition, Jun 2025

Efficiently retrieving and synthesizing information from large-scale multimodal collections has become a critical challenge. However, existing video retrieval datasets suffer from scope limitations, primarily focusing on matching descriptive but vague queries with small collections of professionally edited, English-centric videos. To address this gap, we introduce MultiVENT 2.0, a large-scale, multilingual event-centric video retrieval benchmark featuring a collection of more than 218,000 news videos and 3,906 queries targeting specific world events. These queries specifically target information found in the visual content, audio, embedded text, and text metadata of the videos, requiring systems to leverage all these sources to succeed at the task. Preliminary results show that state-of-the-art vision-language models struggle significantly with this task, and while alternative approaches show promise, they are still insufficient to adequately address this problem. These findings underscore the need for more robust multimodal retrieval systems, as effective video retrieval is a crucial step towards multimodal content understanding and generation.

Grounding Partially-Described Events in Multimodal Data

Kate Sanders, Reno Kriz, David Etter, Hannah Recknor, Alexander Martin, Cameron Carpenter, Jingyang Lin, and Benjamin Van Durme

In Conference on Empirical Methods in Natural Language Processing, Nov 2024

Organizing perceived change into events is a key element of human cognition, and so to understand data as humans do, AI systems must model events of human interest. While natural language enables straightforward ways to represent complex events, visual data does not facilitate analogous methods and, consequently, introduces unique challenges in event understanding. To tackle complex event modeling in multimodal settings, we introduce a multimodal formulation for arbitrarily complex events and cast the extraction of these events as a three-stage span retrieval task. We propose a corresponding benchmark for this task, MultiVENT Grounded, that consists of 14.5 hours of densely annotated current event videos and 1,168 text documents, containing 22.8K labeled event-centric entities. We propose a collection of LLM-driven approaches to the task of multimodal event analysis, and evaluate them on MultiVENT Grounded. Results illustrate the challenges that abstract event understanding in noisy content poses while also demonstrating promise in event-centric video-language systems.