Revisiting Dictionaries of Key Poses for Action Representation
Matteo Moro;Federico Figari Tomenotti;Nicoletta Noceti;Francesca Odone
2026-01-01
Abstract
Action understanding is a critical component of many computer vision applications. Traditional approaches rely predominantly on video data, whereas more recent methods use 2D and 3D skeleton data to improve both speed and accuracy. Despite these advances, many models still fail to produce holistic pose representations that are transparent or semantically meaningful, because they either rely on end-to-end pipelines or focus on individual keypoints. In this paper, we present a novel representation methodology that revisits the concept of key body poses, inspired by the bag-of-words approach. Specifically, we build a dictionary of key poses and convert each action sequence into a sequence of key poses. We then explore two alternative strategies for action classification: one based on the classical bag-of-words, which considers the frequency of key poses, and another that considers their temporal ordering. We evaluate these dictionary-based representations on the BABEL dataset, which includes 3D human keypoints and a large set of action labels. Our experimental results show that both strategies provide meaningful cues for action recognition, explicitly capturing action complexity by balancing detail and generalization.

| File | Type | Access | Size | Format |
|---|---|---|---|---|
| ID-146-Moro-Matteo.pdf | Pre-print document | Closed access | 1.7 MB | Adobe PDF |
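The pipeline summarized in the abstract (build a dictionary of key poses, map each sequence to key poses, then describe it either by key-pose frequency or by key-pose order) can be illustrated with a minimal sketch. The abstract does not say how the dictionary is learned, so plain k-means with Euclidean distance is an assumption here, as are all function names, the flattened `(frames, D)` pose layout, and the choice to collapse consecutive repeats in the ordered representation. None of this should be read as the authors' implementation.

```python
import numpy as np


def assign_key_poses(poses, centers):
    """Return, for every frame, the index of the nearest key pose (Euclidean)."""
    poses = np.asarray(poses, dtype=float)
    centers = np.asarray(centers, dtype=float)
    dists = np.linalg.norm(poses[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)


def build_dictionary(poses, k, n_iter=20, seed=0):
    """Assumed dictionary learning: plain k-means over flattened poses.

    poses has shape (n_frames, D); the result has shape (k, D).
    """
    poses = np.asarray(poses, dtype=float)
    rng = np.random.default_rng(seed)
    centers = poses[rng.choice(len(poses), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign_key_poses(poses, centers)
        for j in range(k):
            members = poses[labels == j]
            if len(members):  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return centers


def bag_of_key_poses(labels, k):
    """Normalized histogram of key-pose frequencies; temporal order is ignored."""
    hist = np.bincount(np.asarray(labels), minlength=k).astype(float)
    return hist / hist.sum()


def ordered_key_poses(labels):
    """Collapse consecutive repeats so only the order of key poses remains."""
    labels = np.asarray(labels)
    keep = np.ones(len(labels), dtype=bool)
    keep[1:] = labels[1:] != labels[:-1]
    return labels[keep]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-in for skeleton data: 200 frames, 25 joints x 3 coordinates.
    frames = rng.normal(size=(200, 75))
    dictionary = build_dictionary(frames, k=8)
    sequence = assign_key_poses(frames[:40], dictionary)
    print(bag_of_key_poses(sequence, k=8))
    print(ordered_key_poses(sequence))
```

Either output could then feed a standard classifier: the histogram is a fixed-length vector, while the ordered sequence has variable length and would need a sequence model or a sequence-aware distance.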
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.



