Revisiting Dictionaries of Key Poses for Action Representation

Matteo Moro;Federico Figari Tomenotti;Nicoletta Noceti;Francesca Odone
2026-01-01

Abstract

Action understanding is a critical component of many computer vision applications, and traditional approaches rely predominantly on video data. More recent methods use 2D and 3D skeleton data to improve both speed and accuracy. Despite these advances, many models still fail to produce holistic pose representations that are transparent or semantically meaningful, because they either rely on end-to-end pipelines or focus on individual keypoints. In this paper, we present a novel representation methodology that revisits the concept of key body poses, inspired by the bag-of-words approach. Specifically, we create a dictionary of key poses and convert each action sequence into a sequence of key poses. We then explore two alternative strategies for action classification: one based on the classical bag-of-words, which focuses on the frequency of key poses, and another that considers the temporal ordering of key poses. We evaluate the effectiveness of these dictionary-based representations on the BABEL dataset, which includes 3D human keypoints and a large set of action labels. Our experimental results demonstrate that both strategies provide meaningful cues for action recognition, explicitly capturing action complexity by balancing detail and generalization.
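The two representations described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the dictionary is built with a plain k-means over flattened poses, frames are assigned to the nearest key pose by Euclidean distance, and the temporal representation simply collapses consecutive repeats. The function names are hypothetical.

```python
import numpy as np

def assign_key_poses(poses, centers):
    """Map each pose (row) to the index of its nearest key pose (Euclidean distance)."""
    poses = np.asarray(poses, dtype=float)
    centers = np.asarray(centers, dtype=float)
    dists = np.linalg.norm(poses[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def build_dictionary(poses, k, iters=20, seed=0):
    """Assumed dictionary construction: plain k-means over flattened poses."""
    poses = np.asarray(poses, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialise the key poses from k distinct training frames.
    centers = poses[rng.choice(len(poses), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = assign_key_poses(poses, centers)
        for j in range(k):
            members = poses[labels == j]
            if len(members) > 0:  # leave an empty cluster's center unchanged
                centers[j] = members.mean(axis=0)
    return centers

def bag_of_poses(labels, k):
    """Frequency-based representation: normalised histogram of key-pose indices."""
    hist = np.bincount(np.asarray(labels), minlength=k).astype(float)
    return hist / hist.sum()

def ordered_key_poses(labels):
    """Order-based representation: key-pose sequence with consecutive repeats collapsed."""
    labels = np.asarray(labels)
    keep = np.ones(len(labels), dtype=bool)
    keep[1:] = labels[1:] != labels[:-1]
    return labels[keep]

if __name__ == "__main__":
    # Synthetic stand-in for one sequence of 3D keypoints, flattened per frame.
    rng = np.random.default_rng(1)
    frames = rng.normal(size=(200, 75))  # e.g. 25 joints x 3 coordinates (assumed)
    dictionary = build_dictionary(frames, k=8)
    labels = assign_key_poses(frames, dictionary)
    print(bag_of_poses(labels, k=8))
    print(ordered_key_poses(labels))
```

The histogram discards ordering entirely, while the collapsed sequence keeps ordering but discards how long each key pose is held; either could be fed to a standard classifier, though which classifier the authors use is not stated in the abstract.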
ISBN: 9783032101914; 9783032101921
Attached file: ID-146-Moro-Matteo.pdf (pre-print, Adobe PDF, 1.7 MB, closed access)

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11567/1300337