Head pose estimation with uncertainty and an application to dyadic interaction detection

Figari Tomenotti, Federico; Noceti, Nicoletta; Odone, Francesca
2024-01-01

Abstract

Determining the visual focus of attention of people in a scene is a fundamental cue for understanding social interactions from videos. Gaze direction is ideal for determining eye contact, a basic cue of non-verbal communication, but it is not always easy to recognise. Head direction is a well-known proxy for gaze direction that is more robust to the variability of the scene, thus offering a valuable alternative. In this work, we consider HHP-net, a heteroscedastic neural network that estimates people’s head pose from single frames using a minimal set of head key points. We formulate the problem as a multi-task regression that predicts the pose as a triplet of Euler angles from the output of a 2D pose estimator. HHP-net also provides a measure of the aleatoric heteroscedastic uncertainty associated with each angle, through an ad-hoc loss function we introduce. In a thorough experimental analysis, we show that our model is efficient and effective compared with the state of the art, with a degradation of only 2 degrees in the worst case, counterbalanced by a model that occupies 12 times less space. We also show the beneficial effects of uncertainty on interpretability. Finally, we discuss the robustness of our method to input variability, showing that it can be used as a plug-in with different pose estimators. As a proof of concept, we address social interaction analysis with an algorithm to detect dyadic interactions in images.
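
To make the idea of heteroscedastic regression over three Euler angles concrete, the following is a minimal sketch of a standard Gaussian negative log-likelihood loss in which the network predicts, for each angle, both a value and a log-variance. This is an assumption for illustration only: the abstract does not specify the ad-hoc loss used by HHP-net, and the function name `heteroscedastic_loss` and its arguments are hypothetical.

```python
import math


def heteroscedastic_loss(y_true, y_pred, log_var):
    """Mean Gaussian negative log-likelihood (up to a constant) over the angles.

    y_true, y_pred: sequences of angles, e.g. (yaw, pitch, roll) in degrees.
    log_var: predicted log-variance for each angle (hypothetical network output).
    NOTE: a generic textbook formulation, not the loss defined in the paper.
    """
    total = 0.0
    for target, pred, s in zip(y_true, y_pred, log_var):
        # The squared error is down-weighted where the predicted variance is
        # high; the 0.5 * s term penalises inflating the variance everywhere.
        total += 0.5 * math.exp(-s) * (target - pred) ** 2 + 0.5 * s
    return total / len(y_true)


# Example: an error of 2 degrees on yaw, evaluated with unit variance.
loss = heteroscedastic_loss([10.0, 0.0, 0.0], [8.0, 0.0, 0.0], [0.0, 0.0, 0.0])
```

With a loss of this form, each angle gets its own uncertainty estimate, so a prediction with a large error can lower its loss by reporting a larger variance, which is what makes the output interpretable as a per-angle confidence.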
Files in this item:
File: CVIU_Tomenotti2024.pdf
Access: open access
Type: Post-print document
Size: 2.84 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11567/1173015