
Advancing Core Components of Robotic Manipulation: New Methods for 3D Perception, Semantic Understanding, and Policy Learning

ROSASCO, ANDREA
2026-04-20

Abstract

Robotic manipulation in unstructured, human-centric environments presents a significant challenge for autonomous systems. A robot must be able to semantically perceive the environment, understand its 3D structure, and generate effective motions to complete a task. Each of these stages corresponds to a distinct research area with its own set of challenges. The goal of this thesis is to study existing approaches, identify their weaknesses, and propose novel state-of-the-art algorithms. An important part of the evaluation of the developed algorithms is their deployment on robotic platforms. The thesis is structured around three key components of robotic manipulation: 3D perception, semantic understanding, and motion generation. For each component, I identified the limitations of existing methods with respect to the target evaluation setting. I then developed novel data-driven algorithms and benchmarks to overcome these limitations and advance the state of the art in each respective area. The research conducted has led to three main contributions. First, to address the challenge of 3D perception from partial sensor data, I developed a confidence-guided shape completion algorithm. This method leverages a transformer-based HyperNetwork architecture to reconstruct objects at arbitrary resolutions while producing a confidence estimate over its output. Second, to improve semantic perception, particularly for novel or user-specified objects, I collaborated on the creation of ConCon-Chi, a benchmark for personalized vision-language tasks. This benchmark challenges existing models to learn new concepts from a few images and to compose them with known contexts, providing a robust framework for assessing concept-context compositionality. Third, to enhance robot motion generation, I proposed KDPE, a kernel density estimation-based strategy to improve the safety and reliability of policies learned via behavior cloning. KDPE filters out non-representative action trajectories generated by diffusion models at inference time, improving performance and stability both in simulation and on real-world robotic platforms. The algorithms proposed throughout this thesis were integrated into modular and end-to-end manipulation systems and extensively tested through demonstrations and experimental evaluations. This practical validation allowed me to gather insights into the trade-offs of each approach across dimensions such as robustness, interpretability, computational cost, and flexibility. Ultimately, this thesis provides a validated set of algorithms and benchmarks that can be used to build more capable and reliable robotic systems for complex manipulation tasks.
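The core idea behind a density-based action filter like KDPE can be sketched as follows. This is a minimal illustration, not the thesis implementation: it assumes a batch of candidate action trajectories sampled from a diffusion policy, fits a kernel density estimate over the batch (here SciPy's `gaussian_kde`), and discards the low-density, non-representative samples. The function name `kde_filter_actions` and the `keep_fraction` parameter are hypothetical.

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_filter_actions(trajectories, keep_fraction=0.5):
    """Keep only the trajectories with the highest density under a KDE
    fitted to the whole batch (a sketch of density-based filtering)."""
    # trajectories: (n_samples, horizon, action_dim)
    n = trajectories.shape[0]
    flat = trajectories.reshape(n, -1)        # one flat vector per trajectory
    kde = gaussian_kde(flat.T)                # fit KDE over the batch
    scores = kde(flat.T)                      # density of each sample
    keep = max(1, int(n * keep_fraction))
    idx = np.argsort(scores)[::-1][:keep]     # highest-density first
    return trajectories[idx], scores[idx]

# Toy batch: 64 noisy trajectories around a nominal path, plus 4 outliers
# standing in for non-representative diffusion samples.
rng = np.random.default_rng(0)
base = np.linspace(0.0, 1.0, 8)[None, :, None]   # horizon 8, action_dim 1
batch = base + 0.05 * rng.standard_normal((64, 8, 1))
batch[:4] += 2.0                                  # shift 4 outliers away
kept, scores = kde_filter_actions(batch, keep_fraction=0.5)
```

After filtering, the retained half of the batch excludes the shifted outliers, and the robot would execute (for example) the highest-density trajectory.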

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11567/1293858