Active Object Recognition
Object classification is an essential supporting capability for autonomous agents that need to move and interact effectively in any environment. However, realistic scenes often contain severe clutter, which dramatically degrades the performance of current object classification methods. This project addresses that degradation with an active vision approach that improves the accuracy of 3D object classification through a next-best-view (NBV) paradigm. The next camera motion is chosen by criteria that aim to avoid object self-occlusions while exploring as much of the surrounding area as possible.
This was a collaborative research project with PAVIS/VGM, IIT and the University of Verona.
Active 3D classification of multiple objects in cluttered scenes
Y. Wang, M. Carletti, F. Setti, M. Cristani, A. Del Bue, ICCV-W, Seoul, KR, Oct 2019 [Paper]
The overall system works as follows. At each time step, the sensor acquires a depth map of the scene, and we isolate the foreground by first truncating the depth at a predefined distance and then removing the plane on which the objects lie (known a priori).
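The foreground isolation step can be sketched as below. This is a minimal illustration, not the project's implementation: it assumes the depth map has already been back-projected to 3D points in the camera frame, that the support plane is given as a normal and an offset with the normal pointing towards the objects, and the names `isolate_foreground`, `max_depth` and `plane_tol` as well as their default values are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def isolate_foreground(
    points: List[Point],
    plane_normal: Point,
    plane_offset: float,
    max_depth: float = 1.5,   # assumed truncation distance along the optical axis
    plane_tol: float = 0.01,  # assumed thickness of the band removed around the plane
) -> List[Point]:
    """Keep points within max_depth that lie clearly above the support plane.

    The plane is n . p + d = 0, with n pointing towards the objects
    (an assumption of this sketch).
    """
    nx, ny, nz = plane_normal
    norm = (nx * nx + ny * ny + nz * nz) ** 0.5
    foreground = []
    for x, y, z in points:
        # Step 1: truncate everything beyond the predefined distance.
        if z > max_depth:
            continue
        # Step 2: drop points on (or below) the known support plane.
        signed_dist = (nx * x + ny * y + nz * z + plane_offset) / norm
        if signed_dist > plane_tol:
            foreground.append((x, y, z))
    return foreground
```

For example, with a table at y = 0.5 in a camera frame whose y axis points down, the plane is described by the normal (0, -1, 0) and offset 0.5, so that points above the table have a positive signed distance.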
The foreground is then segmented into object candidates, and each segment is processed by a PointNet model to generate class candidates. For each segment, we keep the top class candidates for further refinement in the pipeline. Given the segmented point cloud and the class candidates of each segment, we use DenseFusion to provide an initial pose estimate. We then use the Iterative Closest Point (ICP) algorithm to perform geometric refinement among the top candidate labels and update the segment label as well as its pose.
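One plausible way to update the segment label after geometric refinement is to keep the candidate whose aligned model fits the segment best. The sketch below assumes exactly that: the selection criterion (lowest mean nearest-neighbour distance) is an assumption rather than something stated above, the pose estimation and ICP steps themselves are omitted, and `mean_nn_distance` and `pick_label` are hypothetical helper names. Each candidate's model points are taken to be already transformed by its refined pose.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]


def mean_nn_distance(segment: List[Point], model: List[Point]) -> float:
    """Mean distance from each segment point to its nearest model point."""
    total = 0.0
    for p in segment:
        total += min(math.dist(p, q) for q in model)
    return total / len(segment)


def pick_label(
    segment: List[Point],
    aligned_candidates: Dict[str, List[Point]],
) -> Tuple[str, float]:
    """Return the candidate label with the smallest alignment residual.

    aligned_candidates maps each top class label to its model points,
    already placed in the scene by the refined pose (assumed given).
    """
    residuals = {
        label: mean_nn_distance(segment, model)
        for label, model in aligned_candidates.items()
    }
    best = min(residuals, key=residuals.get)
    return best, residuals[best]
```

The brute-force nearest-neighbour search is quadratic in the number of points; a spatial index such as a k-d tree would be the usual choice for real point clouds.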
Lastly, we select the NBV that maximises the visibility of all objects (accounting for occlusions) while avoiding already visited areas. We approximate the visibility by projecting the corners of the bounding cuboid of each segment onto the image plane. To account for occlusions, we project the segments sequentially with a z-buffer, from the closest to the farthest from the sensor. The number of pixels within the convex hull bounded by the projected corners of the object is used to approximate the visibility of each segment. We keep moving the sensor until a fixed number of steps is reached.
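The visibility approximation can be illustrated with a small sketch. It assumes a pinhole camera with focal length `f` and principal point `(cx, cy)`, cuboid corners expressed in the camera frame with positive depth, and a simple boolean occupancy mask standing in for the z-buffer; segments are ordered by their nearest corner, which is one possible reading of "closest to farthest". All function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Vec2 = Tuple[float, float]
Vec3 = Tuple[float, float, float]


def project(p: Vec3, f: float, cx: float, cy: float) -> Vec2:
    """Pinhole projection of a camera-frame point (z > 0 assumed)."""
    x, y, z = p
    return (f * x / z + cx, f * y / z + cy)


def cross(o: Vec2, a: Vec2, b: Vec2) -> float:
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(pts: Sequence[Vec2]) -> List[Vec2]:
    """Andrew's monotone chain; returns hull vertices in a consistent winding."""
    pts = sorted(set(pts))
    if len(pts) <= 2:
        return list(pts)
    lower: List[Vec2] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Vec2] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def inside(hull: List[Vec2], q: Vec2) -> bool:
    """True if q lies inside (or on the border of) the convex polygon."""
    n = len(hull)
    return n >= 3 and all(
        cross(hull[i], hull[(i + 1) % n], q) >= 0 for i in range(n)
    )


def visible_pixels(
    cuboids: List[List[Vec3]], f: float, cx: float, cy: float,
    width: int, height: int,
) -> List[int]:
    """Visible pixel count per cuboid, returned in input order."""
    # Process segments from nearest to farthest (by closest corner depth).
    order = sorted(range(len(cuboids)), key=lambda i: min(c[2] for c in cuboids[i]))
    taken = [[False] * width for _ in range(height)]  # pixels claimed by nearer segments
    counts = [0] * len(cuboids)
    for i in order:
        hull = convex_hull([project(c, f, cx, cy) for c in cuboids[i]])
        for v in range(height):
            for u in range(width):
                # Test the pixel centre; skip pixels already occluded.
                if not taken[v][u] and inside(hull, (u + 0.5, v + 0.5)):
                    taken[v][u] = True
                    counts[i] += 1
    return counts
```

Summing the returned counts for each candidate camera pose gives a per-view score; how that score is combined with the penalty for already visited areas is not specified here and is left out of the sketch.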
How multi-view information is aggregated also greatly affects object classification performance. Unlike existing methods that classify the complete point cloud obtained by first registering multi-view captures, we propose PointView-GCN, which uses multi-level Graph Convolutional Networks (GCNs) to hierarchically aggregate the shape features of single-view point clouds and thereby encode both the geometrical cues of an object and their multi-view relations. Experiments on our novel single-view datasets show that PointView-GCN produces a more descriptive global shape feature, which consistently improves classification accuracy by around 5% compared to classifiers with single-view point clouds, and outperforms state-of-the-art methods with complete point clouds on ModelNet40.
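To give a feel for graph-based aggregation of per-view features, the sketch below applies one generic graph convolution (mean aggregation over neighbouring views with self-loops, a linear map, then ReLU) followed by max pooling across views. It is not the PointView-GCN architecture: the view graph, the weight matrix, the normalisation and the pooling choice are all assumptions made for illustration, and the function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def graph_conv(features: Matrix, adjacency: Matrix, weight: Matrix) -> Matrix:
    """One graph convolution: ReLU(A_norm @ features @ weight).

    features:  one row per view (single-view shape feature)
    adjacency: 0/1 matrix linking neighbouring views
    weight:    learnable matrix in a real network, fixed here
    """
    n = len(features)
    # Add self-loops, then row-normalise so each view averages itself
    # and its neighbours.
    a_hat = [
        [adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    a_norm = [[v / sum(row) for v in row] for row in a_hat]
    mixed = matmul(matmul(a_norm, features), weight)
    return [[max(0.0, v) for v in row] for row in mixed]


def global_feature(view_features: Matrix) -> List[float]:
    """Max-pool across views to obtain a single global shape descriptor."""
    return [max(col) for col in zip(*view_features)]
```

Stacking several such layers on progressively smaller view graphs is one way to obtain a hierarchical, multi-level aggregation; the pooled descriptor would then feed a classifier.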