Top 3 picks from ISMAR 2018

The R&D team of TWNKLS once again attended ISMAR, the leading international academic conference in the fields of Augmented, Virtual and Mixed Reality. It is the right place to catch a glimpse of the technologies that will power the AR applications of tomorrow. In this blog, we share the highlights we saw at ISMAR 2018.

Advanced remote assistance

With a relatively low investment and clear benefits, remote assistance/collaboration is certainly a big use case for enterprise AR. More and more companies are showing AR add-ons to standard “Skype”-style communication (audio-video plus chat). This starts with a simple image freeze from a video feed, where an expert can draw annotations on a single image and send it back to an operator. More mature solutions let the expert draw annotations over the live video from the operator, so that they stick to the actual objects. Depending on the tracking technology, the annotations either just trace the original target in 2D or are actually locked to a 3D map of the environment. Alternatively, an expert can use their hands to point at things in the operator’s view if the hands are captured by the expert’s own camera: the segmented hands are transmitted back and overlaid on top of the operator’s video.

All of these approaches have a common denominator: the expert examines the situation through the eyes/camera of the operator. However, a demo by DAQRI [1] showed that current technology allows for more flexibility. There are two novel aspects demonstrated in the video below. First, the expert gets a free viewpoint on the scene, independent of where the operator is currently looking. Secondly, actual objects in the scene are used to guide the operator. The prototype was shown on DAQRI Smart Glasses, which combine an optical see-through HMD with a pocket laptop that runs all the processing. Tracking of the HMD and mapping of the scene are done by RGBD SLAM, which uses the colour cameras and depth sensor on the HMD. The operator looking around the scene thus effectively creates a textured 3D model of what they see. Instead of just transmitting a standard video to the remote expert’s computer, the whole 3D mesh is transferred. This allows the expert to detach from the first-person view of the operator and examine the model freely from any perspective. They can make a better assessment and annotate the situation accurately in 3D, which is much simpler than working in the ‘unstable’ first-person view of the operator.
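
To make the shared-space idea concrete, here is a rough sketch (our own simplification, not DAQRI’s code) of the projection step that puts an expert’s 3D annotation back into the operator’s current view, given the HMD pose from SLAM and the camera intrinsics. The intrinsics and poses below are made-up numbers for illustration.

```python
import numpy as np

def project_annotation(point_world, T_world_cam, K):
    """Project a 3D annotation (shared map frame) into the operator's current image.

    point_world : (3,) point placed by the remote expert in the 3D model.
    T_world_cam : (4, 4) camera-to-world pose of the operator's HMD from SLAM.
    K           : (3, 3) pinhole intrinsics of the HMD colour camera.
    Returns pixel coordinates (u, v), or None if the point is behind the camera.
    """
    T_cam_world = np.linalg.inv(T_world_cam)                  # world -> camera
    p_cam = (T_cam_world @ np.append(point_world, 1.0))[:3]   # into the camera frame
    if p_cam[2] <= 0:
        return None                                           # behind the camera
    uvw = K @ p_cam
    return uvw[0] / uvw[2], uvw[1] / uvw[2]

# Hypothetical numbers: a 640x480 camera and an annotation 2 m in front of the map origin.
K = np.array([[525.0, 0.0, 320.0],
              [0.0, 525.0, 240.0],
              [0.0, 0.0, 1.0]])
print(project_annotation(np.array([0.1, 0.0, 2.0]), np.eye(4), K))
```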

A truly new idea is to use animations of the visible objects to complement typical hand-drawn instructions. This is possible because the dense scene model is available on the expert’s laptop. They can select an object with a simple stroke; a part of the model is then segmented based on colour similarity and geometric discontinuities. The resulting 3D mesh of the object is partial and has coarse edges due to the simple extraction method, but it is still effective to animate it along a trajectory to instruct the next action. The animation is sent back to the operator, who sees it in their HMD at the appropriate location. This approach requires on-the-fly streaming of textured mesh data, which is more demanding than a usual video stream. The demo worked over a local network, so some development is still required to make it a true remote collaboration tool.
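
The stroke-based object extraction can be imagined as simple region growing over the RGBD data: start from the stroke, expand while the colour stays similar to the seed and the depth stays continuous. The sketch below is our own toy version of that idea, not the demo’s actual segmentation.

```python
import numpy as np
from collections import deque

def grow_object_mask(rgb, depth, seed, color_tol=30.0, depth_tol=0.02):
    """Grow an object mask from a stroke seed pixel over an organised RGBD frame.

    Expansion stops when the colour differs too much from the seed pixel (colour
    similarity) or when the depth jumps between neighbours (geometric
    discontinuity). rgb: (H, W, 3) uint8, depth: (H, W) metres, seed: (row, col).
    """
    h, w = depth.shape
    mask = np.zeros((h, w), dtype=bool)
    seed_color = rgb[seed].astype(np.float32)
    queue = deque([seed])
    mask[seed] = True
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < h and 0 <= nc < w and not mask[nr, nc]:
                color_ok = np.linalg.norm(rgb[nr, nc] - seed_color) < color_tol
                depth_ok = abs(depth[nr, nc] - depth[r, c]) < depth_tol
                if color_ok and depth_ok:
                    mask[nr, nc] = True
                    queue.append((nr, nc))
    return mask
```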

The previous use case could be combined in the future with a recognition-plus-mapping approach such as the one presented by UCL [2]. It is a SLAM system based on an RGBD sensor, but combined with object recognition. In the following video, you can see that individual objects of certain categories (teddy bear, bottle, keyboard, etc.) are recognised as a user maps the scene. Moreover, they are individually tracked as they are moved around. This relaxes the common assumption of SLAM or VIO methods that the scene is static and only the camera is moving.
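
Conceptually, such a system keeps a separate model and 6-DoF pose for every recognised object instance alongside the background map. A minimal sketch of that bookkeeping (our own illustration, not the paper’s actual data structures) could look like this:

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class ObjectInstance:
    """One recognised, independently tracked object (e.g. 'keyboard')."""
    label: str
    T_world_obj: np.ndarray = field(default_factory=lambda: np.eye(4))  # 6-DoF pose
    surfels: list = field(default_factory=list)  # grows as new views reveal surface

@dataclass
class SceneModel:
    """Background map plus per-object models, so objects may move independently."""
    background: list = field(default_factory=list)
    objects: dict = field(default_factory=dict)  # instance id -> ObjectInstance

    def update_object_pose(self, obj_id, T_world_obj):
        # Called every frame by the per-object tracker; augmentations attached to
        # the object stay locked to it because they live in its local frame.
        self.objects[obj_id].T_world_obj = T_world_obj
```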

The object tracking provides a full six-degrees-of-freedom trajectory of each 3D model. The model also improves over time as the object is seen from different views and reveals new parts of its surface. Effectively, a KinectFusion-style 3D reconstruction [6] happens for the background scene and for each recognised object. The problem is how to identify which regions of the current image belong to the objects of interest. Unsurprisingly, deep learning is leveraged to provide per-pixel segmentation of individual objects: a Mask R-CNN network [3] identifies around 80 unique object categories. Segmentation masks are only available every few frames (the network runs at 5 fps). To deal with this speed limitation and with imprecise segmentation at object boundaries, the available masks are propagated to adjacent frames and refined in real time according to geometric discontinuities in the depth stream. The whole system runs in real time, but it requires two Nvidia GTX Titan X graphics cards. Nevertheless, a 3D scene representation with semantics and dynamics opens new AR possibilities. Information specific to an object category can be attached to an object automatically. As the object moves, the augmentations stay locked to it regardless of simultaneous camera motion. It is also possible to remove certain objects from the scene model by omitting their image regions from the mapping (for instance, complex dynamic objects like people).
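
One way to picture the mask refinement is as a flood fill that is allowed to spread within the propagated mask but never across a depth discontinuity. The snippet below is a simplified stand-in for that step, assuming an organised depth image in metres; the refinement in the actual system is more sophisticated.

```python
import numpy as np
from collections import deque

def geometric_edges(depth, jump=0.03):
    """Mark pixels where depth jumps by more than `jump` metres to a neighbour."""
    edges = np.zeros(depth.shape, dtype=bool)
    edges[:, 1:] |= np.abs(np.diff(depth, axis=1)) > jump
    edges[1:, :] |= np.abs(np.diff(depth, axis=0)) > jump
    return edges

def refine_mask(mask, depth, jump=0.03):
    """Trim a propagated segmentation mask at geometric discontinuities.

    Keeps only the part of the mask reachable from a seed pixel without
    crossing a depth edge (plain flood fill over the boolean mask).
    """
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        return mask
    edges = geometric_edges(depth, jump)
    h, w = mask.shape
    refined = np.zeros_like(mask, dtype=bool)
    seed = (rows[0], cols[0])                 # any pixel inside the mask
    queue = deque([seed])
    refined[seed] = True
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < h and 0 <= nc < w and mask[nr, nc]
                    and not refined[nr, nc] and not edges[nr, nc]):
                refined[nr, nc] = True
                queue.append((nr, nc))
    return refined
```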

Collaborative mapping

To continue on the topic of 3D scene reconstruction, a team from the University of Oxford presented an interesting paper and demo on collaborative large-scale dense 3D reconstruction [4]. A cheap and fast way of mapping larger spaces, such as factory floors or public buildings, is often desired for AR navigation or large-scale augmentations. The proposed method makes it possible to map whole buildings with consumer-grade hardware, and faster than was possible before. It leverages recent advances in VIO approaches, which can produce live reconstructions with low drift. Each user maps a part of the environment in parallel using an Asus ZenFone AR (a Google Tango device). These mobile clients are connected to a central mapping server on a laptop. Individual submaps are reconstructed by integrating the incoming posed RGBD frames into voxel grids.
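
For readers unfamiliar with voxel-grid integration, the sketch below shows the core of a truncated signed distance function (TSDF) update for a single posed depth frame, written in plain NumPy with a dense grid. Real systems use far more memory-efficient voxel representations, so treat this purely as an illustration of the fusion step, not as the paper’s implementation.

```python
import numpy as np

def integrate_frame(tsdf, weights, depth, T_world_cam, K,
                    voxel_size=0.02, origin=np.zeros(3), trunc=0.08):
    """Fuse one posed depth frame into a dense voxel grid (TSDF running average).

    tsdf, weights : (X, Y, Z) float arrays, updated in place.
    depth         : (H, W) depth image in metres.
    T_world_cam   : (4, 4) camera-to-world pose of the frame.
    """
    h, w = depth.shape
    # World coordinates of every voxel centre.
    idx = np.stack(np.meshgrid(*[np.arange(s) for s in tsdf.shape], indexing="ij"), -1)
    centres = origin + (idx.reshape(-1, 3) + 0.5) * voxel_size
    # Transform the centres into the camera frame and project them into the image.
    T_cam_world = np.linalg.inv(T_world_cam)
    pts = centres @ T_cam_world[:3, :3].T + T_cam_world[:3, 3]
    z = pts[:, 2]
    valid = z > 1e-6
    u = np.zeros(len(z), dtype=int)
    v = np.zeros(len(z), dtype=int)
    u[valid] = np.round(pts[valid, 0] / z[valid] * K[0, 0] + K[0, 2]).astype(int)
    v[valid] = np.round(pts[valid, 1] / z[valid] * K[1, 1] + K[1, 2]).astype(int)
    valid &= (u >= 0) & (u < w) & (v >= 0) & (v < h)
    d = np.zeros(len(z))
    d[valid] = depth[v[valid], u[valid]]
    valid &= d > 0
    # Truncated signed distance along the viewing ray, then a weighted running average.
    sdf = np.clip(d - z, -trunc, trunc) / trunc
    valid &= sdf > -0.99                     # skip voxels far behind the surface
    t, wgt = tsdf.reshape(-1), weights.reshape(-1)
    t[valid] = (t[valid] * wgt[valid] + sdf[valid]) / (wgt[valid] + 1.0)
    wgt[valid] += 1.0
```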

The key contribution is to merge the submaps into a consistent global model on the fly. Intuitively, this relies on the assumption that the collaborating users will partly cover areas that others have visited as well. This creates overlaps between the submaps, which make it possible to establish relative 3D transforms between pairs of submaps. To identify the overlaps, a camera relocaliser is learned online for each submap. The relocaliser takes a single RGBD frame and computes the corresponding camera pose in its submap. This regression is learned very accurately by random forests (surprisingly, not by deep nets). Once the relocaliser for submap A is available, randomly selected RGBD frames from submap B can be tentatively localised in submap A. If a valid camera pose is computed in submap A, there is an overlap between A and B. Because the camera pose of the same frame is also known in submap B, one can calculate a relative transform between the two maps. To achieve higher robustness, several candidate transforms are obtained using different camera poses in the overlapping space. These candidates are clustered and a single spatial link is calculated for the pair of submaps.
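
The geometry behind a single overlap detection is compact: if the same RGBD frame has a known camera pose in submap B and a relocalised pose in submap A, then composing one pose with the inverse of the other maps submap-B coordinates into submap A. Below is our own simplified sketch of that step, plus a naive clustering of the candidate transforms; the paper’s actual procedure is more elaborate.

```python
import numpy as np

def relative_transform(T_A_cam, T_B_cam):
    """Submap-B -> submap-A transform from one relocalised frame.

    T_A_cam : camera-to-submap-A pose returned by submap A's relocaliser.
    T_B_cam : camera-to-submap-B pose already known from submap B's own tracking.
    """
    return T_A_cam @ np.linalg.inv(T_B_cam)

def cluster_and_pick(candidates, trans_tol=0.1, rot_tol_deg=10.0):
    """Group candidate transforms and return a representative of the largest cluster."""
    clusters = []
    for T in candidates:
        placed = False
        for cluster in clusters:
            R_ref, t_ref = cluster[0][:3, :3], cluster[0][:3, 3]
            dt = np.linalg.norm(T[:3, 3] - t_ref)
            cos_angle = (np.trace(R_ref.T @ T[:3, :3]) - 1.0) / 2.0
            angle = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
            if dt < trans_tol and angle < rot_tol_deg:
                cluster.append(T)
                placed = True
                break
        if not placed:
            clusters.append([T])
    best = max(clusters, key=len)
    return best[0], len(best)   # representative transform and its support
```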

As submap links appear over time, they are passed as relative constraints into a so-called pose graph optimisation. This process continuously refines the global poses of the individual submaps in the overall 3D model on the mapping server. When the reconstruction process is finished, the globally positioned submaps from the individual users can be fused into a single consistent map without redundancy in the overlap regions. In terms of system performance, a server with an NVIDIA Titan X can handle up to 11 users in real time, with GPU memory usage being the main bottleneck. The project’s aim of making large-scale 3D reconstruction simpler and more accessible to budget-conscious users has been hampered by the discontinuation of Google Tango. However, switching to a different RGBD sensor on a suitable mobile device is easy from an algorithmic point of view. Alternative avenues are also opening up with deep nets inferring depth from standard RGB images, as discussed in the next section.
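
Pose graph optimisation itself can be illustrated with a deliberately tiny toy: here we refine only the submap translations from a handful of (possibly inconsistent) relative constraints by repeated averaging, keeping one submap fixed as the anchor. The real system optimises full 6-DoF poses with a proper nonlinear solver, so this sketch only conveys the idea.

```python
import numpy as np

def optimise_submap_positions(n_submaps, constraints, iters=200, anchor=0):
    """Toy pose-graph refinement over submap translations only.

    constraints : list of (i, j, t_ij) meaning "submap j's origin should sit at
                  position[i] + t_ij". Submap `anchor` is held fixed to remove
                  the global gauge freedom.
    """
    pos = np.zeros((n_submaps, 3))
    for _ in range(iters):
        updates = [[] for _ in range(n_submaps)]
        for i, j, t_ij in constraints:
            updates[j].append(pos[i] + t_ij)   # where j should be, seen from i
            updates[i].append(pos[j] - t_ij)   # where i should be, seen from j
        for k in range(n_submaps):
            if k != anchor and updates[k]:
                pos[k] = np.mean(updates[k], axis=0)
    return pos

# Hypothetical example: three users whose submaps overlap pairwise.
constraints = [(0, 1, np.array([2.0, 0.0, 0.0])),
               (1, 2, np.array([0.0, 3.0, 0.0])),
               (0, 2, np.array([2.1, 3.05, 0.0]))]   # slightly inconsistent link
print(optimise_submap_positions(3, constraints))
```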

This technology could be used for many different cases, one of them being our research into indoor navigation.

Deep learning for depth perception

All the works mentioned so far reconstruct dense 3D models of environments. This is common because it enables more complex interactions between the real world and AR objects, such as occlusions, shadows and collisions (see the figure below). To obtain a dense reconstruction, the standard approach is to use a SLAM system with an RGBD sensor. Depth information is critical to overcome the limitations of monocular RGB techniques, such as scale drift, artefacts in low-texture regions or instability under pure rotation. However, the reach of this approach is limited to some high-end mobile phones, AR HMDs and specialised devices (Kinect, RealSense, etc.). Alternatively, recent advances in deep learning have enabled satisfactory depth estimation from colour images alone. This means that any colour camera can provide depth sensing without additional sensors.

The accuracy of the depth estimates is not yet good enough for standard KinectFusion reconstruction [6]. However, the deep net can be loosely combined with any monocular SLAM method. The authors of the presented work [5] chose to integrate it with a variant of ORB-SLAM [7] which can utilise depth maps in camera tracking. A camera pose is estimated online by SLAM at every frame, and the corresponding depth map is placed accordingly in global 3D space. The depth map is converted into a per-frame point cloud which is fused into the overall model. The point cloud fusion filters out duplicates and noisy points with every new image, and at the end the final point cloud is converted into a smooth 3D model of the scene. As you can see in the picture, the example model has decent quality. This is impressive given that depth is estimated at a very low resolution (120x80 pixels). The shown interaction of real furniture with AR cubes is believable (accurate placement on the bed, occlusions by the bed). The method runs in real time; however, the depth estimation needs a powerful NVIDIA Titan X card. It will be exciting to see this running with a standard camera on a mobile phone in the future.
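
The two steps described above, back-projecting the (low-resolution) predicted depth map into a world-space point cloud and fusing it into the model while filtering duplicates, can be sketched as follows. The voxel-based duplicate filter is our crude stand-in for the weighted point fusion used by the actual method, not a faithful reproduction of it.

```python
import numpy as np

def backproject_depth(depth, K, T_world_cam):
    """Turn a (low-resolution) predicted depth map into a world-space point cloud."""
    h, w = depth.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    z = depth.reshape(-1)
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z], axis=1)[z > 0]
    return pts_cam @ T_world_cam[:3, :3].T + T_world_cam[:3, 3]

def fuse(global_cloud, new_points, voxel=0.01):
    """Merge a new frame's points into the model, dropping near-duplicates.

    A point is kept only if its voxel cell is not occupied yet: a crude stand-in
    for weighted point fusion that also averages and denoises points.
    """
    occupied = ({tuple(k) for k in np.floor(global_cloud / voxel).astype(int)}
                if len(global_cloud) else set())
    keep = [p for p in new_points
            if tuple(np.floor(p / voxel).astype(int)) not in occupied]
    if not keep:
        return global_cloud
    keep = np.array(keep)
    return np.vstack([global_cloud, keep]) if len(global_cloud) else keep
```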

There were other interesting papers and demos at ISMAR this year, and the keynotes were engaging with many useful insights, so have a look online. We picked this selection because it is relevant to use cases we encounter in our business.

References

[1] Zillner et al. – Augmented Reality Remote Collaboration with Dense Reconstruction. ISMAR Demos, 2018.
[2] Runz et al. – MaskFusion: Real-Time Recognition, Tracking and Reconstruction of Multiple Moving Objects. ISMAR, 2018.
[3] He et al. – Mask R-CNN. ICCV, 2017.
[4] Golodetz et al. – Collaborative Large-Scale Dense 3D Reconstruction with Online Inter-Agent Pose Optimisation. ISMAR, 2018.
[5] Wang et al. – CNN-MonoFusion: Online Monocular Dense Reconstruction using Learned Depth from Single View. ISMAR, 2018.
[6] Newcombe et al. – KinectFusion: Real-time dense surface mapping and tracking. ISMAR, 2011.
[7] Mur-Artal et al. – ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras. Transactions on Robotics, Vol. 33, 2017.