In order to avoid an expensive manual labeling process or to learn object classes autonomously without human intervention, object discovery techniques have been proposed that extract visually similar objects from weakly labelled videos. However, the problem of discovering small or medium sized objects is largely unexplored. We observe that videos with activities involving human-object interactions can serve as weakly labelled data for such cases. Since neither object appearance nor motion is distinct enough to discover objects in these videos, we propose a framework that samples from a space of algorithms and their parameters to extract sequences of object proposals. Furthermore, we model similarity of objects based on appearance and functionality, which is derived from human and object motion. We show that functionality is an important cue for discovering objects from activities and demonstrate the generality of the model on three challenging RGB-D and RGB datasets.
In this talk I will discuss two related problems in 3D reconstruction: (i) recovering the 3D shape of a temporally varying non-rigid 3D surface given a single video sequence and (ii) reconstructing different instances of the same object class category given a large collection of images from that category. In both cases we extract dense 3D shape information by analysing shape variation -- in one case of the same object instance over time and in the other across different instances of objects that belong to the same class.
First I will discuss the problem of dense capture of 3D non-rigid surfaces from a monocular video sequence. We take a purely model-free approach where no strong assumptions are made about the object we are looking at or the way it deforms. We apply low rank and spatial smoothness priors to obtain dense non-rigid models using a variational approach.
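As an illustration only, an energy combining a reprojection term with the two priors named above could take the following generic form. The notation is assumed here and is not taken from the talk: $W_f$ holds the 2D tracks observed in frame $f$, $R_f$ is an orthographic camera, $S_f \in \mathbb{R}^{3 \times N}$ is the 3D shape in frame $f$, $S \in \mathbb{R}^{F \times 3N}$ stacks the vectorised shapes of all $F$ frames row by row, and $\lambda, \tau > 0$ are weights.

```latex
E(R, S) \;=\;
  \lambda \sum_{f=1}^{F} \bigl\lVert W_f - R_f S_f \bigr\rVert_F^2
  \;+\; \sum_{f=1}^{F} \sum_{i=1}^{N} \bigl\lVert \nabla S_f(i) \bigr\rVert
  \;+\; \tau \, \lVert S \rVert_{*}
```

The first term is the data (reprojection) cost, the second is a total-variation style spatial smoothness prior over neighbouring surface points, and the nuclear norm $\lVert S \rVert_{*}$ is a convex surrogate for the rank of $S$, which encourages the shapes to lie in a low-dimensional subspace.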
Second I will describe our recent approach to populating the Pascal VOC dataset with dense, per-object 3D reconstructions, bootstrapped from class labels, ground truth figure-ground segmentations and a small set of keypoint annotations. Our proposed algorithm first estimates camera viewpoint using rigid structure-from-motion, then reconstructs object shapes by optimizing over visual hull proposals guided by loose within-class shape similarity assumptions.
Even though many challenges remain unsolved, in recent years computer graphics algorithms to render photo-realistic imagery have seen tremendous progress. An important prerequisite for high-quality renderings is the availability of good models of the scenes to be rendered, namely models of shape, motion and appearance. Unfortunately, the technology to create such models has not kept pace with the technology to render the imagery. In fact, we observe a content creation bottleneck, as it often takes man-months of tedious manual work by animation artists to craft models of moving virtual scenes.
To overcome this limitation, the research community has been developing techniques to capture models of dynamic scenes from real world examples, for instance methods that rely on footage recorded with cameras or other sensors. One example is performance capture, which measures detailed dynamic surface models, for example of actors or an actor's face, from multi-view video and without markers in the scene. Even though such 4D capture methods have made big strides, they are still at an early stage of their development. Their application is limited to scenes of moderate complexity in controlled environments, reconstructed detail is limited, and captured content cannot be easily modified, to name only a few restrictions.
In this talk, I will elaborate on some ideas on how to go beyond this limited scope of 4D reconstruction, and show some results from our recent work. For instance, I will show how we can capture more complex scenes with many objects or subjects in close interaction, as well as very challenging scenes of a smaller scale, such as hand motion. The talk will also show how we can capitalize on more sophisticated light transport models and inverse rendering to enable high-quality reconstruction in much more uncontrolled scenes, eventually also outdoors, and with very few cameras. I will also demonstrate how to represent captured scenes such that they can be conveniently modified. If time allows, the talk will cover some of our recent ideas on how to perform advanced edits of videos (e.g. removing or modifying dynamic objects in scenes) by exploiting reconstructed 4D models, as well as robustly found inter- and intra-frame correspondences.
Christian Theobalt is a Professor of Computer Science and the head of the research group "Graphics, Vision, & Video" at the Max-Planck-Institute for Informatics, Saarbruecken, Germany. From 2007 until 2009 he was a Visiting Assistant Professor in the Department of Computer Science at Stanford University. He received his MSc degree in Artificial Intelligence from the University of Edinburgh, Scotland, and his Diplom (MS) degree in Computer Science from Saarland University, in 2000 and 2001 respectively. In 2005, he received his PhD (Dr.-Ing.) from Saarland University and Max-Planck-Institute for Informatics.
Most of his research deals with algorithmic problems that lie on the boundary between the fields of Computer Vision and Computer Graphics, such as dynamic 3D scene reconstruction and marker-less motion capture, computer animation, appearance and reflectance modelling, machine learning for graphics and vision, new sensors for 3D acquisition, advanced video processing, as well as image- and physically-based rendering.
For his work, he received several awards, including the Otto Hahn Medal of the Max-Planck Society in 2007, the EUROGRAPHICS Young Researcher Award in 2009, and the German Pattern Recognition Award 2012. Further, in 2013 he was awarded an ERC Starting Grant by the European Union. He is a Principal Investigator and a member of the Steering Committee of the Intel Visual Computing Institute in Saarbruecken. He is also a co-founder of a spin-off company from his group - www.thecaptury.com - that is commercializing a new generation of marker-less motion and performance capture solutions.
A goal in virtual reality is for the user to experience a synthetic environment as if it were real. Engagement with virtual actors is a big part of the sensory context, thus getting the people "right" is critical for success. Size, shape, gender, ethnicity, clothing, color, texture, movement, among other attributes must be layered and nuanced to provide an accurate encounter between an actor and a user. In this talk, I discuss the development of digital human models and how they may be improved to obtain the high realism for successful engagement in a virtual world.
Volumetric 3D modeling has attracted a lot of attention in the past. In this talk I will explain how the standard volumetric formulation can be extended to include semantic information by using a convex multi-label formulation. One of the strengths of our formulation is that it allows us to directly account for the expected surface orientations. I will focus on two applications. Firstly, I will introduce a method that allows for joint volumetric reconstruction and class segmentation. This is achieved by taking into account the expected orientations of object classes such as ground and building. Such a joint approach considerably improves the quality of the geometry while at the same time giving a consistent semantic segmentation. In the second application I will present a method that allows for the reconstruction of challenging objects such as glass bottles. The main difficulty with reconstructing such objects is the texture-less, transparent and reflective areas in the input images. We propose to formulate a shape prior based on the locally expected surface orientation to account for the ambiguous input data. Our multi-label approach also directly enables us to segment the object from its surroundings.
Christian Häne received the BSc and MSc degrees in computer science from ETH Zürich in 2010 and 2011, respectively. He is currently a graduate student at ETH Zürich in the Computer Vision and Geometry Group, under the supervision of Marc Pollefeys. In 2013 he did a three-month summer internship at Microsoft Research in Cambridge, UK. His research interests include convex methods for dense 3D reconstruction and the application of these methods to challenging scenarios. He is also interested in real-time implementations of computer vision algorithms using GPGPU.
The goal of lifelong visual learning is to develop techniques that continuously and autonomously learn from visual data, potentially for years or decades. During this time the system should build an ever-improving base of generic visual information, and use it as background knowledge and context for solving specific computer vision tasks. In my talk, I will highlight two recent results from our group on the road towards lifelong visual scene understanding: the derivation of theoretical guarantees for lifelong learning systems and the development of practical methods for object categorization based on semantic attributes.
Point-light walkers and stick figures rendered orthographically and without self-occlusion do not contain any information as to their depth. For instance, a frontoparallel projection could depict a walker from the front or from the back. Nevertheless, observers show a strong bias towards seeing the walker as facing the viewer. A related stimulus, the silhouette of a human figure, does not seem to show such a bias. We develop these observations into a tool to study the cause of the facing-the-viewer bias observed for biological motion displays.
I will give a short overview of existing theories with respect to the facing-the-viewer bias, and of a number of findings that seem hard to explain with any single one of them. I will then present the results of our studies on both stick figures and silhouettes, which gave rise to a new theory about the facing-the-viewer bias, and I will eventually present an experiment that tests a hypothesis resulting from it. The studies are discussed in the context of one of the most general problems the visual system has to solve: How do we disambiguate an initially ambiguous sensory world and eventually arrive at the perception of a stable, predictable "reality"?
Compared to static image segmentation, video segmentation is still in its infancy. Various research groups have different tasks in mind when they talk of video segmentation. For some it is motion segmentation, some think of an over-segmentation with thousands of regions per video, and others understand video segmentation as contour tracking. I will go through what I think are reasonable video segmentation subtasks and will touch on the issue of benchmarking. I will also discuss the difference between image and video segmentation. Due to the availability of motion and the redundancy of successive frames, video segmentation should actually be easier than image segmentation. However, recent evidence indicates the opposite: at least at the level of superpixel segmentation, image segmentation methodology is more advanced than what can be found in the video segmentation literature.
Thomas Brox received his Ph.D. in computer science from the Saarland University, Saarbrücken, Germany in 2005. During his studies he spent three months as a visiting researcher at the INRIA Sophia-Antipolis, France. After his Ph.D. he joined the Computer Vision Group at the University of Bonn. From October 2007 to October 2008 he headed the Intelligent Systems Group at the University of Dresden as a temporary faculty member. After two years as a postdoctoral fellow in the Computer Vision Group of Jitendra Malik at U.C. Berkeley he moved to the University of Freiburg, where he is heading the Computer Vision Group. Prof. Brox is associate editor of the IEEE Transactions on Pattern Analysis and Machine Intelligence and the Image and Vision Computing journal. He was/is an area chair of ICCV 2011, ACCV 2014 and ECCV 2014, and reviews for several funding organizations. In 2004, he received the Longuet-Higgins Best Paper Award at the European Conference on Computer Vision. In 2011 he was awarded an ERC starting grant. He is interested in all aspects of computer vision with a focus on video analysis (optical flow estimation, video segmentation, and learning from videos).
In the first part of our talk, we present an approach for large displacement optical flow. Optical flow computation is a key component in many computer vision systems designed for tasks such as action detection or activity recognition. Inspired by the large displacement optical flow of Brox and Malik, our approach DeepFlow combines a novel matching algorithm with a variational approach. Our matching algorithm builds upon a multi-stage architecture interleaving convolutions and max-pooling. DeepFlow efficiently handles large displacements occurring in realistic videos, and shows competitive performance on optical flow benchmarks.
In the second part of our talk, we present a state-of-the-art approach for action recognition based on motion stabilized trajectory descriptors and a Fisher vector representation. We briefly review the recent trajectory-based video features and then introduce their motion stabilized version, combining human detection and dominant motion estimation. Fisher vectors summarize the information of a video efficiently. Results on several of the recent action datasets as well as the TrecVid MED dataset show that our approach outperforms the state of the art.
Computer vision problems often involve optimization of two quantities, one of which is time. Such problems can be formulated as time-constrained optimization or performance-constrained search for the fastest algorithm. We show that it is possible to obtain quasi-optimal time-constrained solutions to some vision problems by applying Wald's theory of sequential decision-making. Wald assumes independence of observations, which is rarely true in computer vision. We address the problem by combining Wald's sequential probability ratio test and AdaBoost. The solution, called WaldBoost, can be viewed as a principled way to build a close-to-optimal “cascade of classifiers” of the Viola-Jones type. The approach will be demonstrated on four tasks: (i) face detection, (ii) establishing reliable correspondences between images, (iii) real-time detection of interest points and (iv) model search and outlier detection using RANSAC. In the face detection problem, the objective is learning the fastest detector satisfying constraints on false positive and false negative rates. The correspondence pruning addresses the problem of fast selection with a predefined false negative rate. In the interest point problem we show how a fast implementation of known detectors can be obtained by WaldBoost. The “mimicked” detectors provide a training set of positive and negative examples of interest points, and WaldBoost learns a detector, formed as a linear combination of efficiently computable features, that is (significantly) faster than the providers of the training set. In RANSAC, we show how to exploit Wald's test in a randomised model verification procedure to obtain an algorithm significantly faster than deterministic verification yet with equivalent probabilistic guarantees of correctness.
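For readers unfamiliar with Wald's sequential probability ratio test, the following is a minimal sketch of the classical test for independent observations, using Wald's approximate thresholds. It is not the WaldBoost algorithm from the talk. The function names `sprt` and `bernoulli_llr`, the error rates `alpha` and `beta`, and the Bernoulli example are assumptions chosen for illustration.

```python
import math


def sprt(observations, llr, alpha=0.01, beta=0.01):
    """Classical Wald SPRT between hypotheses H0 and H1 (illustrative sketch).

    observations: iterable of samples, assumed independent.
    llr: function mapping one sample x to log(p(x | H1) / p(x | H0)).
    alpha: target false positive rate (accepting H1 when H0 is true).
    beta: target false negative rate (accepting H0 when H1 is true).
    Returns (decision, number_of_samples_used).
    """
    upper = math.log((1.0 - beta) / alpha)  # accept H1 at or above this
    lower = math.log(beta / (1.0 - alpha))  # accept H0 at or below this
    total = 0.0
    n = 0
    for n, x in enumerate(observations, start=1):
        # Independence lets the joint log-likelihood ratio be a running sum.
        total += llr(x)
        if total >= upper:
            return "H1", n
        if total <= lower:
            return "H0", n
    return "undecided", n


def bernoulli_llr(p0, p1):
    """Log-likelihood ratio for a binary sample; success probability is p0 under H0, p1 under H1."""
    def llr(x):
        if x:
            return math.log(p1 / p0)
        return math.log((1.0 - p1) / (1.0 - p0))
    return llr


if __name__ == "__main__":
    llr = bernoulli_llr(0.2, 0.8)
    print(sprt([1, 1, 1, 1, 1, 1], llr))  # stops early once the evidence is strong enough
```

The test stops as soon as the accumulated evidence crosses either threshold, which is what makes the decision time-efficient on easy samples. The running sum relies on the independence assumption mentioned above; as the abstract describes, WaldBoost addresses the dependent case by combining the test with AdaBoost.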