The appearance of outdoor scenes changes dramatically with lighting and weather conditions, time of day, and season. Specific conditions, such as the "golden hours" characterized by warm light, can be hard to capture because many scene properties are transient -- they change over time. Despite significant advances in image editing software, common image manipulation tasks such as lighting editing require significant expertise to achieve plausible results.
In this talk, we first explore the appearance of outdoor scenes with an approach based on crowdsourcing and machine learning. We relate visual changes to scene attributes, which are human-nameable concepts used for high-level description of scenes. We collect a dataset containing thousands of outdoor images, annotate them with transient attributes, and train classifiers to recognize these properties in new images. We develop new interfaces for browsing photo collections, based on these attributes.
We then focus on specifically extracting and manipulating the lighting in a photograph. Intrinsic image decomposition separates a photograph into independent layers: reflectance, which represents the color of the materials, and illumination, which encodes the effect of lighting at each pixel. We tackle this ill-posed problem by leveraging additional information provided by multiple photographs of the scene. The methods we describe enable advanced image manipulations such as lighting-aware editing, insertion of virtual objects, and image-based illumination transfer between photographs of a collection.
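As a minimal sketch of the layered model described above, assume the common multiplicative formulation in which each pixel is the product of reflectance and illumination. The helper names `extract_illumination` and `transfer_illumination` are hypothetical, and the reflectance layer is taken as already known; estimating it is the hard, ill-posed part and is not shown. The photographs are also assumed to be pixel-aligned.

```python
import numpy as np

def extract_illumination(image, reflectance, eps=1e-6):
    """Recover the illumination layer from an image and its reflectance layer.

    Assumes image = reflectance * illumination at every pixel.
    """
    return image / np.maximum(reflectance, eps)

def transfer_illumination(reflectance, illumination):
    """Relight by combining one photo's reflectance with another's illumination."""
    return reflectance * illumination

# Synthetic example: one set of materials seen under two lighting conditions.
rng = np.random.default_rng(0)
reflectance = rng.uniform(0.2, 1.0, size=(4, 4, 3))   # material colours
shading_noon = rng.uniform(0.5, 1.0, size=(4, 4, 1))  # bright lighting
shading_dusk = rng.uniform(0.1, 0.4, size=(4, 4, 1))  # dim lighting
photo_noon = reflectance * shading_noon
photo_dusk = reflectance * shading_dusk

# Take the lighting of the dusk photo and apply it to the scene's materials.
illum_dusk = extract_illumination(photo_dusk, reflectance)
relit_noon = transfer_illumination(reflectance, illum_dusk)
```

Because both synthetic photos share the same reflectance, relighting with the dusk illumination reproduces the dusk photo, which makes the sketch easy to check.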
This talk presents recent work from CVPR on inference for pairwise CRF models in the highly (or fully) connected case, rather than with the sparse neighbourhoods used ubiquitously in many computer vision tasks. Recent work has shown that fully-connected CRFs, where each node is connected to every other node, can be solved very efficiently under the restriction that the pairwise term is a Gaussian kernel over a Euclidean feature space. The method presented generalises this model to allow arbitrary, non-parametric models (which can be learnt from training data and conditioned on test data) to be used for the pairwise potentials. This greatly increases the expressive power of such models whilst maintaining efficient inference.
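To make the fully-connected setting concrete, here is a minimal sketch of mean-field inference with a Gaussian-kernel Potts pairwise term, i.e. the restricted baseline mentioned above, not the generalised method of the talk. It sums over all pairs naively in O(N^2) rather than using fast filtering, and the function name, `weight`, and `bandwidth` are illustrative assumptions.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax with the usual max-subtraction for stability."""
    scores = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def dense_crf_mean_field(unary, features, weight=1.0, bandwidth=1.0, iters=10):
    """Naive mean-field inference for a fully-connected Potts CRF.

    unary:    (N, L) unary energies (negative log-probabilities).
    features: (N, D) feature vectors; the pairwise term is a Gaussian kernel.
    Returns the (N, L) approximate marginals Q.
    """
    sq_dist = ((features[:, None, :] - features[None, :, :]) ** 2).sum(axis=-1)
    kernel = np.exp(-sq_dist / (2.0 * bandwidth ** 2))
    np.fill_diagonal(kernel, 0.0)          # no message from a node to itself
    q = softmax(-unary)
    for _ in range(iters):
        message = kernel @ q               # sum_j k(f_i, f_j) * Q_j(l)
        # Potts model: agreeing with similar nodes lowers the energy.
        q = softmax(-unary + weight * message)
    return q

# Six nodes in two feature clusters; node 1 has a misleading unary term.
features = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
probs = np.array([[0.9, 0.1], [0.4, 0.6], [0.9, 0.1],
                  [0.1, 0.9], [0.1, 0.9], [0.1, 0.9]])
marginals = dense_crf_mean_field(-np.log(probs), features)
labels = marginals.argmax(axis=1)
```

In this toy example the node whose unary term weakly prefers the wrong label is pulled back to the label of its close neighbours in feature space.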
Bio: Neill completed his PhD at the University of Cambridge with Roberto Cipolla and at Toshiba Research, advised by Carlos Hernandez and George Vogiatzis, developing CRF (Conditional Random Field) models for multi-view object segmentation and stereo. He then moved to University College London and is now a Post-Doc with Jan Kautz and Simon Prince where he divides his time between computer graphics, machine learning and computer vision, currently working on synthesising photo-realistic objects, active learning and CRF inference.
Humans are very good at recognizing objects as well as the materials that they are made of. We can easily tell cheese from butter, silk from linen and snow from ice just by looking. Understanding material perception is important for many real-world applications. For instance, a robot cooking in the kitchen will benefit from material perception when deciding whether food is cooked or raw. In this talk, I will present studies that are motivated by two important applications of material perception: online shopping and computer graphics (CG) rendering. First, I will discuss the image cues that allow humans to infer tactile and mechanical information about deformable materials. I will present an experiment in which subjects were asked to match their tactile and visual perception of fabrics. I will show that image cues such as 3D folds and color are important for predicting subjects' tactile perception. Not only do these findings have immediate practical implications (e.g., improving online shopping interfaces for fabrics), but they also have theoretical implications: image-based visual cues affect tactile perception. Second, I will present a project on the visual perception of translucent materials (e.g., wax, milk, and jade) using computer-rendered stimuli. Humans are very sensitive to subtle differences in translucency (e.g., baby skin vs. adult skin); however, it is difficult to render translucent materials realistically. I will show how we measured the perceptual dimensions of physical scattering parameter space and used those measurements to produce more realistic renderings of materials like marble and jade. Taken together, my findings highlight the importance of material perception in the real world, and demonstrate how human perception can contribute to applications in computer vision and graphics.
In this talk I will detail methods for simultaneous 2D/3D segmentation, tracking and reconstruction which incorporate high-level shape information. I base my work on the assumption that the space of possible 2D object shapes can be either generated by projecting down known rigid 3D shapes or learned from 2D shape examples. I minimise the discrimination between statistical foreground and background appearance models with respect to the parameters governing the shape generative process (the 6 degree-of-freedom 3D pose of the 3D shape or the parameters of the learned space). The foreground region is delineated by the zero level set of a signed distance function, and I define an energy over this region and its immediate background surroundings based on pixel-wise posterior membership probabilities. I obtain the differentials of this energy with respect to the parameters governing shape and conduct searches for the correct shape using standard non-linear minimisation techniques. This methodology first leads to a novel rigid 3D object tracker. For a known 3D shape, the optimisation here aims to find the 3D pose that leads to the 2D projection that best segments a given image. I also extend my approach to track multiple objects from multiple views and show how depth (such as may be available from a Kinect sensor) can be integrated in a straightforward manner. Next, I explore deformable 2D/3D object tracking. I use a non-linear and probabilistic dimensionality reduction technique, Gaussian Process Latent Variable Models, to learn spaces of shape. Segmentation becomes a minimisation of an image-driven energy function in the learned space. I can represent both 2D and 3D shapes, which I compress with Fourier-based transforms to keep inference tractable. I extend this method by learning joint shape-parameter spaces, which, novel to the literature, enable simultaneous segmentation and generic parameter recovery. These can describe anything from 3D articulated pose to eye gaze.
Finally, I will discuss various applications of the proposed techniques, ranging from (limited) articulated hand tracking to semantic SLAM.
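The level-set energy described in this abstract can be illustrated with a small sketch. Everything here is an assumption made for illustration: the shape is a circle parameterised only by its centre, the signed distance function is taken to be positive inside the shape, the Heaviside step is smoothed with an arctangent, the per-pixel foreground and background probabilities are synthetic, and a brute-force grid search stands in for the gradient-based minimisation mentioned in the abstract.

```python
import numpy as np

def circle_sdf(shape, center, radius):
    """Signed distance to a circle; positive inside (convention assumed here)."""
    rows, cols = np.indices(shape)
    dist = np.hypot(rows - center[0], cols - center[1])
    return radius - dist

def smoothed_heaviside(phi, eps=1.0):
    """Smooth step from 0 (far outside) to 1 (far inside)."""
    return 0.5 + np.arctan(phi / eps) / np.pi

def pixelwise_posterior_energy(phi, p_fg, p_bg):
    """Negative log-likelihood of the image under a soft fg/bg partition."""
    h = smoothed_heaviside(phi)
    return -np.sum(np.log(h * p_fg + (1.0 - h) * p_bg))

# Synthetic per-pixel posteriors: foreground is likely inside the true circle.
shape = (40, 40)
true_center, radius = (20, 22), 8.5
inside = circle_sdf(shape, true_center, radius) > 0
p_fg = np.where(inside, 0.9, 0.1)
p_bg = 1.0 - p_fg

# Brute-force search over candidate centres (a stand-in for gradient descent).
candidates = [(r, c) for r in range(14, 27, 2) for c in range(16, 29, 2)]
energies = [
    pixelwise_posterior_energy(circle_sdf(shape, c, radius), p_fg, p_bg)
    for c in candidates
]
best_center = candidates[int(np.argmin(energies))]
```

In the methods described above the shape parameters would be a 6 degree-of-freedom pose or coordinates in a learned shape space, rather than a 2D centre.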
Sensors acquire an increasing amount of diverse information, which poses two challenges. First, how can we deal efficiently with such large amounts of data, and second, how can we benefit from this diversity? In this talk I will first present an approach to deal with large graphical models. The presented method distributes and parallelizes the computation and memory requirements while preserving convergence and optimality guarantees of existing inference and learning algorithms. I will demonstrate the effectiveness of the approach on stereo reconstruction from high-resolution imagery. In the second part I will present a unified framework for structured prediction with latent variables which includes hidden conditional random fields and latent structured support vector machines as special cases. This framework allows different sources of information to be combined linearly, and I will demonstrate its efficacy on the problem of estimating the 3D room layout from a single image. In a third part, I will introduce a globally optimal yet efficient inference algorithm for this problem, based on branch-and-bound.
Bio: Alexander Schwing studied electrical engineering and information technology at Technical University Munich (TUM), with an emphasis on signal processing. He is currently a PhD student at ETH Zurich, supervised by Tamir Hazan (TTI-C), Marc Pollefeys (ETHZ) and Raquel Urtasun (TTI-C). His research focuses on optimization algorithms for inference and learning tasks, and his work is motivated by, among others, applications in indoor 3D scene understanding.
Consumer level depth cameras such as Kinect have changed the landscape of 3D computer vision. In this talk we will discuss two approaches that both learn to directly infer correspondences between observed depth image pixels and 3D model points. These correspondences can then be used to drive an optimization of a generative model to explain the data. The first approach, the "Vitruvian Manifold", aims to fit an articulated 3D human model to a depth camera image, and extends our original Body Part Recognition algorithm used in Kinect. It applies a per-pixel regression forest to infer direct correspondences between image pixels and points on a human mesh model. This allows an efficient “one-shot” continuous optimization of the model parameters to recover the human pose. The second approach, "Scene Coordinate Regression", addresses the problem of camera pose relocalization. It uses a similar regression forest, but now aims to predict correspondences between observed image pixels and 3D world coordinates in an arbitrary 3D scene. These correspondences are again used to drive an efficient optimization of the camera pose to a highly accurate result from a single input frame.
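To illustrate how pixel-to-world correspondences can drive a pose estimate, here is a minimal sketch that recovers a rigid camera pose from 3D-3D correspondences with the closed-form Kabsch (orthogonal Procrustes) solution. This is a swapped-in textbook technique, not the optimization used in the talk: it assumes noise-free, outlier-free correspondences between camera-space points (as a depth camera would provide) and predicted world coordinates. The function name is hypothetical.

```python
import numpy as np

def rigid_pose_from_correspondences(cam_points, world_points):
    """Find rotation R and translation t with world ~= R @ cam + t (Kabsch).

    cam_points, world_points: (N, 3) arrays of corresponding 3D points.
    """
    cam_centroid = cam_points.mean(axis=0)
    world_centroid = world_points.mean(axis=0)
    # Cross-covariance of the centred point sets.
    h = (cam_points - cam_centroid).T @ (world_points - world_centroid)
    u, _, vt = np.linalg.svd(h)
    # Guard against a reflection so that det(R) = +1.
    d = np.sign(np.linalg.det(vt.T @ u.T))
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    translation = world_centroid - rotation @ cam_centroid
    return rotation, translation

# Synthetic check: build a random pose, generate correspondences, recover it.
rng = np.random.default_rng(1)
q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(q) < 0:
    q[:, 0] *= -1.0                      # make it a proper rotation
true_rotation, true_translation = q, np.array([0.5, -1.0, 2.0])
cam_points = rng.normal(size=(50, 3))
world_points = cam_points @ true_rotation.T + true_translation

rotation, translation = rigid_pose_from_correspondences(cam_points, world_points)
```

With real forest predictions, many correspondences would be wrong, so a robust scheme (for example hypothesis sampling with an inlier count) would be needed around a step like this.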
Object detection is one of the main challenges of computer vision. In the standard setting, we are given an image and the goal is to place bounding boxes around the objects and recognize their classes. In robotics, estimating additional information such as accurate viewpoint or detailed segmentation is important for planning and interaction. In this talk, I'll approach detection in three scenarios (purely 2D, 3D from 2D, and 3D from 3D) and show how different types of information can be used to significantly boost the current state of the art in detection.
Developing autonomous systems that are able to assist humans in everyday tasks is one of the grand challenges in modern computer science. Notable examples are personal robotics for the elderly and people with disabilities, as well as autonomous driving systems, which can help decrease fatalities caused by traffic accidents. In order to perform tasks such as navigation, recognition and manipulation of objects, these systems should be able to efficiently extract 3D knowledge of their environment. In this talk, I'll show how Markov random fields provide a great mathematical formalism to extract this knowledge. In particular, I'll focus on a few examples, i.e., 3D reconstruction, 3D layout estimation, 2D holistic parsing and object detection, and show representations and inference strategies that allow us to achieve state-of-the-art performance as well as several orders of magnitude speed-ups.
Raquel Urtasun is an Assistant Professor at TTI-Chicago, a philanthropically endowed academic institute located on the campus of the University of Chicago. She was a visiting professor at ETH Zurich during the spring semester of 2010. Previously, she was a postdoctoral research scientist at UC Berkeley and ICSI and a postdoctoral associate at the Computer Science and Artificial Intelligence Laboratory (CSAIL) at MIT. Raquel Urtasun completed her PhD at the Computer Vision Laboratory at EPFL, Switzerland, in 2006, working with Pascal Fua and David Fleet at the University of Toronto. She has been area chair of multiple learning and vision conferences (i.e., NIPS, UAI, ICML, ICCV), and has served on the committees of numerous international computer vision and machine learning conferences. Her major interests are statistical machine learning and computer vision, with a particular interest in non-parametric Bayesian statistics, latent variable models, structured prediction and their application to semantic scene understanding.
Motion capture and data-driven technologies have come very far over the past few years. In human capture, the high volume of research that has gone into this subfield has led to very impressive results. Human motion can now be captured in real time, which, when used in the creative sectors, can lead to blockbuster films such as Avatar. Similarly, in the medical sectors these techniques can be used to diagnose, analyse performance and avoid invasive procedures in tasks such as deformity correction. There is, however, very little research on motion capture of animals. While the technology for capturing animal motion exists, the method used is inefficient, unreliable and limited, as much manual work is required to turn blocked-out motions into acceptable results. How we move forward with a suitable procedure, however, is the major question. Do we extend the life of marker-based capture or do we move towards the holy grail of markerless tracking? In this talk we look at a possible solution suitable for both possibilities through physically based simulation techniques. It is our belief that such techniques could help cross the gap in the uncanny valley as far as marker-based capture is concerned, and also be useful for markerless tracking.
Non-blind deblurring is an integral component of blind approaches for removing image blur due to camera shake. Even though learning-based deblurring methods exist, they have been limited to the generative case and are computationally expensive. To date, manually defined models are thus most widely used, although this limits the attainable restoration quality. We address this gap by proposing a discriminative approach for non-blind deblurring. One key challenge is that the blur kernel in use at test time is not known in advance. To address this, we analyze existing approaches that use half-quadratic regularization. From this analysis, we derive a discriminative model cascade for image deblurring. Our cascade model consists of a Gaussian CRF at each stage, based on the recently introduced regression tree fields. We train our model by loss minimization and use synthetically generated blur kernels to generate training data. Our experiments show that the proposed approach is efficient and yields state-of-the-art restoration quality on images corrupted with synthetic and real blur.
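As background for the half-quadratic regularization mentioned above, here is a minimal sketch of the quadratic sub-problem that such schemes solve repeatedly: given a known blur kernel and an auxiliary image z, minimise ||k * x - y||^2 + beta ||x - z||^2 in closed form in the Fourier domain. This is not the discriminative cascade of the talk. The function names, the circular boundary handling, the Gaussian test kernel, and the choice of `beta` are all assumptions, and the step that would normally update z is omitted.

```python
import numpy as np

def psf_to_otf(kernel, shape):
    """Zero-pad a blur kernel to `shape` and centre it at the origin before the FFT."""
    padded = np.zeros(shape)
    kh, kw = kernel.shape
    padded[:kh, :kw] = kernel
    padded = np.roll(padded, (-(kh // 2), -(kw // 2)), axis=(0, 1))
    return np.fft.fft2(padded)

def quadratic_step(blurred, kernel, z, beta):
    """Closed-form minimiser of ||k * x - y||^2 + beta * ||x - z||^2.

    Assumes circular convolution, so the problem diagonalises under the FFT.
    """
    otf = psf_to_otf(kernel, blurred.shape)
    numerator = np.conj(otf) * np.fft.fft2(blurred) + beta * np.fft.fft2(z)
    return np.real(np.fft.ifft2(numerator / (np.abs(otf) ** 2 + beta)))

# Synthetic, noise-free example with a small Gaussian blur kernel.
rng = np.random.default_rng(0)
sharp = rng.uniform(size=(32, 32))
g = np.exp(-0.5 * np.arange(-2, 3) ** 2)
kernel = np.outer(g, g)
kernel /= kernel.sum()
blurred = np.real(np.fft.ifft2(np.fft.fft2(sharp) * psf_to_otf(kernel, sharp.shape)))

# Use the blurred image itself as the initial auxiliary estimate z.
restored = quadratic_step(blurred, kernel, z=blurred, beta=1e-6)
```

With noisy inputs a much larger `beta` and a meaningful update for z would be needed; the tiny value here only works because the example is noise-free.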