# Talks

## Patch-Based Image Processing

Within the wide field of sparse approximation, convolutional sparse coding (CSC) has gained increasing attention in recent years. This model assumes a structured dictionary built as a union of banded circulant matrices. Most of the attention has been devoted to the practical side of CSC, proposing efficient algorithms for the pursuit problem, and identifying applications that benefit from this model. Interestingly, a systematic theoretical understanding of CSC seems to have been left aside, with the assumption that the existing classical results are sufficient. In this talk we start by presenting a novel analysis of the CSC model and its associated pursuit. Our study is based on the observation that while being global, this model can be characterized and analyzed locally.

We show that uniqueness of the representation, its stability with respect to noise, and successful greedy or convex recovery are all guaranteed assuming that the underlying representation is locally sparse. These new results are much stronger and more informative than those obtained by deploying the classical sparse theory. Armed with these new insights, we proceed by proposing a multi-layer extension of this model, ML-CSC, in which signals are assumed to emerge from a cascade of CSC layers. This, in turn, is shown to be tightly connected to Convolutional Neural Networks (CNN), so much so that the forward pass of the CNN is in fact the Thresholding pursuit serving the ML-CSC model. This connection brings a fresh view to CNN, as we are able to attribute to this architecture theoretical claims such as uniqueness of the representations throughout the network, and their stable estimation, all guaranteed under simple local sparsity conditions. Lastly, identifying the weaknesses in the above scheme, we propose an alternative to the forward-pass algorithm, which is both tightly connected to deconvolutional and recurrent neural networks, and has better theoretical guarantees.
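
The claimed link between the CNN forward pass and layered thresholding can be illustrated with a minimal sketch. It assumes small dense dictionaries stored as lists of rows (the convolutional structure is omitted), a non-negative soft-thresholding operator, and biases equal to the negated thresholds. All function names here are hypothetical and chosen only for illustration.

```python
def transpose_matvec(D, x):
    """Compute D^T x for a matrix D stored as a list of rows (n rows, m columns)."""
    n, m = len(D), len(D[0])
    return [sum(D[i][j] * x[i] for i in range(n)) for j in range(m)]


def soft_threshold_nonneg(v, beta):
    """Non-negative soft thresholding: shrink each entry by its threshold, clip at 0."""
    return [max(z - b, 0.0) for z, b in zip(v, beta)]


def relu(v):
    """Rectified linear unit applied entry-wise."""
    return [max(z, 0.0) for z in v]


def layered_thresholding(x, dictionaries, thresholds):
    """Estimate the layer representations one after another by thresholding."""
    gamma = x
    for D, beta in zip(dictionaries, thresholds):
        gamma = soft_threshold_nonneg(transpose_matvec(D, gamma), beta)
    return gamma


def cnn_forward(x, weights, biases):
    """Plain feed-forward pass: ReLU(W^T a + b) at every layer."""
    a = x
    for W, b in zip(weights, biases):
        z = transpose_matvec(W, a)
        a = relu([zi + bi for zi, bi in zip(z, b)])
    return a


# Toy two-layer example (values are arbitrary, for illustration only).
D1 = [[1.0, 0.5, -1.0], [0.0, 1.0, 2.0]]      # 2 x 3
D2 = [[1.0, -1.0], [0.0, 1.0], [1.0, 1.0]]    # 3 x 2
beta1 = [0.5, 0.5, 0.5]
beta2 = [1.0, 1.0]
x = [2.0, 1.0]

pursuit = layered_thresholding(x, [D1, D2], [beta1, beta2])
network = cnn_forward(x, [D1, D2], [[-b for b in beta1], [-b for b in beta2]])
```

Under these assumptions the two routines compute the same quantity, since ReLU applied to `z + b` with `b = -beta` is exactly the non-negative soft threshold of `z` at level `beta`.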

Style-transfer is a process of migrating a style from a given image to the content of another, synthesizing a new image which is an artistic mixture of the two. Recent work on this problem adopting Convolutional Neural Networks (CNN) ignited a renewed interest in this field, due to the very impressive results obtained. There exists an alternative path towards handling the style-transfer task, via generalization of texture-synthesis algorithms. I will present a novel such style-transfer algorithm that extends the texture-synthesis work of Kwatra et al. (2005), while aiming to get stylized images that get closer in quality to the CNN ones.

Recent work in image processing repeatedly shows highly efficient reconstruction algorithms that lean on modeling of small overlapping patches. Such methods impose a local model in order to regularize a global inverse problem. Why does this work so well? Does this leave room for improvements? What does a local model imply globally on the unknown signal? In this talk we will start from algorithmic attempts that aim to understand this dichotomy in order to narrow the global-local gap. Gradually, we will turn the discussion to a theoretical point of view that provides a deeper understanding of such local models, and their global implications.

Compression of frontal facial images is an appealing and important application. Recent work has shown that specially tailored algorithms for this task can lead to performance far exceeding JPEG2000. This paper proposes a novel such compression algorithm, exploiting our recently developed redundant tree-based wavelet transform. Originally meant for functions defined on graphs and clouds of points, this new transform has been shown to be highly effective as an image-adaptive redundant and multi-scale decomposition. The key concept behind this method is reordering of the image pixels so as to form a highly smooth 1D signal that can be sparsified by a regular wavelet. In this work we bring this image-adaptive transform to the realm of compression of aligned frontal facial images. Given a training set of such images, the transform is designed to best sparsify the whole set using a common feature-ordering. Our compression scheme consists of sparse coding using the transform, followed by entropy coding of the obtained coefficients. The inverse transform and a post-processing stage are used to decode the compressed image. We demonstrate the performance of the proposed scheme and compare it to other competing algorithms.

What if we take all the overlapping patches from a given image and organize them to create the shortest path by using their mutual Euclidean distances? This suggests a reordering of the image pixels in a way that creates a maximal 1D regularity. What could we do with such a construction? In this talk we consider a wider perspective of the above, and introduce a wavelet transform for graph-structured data. The proposed transform is based on a 1D wavelet decomposition coupled with a pre-reordering of the input so as to best sparsify the given data. We adapt this transform to image processing tasks by considering the image as a graph, where every patch is a node, and edges are obtained by Euclidean distances between corresponding patches. We show several ways to use the above ideas in practice, leading to state-of-the-art image denoising, deblurring, inpainting, and face-image compression results.
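
The patch-reordering idea can be sketched in a few lines. The version below is only an illustration: it assumes a greedy nearest-neighbor walk as a crude stand-in for the shortest-path ordering (the abstract does not say how the path is computed), and the function names are hypothetical.

```python
def extract_patches(image, p):
    """Return all overlapping p-by-p patches of a 2D list, flattened, in raster order."""
    rows, cols = len(image), len(image[0])
    return [
        [image[i + a][j + b] for a in range(p) for b in range(p)]
        for i in range(rows - p + 1)
        for j in range(cols - p + 1)
    ]


def squared_distance(u, v):
    """Squared Euclidean distance between two flattened patches."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def greedy_patch_order(patches, start=0):
    """Visit every patch once, always moving to the nearest unvisited patch.

    This is a greedy approximation of a short path through the patches,
    not an exact shortest-path (traveling-salesman) solution.
    """
    order = [start]
    remaining = [k for k in range(len(patches)) if k != start]
    while remaining:
        current = patches[order[-1]]
        nearest = min(remaining, key=lambda k: squared_distance(current, patches[k]))
        remaining.remove(nearest)
        order.append(nearest)
    return order


# Toy example: a single-row "image" with 1-by-1 patches, so patches are pixels.
image = [[0, 10, 1, 9]]
patches = extract_patches(image, 1)
order = greedy_patch_order(patches)
reordered = [patches[k][0] for k in order]
```

In this toy case the reordered pixel sequence varies much more smoothly than the raster-order one, which is the kind of 1D regularity a subsequent wavelet decomposition could exploit.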

Images are 2D signals, and should be processed as such; this is the common belief in the image processing community. Is it truly the case? Around thirty years ago, some researchers suggested converting images into 1D signals, so as to harness well-developed 1D tools such as adaptive-filtering and Kalman-estimation techniques. These attempts resulted in poorly performing algorithms, strengthening the above belief. Why should we force unnatural causality between spatially ordered pixels? Indeed, why? In this talk I will present a conversion of images into 1D signals that leads to state-of-the-art results in a series of applications: denoising, inpainting, compression, and more. The core idea in our work is that there exists a permutation of the image pixels that carries in it most of the “spatial content”, and this ordering is within reach, even if the image is corrupted. We expose this permutation and use it in order to process the image as if it were a one-dimensional signal, successfully treating a series of image processing problems.

Images, video, audio, text documents, financial data, medical information, traffic info – all these and many others are data sources that can be effectively processed. Why? Is it obvious? In this talk we will start by discussing “modeling” of data as a way to enable their actual processing, putting emphasis on sparsity-based models. We will turn our attention to graph-structured data and propose a tailored sparsifying transform for its dimensionality reduction and subsequent processing. We shall conclude by showing how this new transform becomes relevant and powerful in revisiting … classical image processing tasks.

What if we take all the overlapping patches from a given image and organize them to create the shortest path by using their mutual distances? This suggests a reordering of the image pixels in a way that creates a maximal 1D regularity. What could we do with such a construction? In this talk we consider a wider perspective of the above, and introduce a wavelet transform for graph-structured data. The proposed transform is based on a 1D wavelet decomposition coupled with a pre-reordering of the input so as to best sparsify the given data. We adapt this transform to image processing tasks by considering the image as a graph, where every patch is a node, and edges are obtained by Euclidean distances between corresponding patches. We show several ways to use the above ideas in practice, leading to state-of-the-art image denoising, deblurring, and inpainting results.

In this talk we describe the co-sparse analysis model, with emphasis on pursuit algorithms and dictionary learning for it. We present two of our recent activities on this subject: (i) A theoretical study of the Analysis-Thresholding algorithm, exposing measures of goodness for the dictionary that govern the pursuit performance; and (ii) The development of an analysis K-SVD algorithm that trains a dictionary from signal examples and its use for image denoising.

The synthesis-based sparse representation model for signals has drawn considerable interest in the past decade. Such a model assumes that the signal of interest can be decomposed as a linear combination of a few atoms from a given dictionary. In this talk we concentrate on an alternative, analysis-based model, where an analysis operator (hereafter referred to as the “Analysis Dictionary”) multiplies the signal, leading to a sparse outcome. While the two alternative models seem to be very close and similar, they are in fact very different. In this talk we clearly define the analysis model and describe how to generate signals from it. We discuss the pursuit denoising problem that seeks the zeros of the signal with respect to the analysis dictionary given noisy measurements. Finally, we explore ideas for learning the analysis dictionary from a set of signal examples. We demonstrate this model’s effectiveness in several experiments, treating synthetic data and real images, showing a successful and meaningful recovery of the analysis dictionary.
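
To make the analysis viewpoint concrete, here is a small sketch. It assumes a first-order finite-difference operator as the analysis dictionary and a piecewise-constant test signal; both are illustrative choices, not ones taken from the talk, and the function names are hypothetical.

```python
def finite_difference_operator(n):
    """Build the (n-1) x n first-order difference operator as a list of rows."""
    rows = []
    for i in range(n - 1):
        row = [0.0] * n
        row[i] = -1.0
        row[i + 1] = 1.0
        rows.append(row)
    return rows


def apply_operator(omega, x):
    """Multiply the signal by the analysis operator: returns omega @ x."""
    return [sum(w * v for w, v in zip(row, x)) for row in omega]


def cosparsity(coeffs, tol=1e-12):
    """Count the (numerically) zero entries of the analysis coefficients."""
    return sum(1 for c in coeffs if abs(c) <= tol)


# A piecewise-constant signal has a single jump, hence a single nonzero difference.
x = [1.0, 1.0, 1.0, 4.0, 4.0, 4.0]
omega = finite_difference_operator(len(x))
coeffs = apply_operator(omega, x)
zeros = cosparsity(coeffs)
```

In this model the emphasis is on the zeros of the analysis coefficients: each zero states that the signal is orthogonal to one row of the operator, which is the information the pursuit described above seeks to recover from noisy measurements.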

Multi-channel TV broadcast, Internet video and YouTube, home DVD movies, video conference calls, cellular video calls and more – there is no doubt that videos are abundant and in everyday use. In many cases, the quality of the available video is poor, something commonly referred to as “low-resolution”. As an example, high-definition (HD) TVs are commonly sold these days to customers who hope to enjoy a better viewing experience. Nevertheless, most TV broadcast today is still done in standard-definition (SD), leading to poor image quality on these screens. The field of Super-Resolution deals with ways to improve video content to increase optical resolution. The core idea: fusion of the visual content in several images can be performed and this can lead to a better resolution outcome. For years it has been assumed that such fusion requires knowing the exact motion the objects undergo within the scene. Since this motion may be quite complex in general, this stood as a major obstacle for industrial applications. Three years ago a breakthrough was made in this field, making it possible to bypass the need for exact motion estimation. In this lecture we shall survey the work in this field from its early days (25 years ago) until very recently, and show the evolution of ideas and results obtained. No prior knowledge in image processing is required.

This course (4 lectures and one tutorial) presents the core ideas and achievements made in the field of sparse and redundant representation modeling, with emphasis on the impact of this field on image processing applications. The five lectures (given as PPTX and PDF) are organized as follows:

Lecture 1: The core sparse approximation problem and pursuit algorithms that aim to approximate its solution.

Lecture 2: The theory on the uniqueness of the sparsest solution of a linear system, the notion of stability for the noisy case, guarantees for the performance of pursuit algorithms using the mutual coherence and the RIP.

Lecture 3: Signal (and image) models and their importance, the Sparseland model and its use, analysis versus synthesis modeling, a Bayesian estimation point of view, dictionary learning with the MOD and the K-SVD, global and local image denoising, local image inpainting.

Lecture 4: Sparse representations in image processing – image deblurring, global image separation and image inpainting, using dictionary learning for image and video denoising and inpainting, image scale-up using a pair of learned dictionaries, facial image compression with the K-SVD.
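
As a small illustration of the coherence-based guarantees mentioned for Lecture 2, the sketch below computes the mutual coherence of a toy dictionary and evaluates the classical coherence-based uniqueness bound, under which a representation with fewer than (1 + 1/μ)/2 nonzeros is the sparsest possible. The dictionary and the function names are assumptions made for illustration, not material from the course.

```python
import math


def mutual_coherence(D):
    """Largest absolute normalized inner product between two distinct columns of D.

    D is a list of rows; every column is assumed to be nonzero.
    """
    n, m = len(D), len(D[0])
    cols = [[D[i][j] for i in range(n)] for j in range(m)]
    norms = [math.sqrt(sum(c * c for c in col)) for col in cols]
    mu = 0.0
    for a in range(m):
        for b in range(a + 1, m):
            inner = sum(p * q for p, q in zip(cols[a], cols[b]))
            mu = max(mu, abs(inner) / (norms[a] * norms[b]))
    return mu


def uniqueness_bound(mu):
    """Coherence-based bound: representations with fewer nonzeros are unique."""
    if mu == 0.0:
        return math.inf  # orthogonal columns: no restriction from this bound
    return 0.5 * (1.0 + 1.0 / mu)


# Toy dictionary: the two standard basis vectors of the plane plus their diagonal.
D = [
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
]
mu = mutual_coherence(D)
bound = uniqueness_bound(mu)
```

For this highly coherent toy dictionary the bound certifies uniqueness only for representations with a single nonzero, which shows how pessimistic a worst-case measure such as the mutual coherence can be.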

Scaling up a single image while preserving its sharpness and visual quality is a difficult and highly ill-posed inverse problem. A series of algorithms have been proposed over the years for its solution, with varying degrees of success. In CVPR 2008, Yang, Wright, Huang and Ma proposed a solution to this problem based on sparse representation modeling and dictionary learning. In this talk I present a variant of their method with several important differences. In particular, the proposed algorithm does not need a separate training phase, as the dictionaries are learned directly from the image to be scaled up. Furthermore, the high-resolution dictionary is learned differently, by forcing its alignment with the low-resolution one. We show the benefit these modifications bring in terms of simplicity of the overall algorithm, and its output quality.

In this talk we describe applications such as image denoising and beyond using sparse and redundant representations. Our focus is on ways to perform these tasks with trained dictionaries using the K-SVD algorithm. As trained dictionaries are limited in handling small image patches, we deploy these within a Bayesian reconstruction procedure by forming an image prior that forces every patch in the resulting image to have a sparse representation.

Super-resolution reconstruction proposes a fusion of several low-quality images into one higher-quality result with better optical resolution. Classic super-resolution techniques strongly rely on the availability of accurate motion estimation for this fusion task. When the motion is estimated inaccurately, as often happens for non-global motion fields, annoying artifacts appear in the super-resolved outcome. Encouraged by recent developments on the video denoising problem, where state-of-the-art algorithms are formed with no explicit motion estimation, we seek a super-resolution algorithm of similar nature that will allow processing sequences with general motion patterns. In this talk we base our solution on the Non-Local-Means (NLM) algorithm. We show how this denoising method is generalized to become a relatively simple super-resolution algorithm with no explicit motion estimation. Results on several test movies show that the proposed method is very successful in providing super-resolution on general sequences.
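
For readers unfamiliar with the NLM algorithm this talk builds on, the following is a minimal sketch of plain NLM denoising on a 1D signal. It is not the super-resolution generalization described above; the parameter names, default values, and border handling are assumptions made for illustration.

```python
import math


def nlm_denoise_1d(signal, patch_radius=1, search_radius=3, h=1.0):
    """Non-local means on a 1D signal.

    Each output sample is a weighted average of nearby samples, where the
    weight reflects how similar the patches around the two samples are.
    """
    n = len(signal)

    def patch(center):
        # Patch around `center`, replicating the border samples.
        return [
            signal[min(max(center + d, 0), n - 1)]
            for d in range(-patch_radius, patch_radius + 1)
        ]

    output = []
    for i in range(n):
        reference = patch(i)
        weighted_sum = 0.0
        weight_total = 0.0
        for j in range(max(0, i - search_radius), min(n, i + search_radius + 1)):
            dist2 = sum((a - b) ** 2 for a, b in zip(reference, patch(j)))
            weight = math.exp(-dist2 / (h * h))
            weighted_sum += weight * signal[j]
            weight_total += weight
        # weight_total is at least 1 because the sample always matches itself.
        output.append(weighted_sum / weight_total)
    return output


noisy = [0.1, -0.2, 0.05, 5.1, 4.9, 5.2, 5.0]
denoised = nlm_denoise_1d(noisy, h=0.5)
```

No motion or correspondence is estimated explicitly: similar patches simply receive large weights wherever they are found inside the search window, which is the property the talk exploits when moving from denoising to super-resolution.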

In this survey talk we focus on the use of sparse and redundant representations and learned dictionaries for image denoising and other related problems. We discuss the K-SVD algorithm for learning a dictionary that describes the image content effectively. We then show how to harness this algorithm for image denoising, by working on small patches and forcing sparsity over the trained dictionary. The above is extended to color image denoising and inpainting, video denoising, and facial image compression, leading in all these cases to state-of-the-art results. We conclude with very recent results on the use of several sparse representations for getting better denoising performance. An algorithm to generate such a set of representations is developed, and our analysis shows that by this method we approximate the minimum-mean-squared-error (MMSE) estimator, thus getting better results.