Poster session and cocktail reception
August 14, 2018, 5:00 PM - 7:00 PM
Feasible Level-set Methods for Optimization with Stochastic or Data-driven Constraints, Qihang Lin, University of Iowa
We consider constrained optimization problems in which the objective function and the constraints are given as either finite sums or expectations. We propose a new feasible level-set method for this class of problems, which produces a feasible solution path. To update a level parameter toward optimality, the method requires an oracle that generates upper and lower bounds as well as an affine minorant of the level function. To construct the desired oracle, we reformulate the level function as the value of a saddle-point problem using the conjugates and perspectives of the constraints. A stochastic gradient method with a special Bregman divergence is then proposed as the oracle for solving that saddle-point problem; the special divergence ensures that the proximal mapping in each iteration can be solved in closed form. The total complexity of the level-set method using the proposed oracle is analyzed.
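The level-set idea can be illustrated on a toy problem. The sketch below (all names hypothetical) replaces the oracle-driven level update described above with plain bisection on a grid-discretized level function L(r) = min_x max(f(x) - r, g(x)), whose root is the optimal value of the constrained problem; it illustrates the general principle, not the proposed method.

```python
import numpy as np

def level_value(r, xs, f, g):
    # L(r) = min_x max(f(x) - r, g(x)), approximated on a grid
    return np.min(np.maximum(f(xs) - r, g(xs)))

def level_set_bisection(f, g, lo, hi, xs, tol=1e-6):
    # L(r) is nonincreasing in r and crosses zero at the optimal value r* = f*
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if level_value(mid, xs, f, g) > 0:
            lo = mid      # target level not achievable: raise it
        else:
            hi = mid      # target level achievable: shrink from above
    return 0.5 * (lo + hi)

f = lambda x: x**2          # objective
g = lambda x: 1.0 - x       # constraint g(x) <= 0, i.e. x >= 1
xs = np.linspace(-3, 3, 100001)
r_star = level_set_bisection(f, g, 0.0, 10.0, xs)
print(r_star)  # close to the true optimal value f* = 1
```

The bisection plays the role of the level-parameter update; the oracle in the poster replaces the exhaustive grid minimization with stochastic saddle-point solves.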
Bounding and Counting Linear Regions of Deep Neural Networks, Thiago Serra, Mitsubishi Electric Research Labs
We investigate the complexity of deep neural networks (DNNs) that represent piecewise linear (PWL) functions. In particular, we study the number of linear regions that a PWL function represented by a DNN can attain, both theoretically and empirically. We present (i) tighter upper and lower bounds on the maximum number of linear regions of rectifier networks, which are exact for inputs of dimension one; (ii) the first upper bound for multi-layer maxout networks; and (iii) the first method to perform exact enumeration or counting of the number of regions by modeling the DNN with a mixed-integer linear formulation. These bounds come from leveraging the dimension of the space defining each linear region. The results also indicate that a deep rectifier network can only have more linear regions than any shallow counterpart with the same number of neurons if that number exceeds the dimension of the input.
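The empirical side of region counting can be sketched cheaply for one-dimensional inputs, where the bound is exact: a single hidden layer of n ReLUs partitions the line into at most n + 1 linear regions, one per activation pattern. A minimal sketch (random network, names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# A one-hidden-layer ReLU network with 1-D input and n hidden units:
# each unit's preactivation flips sign at most once along the line,
# so there are at most n + 1 distinct linear regions.
n = 5
W1 = rng.standard_normal((n, 1))
b1 = rng.standard_normal(n)

def activation_pattern(x):
    # Which ReLUs are active at input x determines the linear region.
    pre = W1[:, 0] * x + b1
    return tuple(pre > 0)

xs = np.linspace(-10, 10, 20001)
patterns = {activation_pattern(x) for x in xs}
print(len(patterns))  # at most n + 1 = 6 distinct regions along the line
```

For higher-dimensional inputs the count grows much faster, which is where the mixed-integer formulation in the poster becomes necessary.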
Level-set Methods for Finite-sum Constrained Convex Optimization, Runchao Ma, University of Iowa
We consider constrained optimization problems in which the objective function and the constraints are defined as sums of finitely many loss functions. This model has applications in machine learning such as Neyman-Pearson classification. We consider two level-set methods for this class of problems: an existing inexact Newton method and a new feasible level-set method. To update the level parameter toward optimality, both methods require an oracle that generates upper and lower bounds as well as an affine minorant of the level function. To construct the desired oracle, we reformulate the level function as the value of a saddle-point problem using the conjugates and perspectives of the loss functions. A stochastic variance-reduced gradient method with a special Bregman divergence is then proposed as the oracle for solving that saddle-point problem; the special divergence ensures that the proximal mapping in each iteration can be solved in closed form. The total complexity of both level-set methods using the proposed oracle is analyzed.
A Machine Learning Approximation Algorithm for Fast Prediction of Solutions to Discrete Optimization Problems, Sébastien Lachapelle, Montreal Institute for Learning Algorithms
This paper provides a methodological contribution at the intersection of machine learning and operations research. Namely, we propose a methodology to predict descriptions of solutions to discrete stochastic optimization problems in very short computing time. We approximate the solutions based on supervised learning, where the training dataset consists of a large number of deterministic problems that have been solved independently (and offline). Uncertainty regarding a subset of the inputs is addressed through sampling and aggregation methods. Our motivating application concerns booking decisions for intermodal containers on double-stack trains. Under perfect information, this is the so-called load planning problem, which can be formulated by means of integer linear programming. However, the formulation cannot be used for the application at hand because of the restricted computational budget and unknown container weights. The results show that standard deep learning algorithms allow us to predict descriptions of solutions with high accuracy in very short time (milliseconds or less).
On the Convergence of Stochastic Gradient Descent with Adaptive Stepsizes, Xiaoyu Li, Stony Brook University
Stochastic gradient descent is the method of choice for large-scale optimization of machine learning objective functions. Yet, its performance is highly variable and depends heavily on the choice of the stepsizes. This has motivated a large body of research on adaptive stepsizes. However, there is currently a gap in our theoretical understanding of these methods, especially in the non-convex setting. In this work, we start closing this gap: we theoretically analyze the use of adaptive stepsizes, like the ones in AdaGrad, in the non-convex setting. We show sufficient conditions for almost sure convergence to a stationary point when adaptive stepsizes are used, proving the first guarantee for AdaGrad in the non-convex setting. Moreover, we show explicit rates of convergence that automatically interpolate between $O(1/T)$ and $O(1/\sqrt{T})$ depending on the noise of the stochastic gradients, in both the convex and non-convex settings.
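As a minimal illustration of an AdaGrad-style adaptive stepsize (the scalar "AdaGrad-norm" variant, not the full coordinate-wise method analyzed in the poster; all names hypothetical), the stepsize eta / sqrt(b0 + sum of squared gradient norms) shrinks automatically as gradients accumulate:

```python
import numpy as np

def adagrad_norm(grad, x0, eta=1.0, b0=1e-8, iters=500):
    # Global (scalar) AdaGrad stepsize: eta / sqrt(b0 + sum of squared grad norms)
    x, acc = np.asarray(x0, dtype=float), b0
    for _ in range(iters):
        g = grad(x)
        acc += np.dot(g, g)
        x = x - (eta / np.sqrt(acc)) * g
    return x

# Minimize f(x) = 0.5 * ||x||^2, whose gradient is x itself
x_final = adagrad_norm(lambda x: x, x0=[5.0, -3.0])
print(np.linalg.norm(x_final))  # near 0
```

No stepsize tuning was needed here; the accumulated gradient norms adapt the effective stepsize, which is the behavior the convergence rates above capture.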
Distributed First-order Algorithms with Gradient Tracking Converge to Second-order Stationary Solutions, Amir Daneshmand, Purdue University
Several first-order optimization algorithms have recently been shown to converge to second-order stationary solutions of smooth nonconvex optimization problems under mild conditions. Whether an analogous guarantee holds for decentralized schemes solving nonconvex smooth multiagent problems over networks, modeled as static directed graphs, has been an open question. This work provides a positive answer: we prove that the family of decentralized algorithms employing distributed gradient tracking converges to second-order stationary solutions, under standard assumptions on the step size.
Estimation of Individualized Decision Rules Based on an Optimized Covariate-dependent Equivalent of Random Outcomes, Zhengling Qi, University of North Carolina
Recent exploration of optimal individualized decision rules (IDRs) for patients in precision medicine has attracted a lot of attention due to the heterogeneous responses of patients to different treatments. In the existing literature on precision medicine, an optimal IDR is defined as a decision function mapping from the patients' covariate space into the treatment space that maximizes the expected outcome of each individual. Motivated by the concept of the Optimized Certainty Equivalent (OCE), introduced originally by Ben-Tal and Teboulle (2007), which includes the popular conditional value-at-risk (CVaR) of Rockafellar and Uryasev (2000), we propose a decision-rule-based optimized covariate-dependent equivalent (CDE) for individualized decision-making problems. Our proposed IDR-CDE broadens the existing expected-outcome framework in precision medicine and enriches the previous concept of the OCE. Under a functional margin description of the decision rule modeled by an indicator function, as in the literature on large-margin classifiers, we study the mathematical problem of estimating an optimal IDR in two cases: in one case, an optimal solution can be obtained ``explicitly'' via the implicit evaluation of an OCE; the other case requires the numerical solution of an empirical minimization problem obtained by sampling the underlying distributions of the random variables involved. A major challenge of the latter optimization problem is that it involves a discontinuous objective function. We show that, under a mild condition at the population level of the model, the epigraphical formulation of this empirical optimization problem is a piecewise affine, and thus difference-of-convex (dc), constrained dc program, which is nonconvex. A simplified dc algorithm is employed to solve the resulting program, and its convergence to a new kind of stationary solution is established.
Numerical experiments demonstrate that our overall approach outperforms existing methods in estimating optimal IDRs under heavy-tailed data distributions. In addition to providing a risk-based approach for individualized medical treatments, which is new in the area of precision medicine, the main contributions of this work include: the broadening of the concept of the OCE, the epigraphical description of the empirical IDR-CDE minimization problem and its equivalent dc formulation, and the optimization of the resulting piecewise affine constrained dc program.
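The OCE family the authors build on includes CVaR via the Rockafellar-Uryasev variational formula, CVaR_alpha(X) = min_c { c + E[(X - c)_+] / (1 - alpha) }. A small self-contained illustration of that formula (not the IDR-CDE estimator itself; sample sizes and grid are hypothetical choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def cvar_ru(samples, alpha):
    # Rockafellar-Uryasev: CVaR_alpha(X) = min_c { c + E[(X - c)_+] / (1 - alpha) }
    # The objective is piecewise linear in c, so a fine quantile grid suffices.
    cs = np.quantile(samples, np.linspace(0.5, 0.999, 500))
    vals = [c + np.mean(np.maximum(samples - c, 0.0)) / (1.0 - alpha) for c in cs]
    return min(vals)

losses = rng.standard_normal(100_000)
est = cvar_ru(losses, alpha=0.95)
tail = np.sort(losses)[-5_000:].mean()   # CVaR is also the mean of the worst 5%
print(est)  # agrees with the tail mean; analytic value for N(0,1) is about 2.06
```

The minimizing c is the value-at-risk, which is the kind of implicit evaluation the "explicit" case of the estimation problem exploits.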
A Stochastic Trust Region Algorithm Based on Careful Step Normalization, Rui Shi, Lehigh University
In this poster we present a new stochastic trust-region method, dubbed TRish, for solving stochastic and finite-sum minimization problems. We motivate our approach by illustrating how it can be derived from a trust-region methodology. However, we also illustrate how a direct adaptation of a trust-region methodology might fail to yield general convergence guarantees. Hence, our approach involves a modified update scheme, which we prove possesses convergence guarantees similar to those for a traditional stochastic gradient (SG) method. We also present numerical results showing that TRish can outperform SG when solving convex and nonconvex machine learning test problems.
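The "careful step normalization" can be sketched under the assumption of a clamp-style rule with constants gamma1 > gamma2 > 0 (the exact TRish scheme and constants are specified in the poster; this is only a sketch in that spirit, with hypothetical values): small gradients are scaled up, large ones scaled down, and mid-range ones normalized to unit length.

```python
import numpy as np

def normalized_step(x, g, alpha=0.1, gamma1=4.0, gamma2=0.25):
    # Step multiplier clamped between gamma2 and gamma1: behaves like a
    # trust-region-style normalized step for mid-range gradient norms.
    norm = np.linalg.norm(g)
    scale = min(gamma1, max(gamma2, 1.0 / norm)) if norm > 0 else 0.0
    return x - alpha * scale * g

rng = np.random.default_rng(1)
x = np.array([4.0, -2.0])
for _ in range(2000):
    noise = 0.01 * rng.standard_normal(2)   # stochastic gradient of 0.5*||x||^2
    x = normalized_step(x, x + noise)
print(np.linalg.norm(x))  # settles near the minimizer at the origin
```

The clamp prevents the raw normalization g/||g|| from blowing up steps when the stochastic gradient is nearly zero, which is the failure mode a naive trust-region adaptation runs into.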
Revisiting the Foundations of Randomized Gossip Algorithms, Nicolas Loizou, University of Edinburgh
In this work we present a new framework for the analysis and design of randomized gossip algorithms for solving the average consensus problem. We show how randomized Kaczmarz-type methods, classical methods for solving linear systems, work as gossip algorithms when applied to a special system encoding the underlying network, and we explain their distributed nature. We reveal a hidden duality of randomized gossip algorithms, with the dual iterative process maintaining a set of numbers attached to the edges rather than the nodes of the network. Using the new framework, we then propose novel block and accelerated randomized gossip protocols.
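The basic pairwise gossip step, viewed as a Kaczmarz projection onto one equation x_i = x_j of the consensus linear system, can be sketched as follows (a complete network is assumed for simplicity; all names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# n nodes, each holding a private value; gossip should drive all of them
# to the network average.
n = 10
x = rng.standard_normal(n) * 5.0
target = x.mean()

for _ in range(2000):
    i, j = rng.choice(n, size=2, replace=False)
    # Pairwise-averaging gossip step; equivalently a Kaczmarz projection
    # onto the hyperplane {x : x_i = x_j} of the consensus linear system.
    x[i] = x[j] = 0.5 * (x[i] + x[j])

print(np.max(np.abs(x - target)))  # all nodes essentially at the average
```

Each step preserves the sum of the values while shrinking their spread, which is exactly the behavior the Kaczmarz viewpoint makes precise.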
Underestimate Sequences via Quadratic Averaging, Majid Jahani, Lehigh University
In this work we introduce the concept of an Underestimate Sequence (UES), which is a natural extension of Nesterov's estimate sequence. Our definition of a UES utilizes three sequences, one of which is a lower bound (or under-estimator) of the objective function. The question of how to construct an appropriate sequence of lower bounds is also addressed, and we present lower bounds for strongly convex smooth functions and for strongly convex composite functions, which adhere to the UES framework. Further, we propose several first-order methods for minimizing strongly convex functions in both the smooth and composite cases. The algorithms, based on efficiently updating lower bounds on the objective functions, have natural stopping conditions, which provide the user with a certificate of optimality. Convergence of all algorithms is guaranteed through the UES framework, and we show that all presented algorithms converge linearly, with the accelerated variants enjoying the optimal linear rate of convergence.
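The under-estimator that drives a UES can be illustrated with the standard strong-convexity minorant: for a mu-strongly convex f, f(y) >= f(x) + <grad f(x), y - x> + (mu/2)||y - x||^2 for all y. A quick numerical check of this minorant (toy quadratic, names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# f(x) = 0.5 x^T A x is mu-strongly convex with mu = smallest eigenvalue of A
A = np.array([[3.0, 1.0], [1.0, 2.0]])
mu = np.linalg.eigvalsh(A).min()
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

def lower_bound(x, y):
    # Strong-convexity minorant: f(y) >= f(x) + <grad f(x), y - x> + mu/2 ||y - x||^2
    d = y - x
    return f(x) + grad(x) @ d + 0.5 * mu * (d @ d)

x = rng.standard_normal(2)
ys = rng.standard_normal((1000, 2))
gap = min(f(y) - lower_bound(x, y) for y in ys)
print(gap >= -1e-12)  # the minorant never exceeds f
```

Aggregating such minorants across iterates is what yields the computable lower bounds, and hence the optimality certificates, in the UES framework.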
A Unifying Scheme of Primal-Dual Algorithms for Distributed Optimization, Fatemeh Mansoori, Northwestern University
We study the problem of minimizing a sum of local convex objective functions over a network of processors/agents. Each agent in the network has access to one component of the objective function and can communicate with its neighbors. This problem can be formulated in a distributed framework by defining local copies of the decision variable for the agents. Each agent then works toward decreasing its local cost function while keeping its variable equal to those of its neighbors. Many existing distributed algorithms with constant stepsize can only converge to a neighborhood of the optimal solution. To circumvent this shortcoming, we develop a class of distributed primal-dual algorithms based on the augmented Lagrangian. The dual updates ensure exact convergence, and the augmented quadratic term helps guarantee convergence. To improve convergence speed, we design algorithms with multiple primal updates per iteration. We show that such algorithms converge to the optimal solution under appropriate constant stepsize choices. Simulation results confirm the superiority of our algorithms over those with only one primal update per iteration. The proposed class of algorithms and the convergence guarantees extend to general linearly-constrained convex optimization problems.
Extrapolation of Finite Element Simulation with Graph Convolutional LSTM, Yue Niu, Lehigh University
Bioprosthetic heart valves are one of the best choices for heart valve replacement. However, the fact that their life span is generally limited to 15 years is a major concern for their clinical use. Several groups are studying the underlying mechanism of the degradation with different methods. Although numerical modeling has certain advantages over in vivo experiments, the computational cost of complete simulations is not yet acceptable. Here, we introduce an LSTM-based neural network model to alleviate this resource consumption. It takes the simulation results of the initial cycles, together with additional parameters that can easily be measured, as guidance to predict later simulation results. Owing to the finite element data structure, graph convolutions are also applied to smooth the result and achieve smaller error. Results show that this model can track the slow changes across cycles. The hope is that, in this scheme, some slowly changing stages can be skipped and the simulation thereby accelerated significantly.
Expected Risk and AUC Optimization without Dependence on the Data Set Size, Minhan Li, Lehigh University
We present a method for directly optimizing the expected risk and expected AUC of linear classifiers, proposed by Ghanbari and Scheinberg in 2017. This method constructs an optimization problem whose size is independent of the size of the data set. Empirical results show that it can produce competitive predictors in a fraction of the time required to solve a typical empirical risk minimization problem. However, for high-dimensional problems this method is computationally costly. We propose further improvements to reduce the computational cost in the large-dimensional case.
Robust Learning of Trimmed Estimators via Manifold Sampling, Matt Menickelly, Argonne National Laboratory
We adapt a manifold sampling algorithm for nonsmooth, nonconvex formulations of learning while remaining robust to outliers in the training data. We demonstrate the approach on objectives based on trimmed loss. Empirical results show that the method has favorable scaling properties. Although the savings in time come at the expense of not certifying global optimality, the algorithm consistently returns high-quality solutions on the trimmed linear regression and multi-class classification problems tested.
Projective Splitting with Forward Steps: Asynchronous and Block-iterative Operator Splitting, Patrick Johnstone, Rutgers University
This work is concerned with the classical problem of finding a zero of a sum of maximal monotone operators. For the projective splitting framework recently proposed by Combettes and Eckstein, we show how to replace the fundamental subproblem calculation using a backward step with one based on two forward steps. The resulting algorithms have the same kind of coordination procedure and can be implemented in the same block-iterative and potentially distributed and asynchronous manner, but may perform backward steps on some operators and forward steps on others. Prior algorithms in the projective splitting family have used only backward steps. Forward steps can be used for any Lipschitz-continuous operator provided the stepsize is bounded by the inverse of the Lipschitz constant. If the Lipschitz constant is unknown, a simple backtracking linesearch procedure may be used. Interestingly, this backtracking procedure also leads to a convergent algorithm even when the operator is only uniformly continuous, not necessarily Lipschitz, provided it has full domain. For affine operators, the stepsize can be chosen adaptively without knowledge of the Lipschitz constant and without any additional forward steps. Convergence rates under various assumptions are also discussed, along with preliminary experiments applying several kinds of splitting algorithms to the lasso problem.
Inexact SARAH for Solving Stochastic Optimization Problems, Lam Nguyen, Lehigh University
We consider a general stochastic optimization problem that is often at the core of supervised learning, such as deep learning and linear classification. SVRG and SARAH cannot be applied to this problem because they require exact gradient information. We consider Inexact SARAH, which does not require computing an exact gradient at each outer iteration, and analyze it in the strongly convex, convex, and nonconvex cases. We also compare it with some existing stochastic algorithms.
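The recursive SARAH gradient estimator, with the outer full gradient replaced by a subsampled (inexact) estimate, can be sketched on a toy least-squares problem (the stepsize, batch sizes, and all names are hypothetical illustration choices, not the analyzed parameters):

```python
import numpy as np

rng = np.random.default_rng(0)

# Least squares: f(w) = 1/(2n) ||X w - y||^2, a finite sum over samples
n, d = 200, 5
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true

def grad_i(w, i):                       # gradient of the i-th component
    return X[i] * (X[i] @ w - y[i])

def inexact_sarah_epoch(w, eta=0.02, inner=200, batch=50):
    # Inexact SARAH: v_0 is estimated from a subsample, not the full gradient
    idx = rng.choice(n, size=batch, replace=False)
    v = np.mean([grad_i(w, i) for i in idx], axis=0)
    w_prev = w.copy()
    w = w - eta * v
    for _ in range(inner):
        i = rng.integers(n)
        # Recursive SARAH update: v_t = g_i(w_t) - g_i(w_{t-1}) + v_{t-1}
        v = grad_i(w, i) - grad_i(w_prev, i) + v
        w_prev, w = w, w - eta * v
    return w

w = np.zeros(d)
for _ in range(20):
    w = inexact_sarah_epoch(w)
print(np.linalg.norm(w - w_true))  # decreases toward 0
```

The only change relative to SARAH is the subsampled initialization of v; the recursive inner update is untouched, which is what makes the inexact analysis tractable.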
Kernel Methods: Optimal Online Compressions, and Controlling Error Variance, Alec Koppel, U.S. Army Research Laboratory
In supervised learning, we learn a statistical model by minimizing some measure of fitness averaged over data. Doing so, however, ignores the model variance, which quantifies the gap between the best model within a hypothesized function class and the Bayes risk. We propose to account for both bias and variance by modifying the training procedure to incorporate a coherent risk measure, which quantifies the uncertainty of a given decision. We develop the first online iterative solution to this problem when estimators belong to a reproducing kernel Hilbert space (RKHS), which we call Compositional Online Learning with Kernels (COLK). COLK addresses the facts that (i) minimizing risk functions requires handling objectives which are compositions of expected-value problems, which we do by generalizing the two-time-scale stochastic quasi-gradient method to RKHSs; and (ii) the RKHS-induced parameterization has complexity proportional to the iteration index, which we mitigate through greedily constructed subspace projections. We establish almost sure convergence of COLK with attenuating step-sizes, linear convergence in mean to a neighborhood with constant step-sizes, and the fact that its worst-case complexity is bounded. Experiments on data with heavy-tailed distributions illustrate that COLK exhibits robustness to outliers and consistent performance across training runs, and thus marks a step toward ameliorating the problem of overfitting.
Towards Fast Computation of Certified Robustness for ReLU Networks, Tsui-Wei (Lily) Weng, MIT
Verifying the robustness property of a general Rectified Linear Unit (ReLU) network is an NP-complete problem. Although finding the exact minimum adversarial distortion is hard, giving a certified lower bound on the minimum distortion is possible. Currently available methods for computing such a bound are either time-consuming or deliver bounds that are too loose to be useful. In this work, we exploit the special structure of ReLU networks and provide two computationally efficient algorithms (Fast-Lin, Fast-Lip) that certify non-trivial lower bounds on minimum adversarial distortions. Experiments show that (1) our methods deliver bounds close to the exact minimum distortions found by Reluplex on small networks (the gap is 2-3x) while our algorithms are more than 10,000 times faster; (2) on larger networks, our methods deliver bounds of similar quality to those based on solving linear programming problems (the gap is within 35% and usually around 10%; sometimes our bounds are even better) while our algorithms are 33 to 14,000 times faster; (3) our methods can handle large MNIST and CIFAR networks of up to 7 layers and more than 10,000 neurons within tens of seconds on a single CPU core.
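Fast-Lin and Fast-Lip themselves are specified in the poster; as a simpler illustration of the same certification idea, naive interval arithmetic also yields guaranteed (though much looser) output bounds for a ReLU network over an input box (random network, names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def interval_linear(W, b, lo, hi):
    # Propagate an input box [lo, hi] through x -> Wx + b with interval arithmetic
    Wp, Wn = np.maximum(W, 0), np.minimum(W, 0)
    return Wp @ lo + Wn @ hi + b, Wp @ hi + Wn @ lo + b

def interval_relu_net(weights, biases, lo, hi):
    for k, (W, b) in enumerate(zip(weights, biases)):
        lo, hi = interval_linear(W, b, lo, hi)
        if k < len(weights) - 1:          # ReLU on all but the output layer
            lo, hi = np.maximum(lo, 0), np.maximum(hi, 0)
    return lo, hi

# Random 2-layer ReLU network and an L-infinity ball around a point x0
Ws = [rng.standard_normal((8, 3)), rng.standard_normal((2, 8))]
bs = [rng.standard_normal(8), rng.standard_normal(2)]
x0, eps = rng.standard_normal(3), 0.1
lo, hi = interval_relu_net(Ws, bs, x0 - eps, x0 + eps)

def forward(x):
    h = np.maximum(Ws[0] @ x + bs[0], 0)
    return Ws[1] @ h + bs[1]

# Sanity check: sampled outputs from the ball stay inside the certified box
samples = x0 + eps * (2 * rng.random((500, 3)) - 1)
outs = np.array([forward(x) for x in samples])
print(np.all(outs >= lo - 1e-9) and np.all(outs <= hi + 1e-9))  # True
```

Fast-Lin tightens such bounds substantially by tracking linear upper and lower envelopes of each ReLU rather than plain intervals.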