Towards Lower Bounds on the Depth
of ReLU Neural Networks
Authors’ accepted manuscript; to appear in the SIAM Journal on Discrete Mathematics. A preliminary conference version appeared in the proceedings of the NeurIPS 2021 conference. We thank the anonymous referees of both the journal and the conference version for their insightful comments which helped to improve the presentation and clarity.
Christoph Hertrich gratefully acknowledges funding by DFG-GRK 2434 “Facets of Complexity”. Amitabh Basu gratefully acknowledges support from AFOSR Grant FA95502010341 and NSF Grant CCF2006587. Martin Skutella gratefully acknowledges funding by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy — The Berlin Mathematics Research Center MATH+ (EXC-2046/1, project ID: 390685689).
Abstract
We contribute to a better understanding of the class of functions that can be represented by a neural network with ReLU activations and a given architecture. Using techniques from mixed-integer optimization, polyhedral theory, and tropical geometry, we provide a mathematical counterbalance to the universal approximation theorems which suggest that a single hidden layer is sufficient for learning any function. In particular, we investigate whether the class of exactly representable functions strictly increases by adding more layers (with no restrictions on size). As a by-product of our investigations, we settle an old conjecture about piecewise linear functions by Wang and Sun [74] in the affirmative. We also present upper bounds on the sizes of neural networks required to represent functions with logarithmic depth.
1 Introduction
A core problem in machine learning and statistics is the estimation of an unknown data distribution with access to independent and identically distributed samples from the distribution. It is well-known that there is a tension between the expressivity of the model chosen to approximate the distribution and the number of samples needed to solve the problem with high confidence (or equivalently, the variance one has in one’s estimate). This is referred to as the bias-variance trade-off or the bias-complexity trade-off. Neural networks provide a way to turn this bias-complexity knob in a controlled manner that has been studied for decades going back to the idea of a perceptron by Rosenblatt [62]. This is done by modifying the architecture of a neural network class of functions, in particular its size in terms of depth and width. As one increases these parameters, the class of functions becomes more expressive. In terms of the bias-variance trade-off, the “bias” decreases as the class of functions becomes more expressive, but the “variance” or “complexity” increases.
So-called universal approximation theorems [5, 18, 40] show that even with a single hidden layer, that is, when the depth of the architecture achieves its smallest possible value, one can essentially reduce the “bias” as much as one desires, by increasing the width. Nevertheless, it can be advantageous both theoretically and empirically to increase the depth because a substantial reduction in the size can be achieved by this [6, 21, 46, 63, 69, 70, 75]. To get a better quantitative handle on these trade-offs, it is important to understand what classes of functions are exactly representable by neural networks with a certain architecture. The precise mathematical statements of universal approximation theorems show that single layer networks can arbitrarily well approximate any continuous function (under some additional mild hypotheses). While this suggests that single layer networks are good enough from a learning perspective, from a mathematical perspective, one can ask the question if the class of functions represented by single layer networks is a strict subset of the class of functions represented by networks with two or more hidden layers. On the question of size, one can ask for precise bounds on the required width of a network with given depth to represent a certain class of functions. A better understanding of the function classes exactly represented by different architectures has implications not just for mathematical foundations, but also algorithmic and statistical learning aspects of neural networks, as recent advances on the training complexity show [6, 11, 28, 23, 42]. The task of searching for the “best” function in a class can only benefit from a better understanding of the nature of functions in that class. A motivating question behind the results in this paper is to understand the hierarchy of function classes exactly represented by neural networks of increasing depth.
We now introduce more precise notation and terminology to set the stage for our investigations.
1.1 Notation and Definitions
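We write $[n] := \{1, 2, \dots, n\}$ for the set of natural numbers up to $n$ (without zero) and $[n]_0 := [n] \cup \{0\}$ for the same set including zero. For any $n \in \mathbb{N}$, let $\sigma\colon \mathbb{R}^n \to \mathbb{R}^n$ be the component-wise rectifier function
$$\sigma(x) = (\max\{0, x_1\}, \max\{0, x_2\}, \dots, \max\{0, x_n\}).$$
For any number of hidden layers $k \in \mathbb{N}$, a $(k+1)$-layer feedforward neural network with rectified linear units (ReLU NN or simply NN) is given by $k$ affine transformations $T^{(\ell)}\colon \mathbb{R}^{n_{\ell-1}} \to \mathbb{R}^{n_\ell}$, $x \mapsto A^{(\ell)} x + b^{(\ell)}$, for $\ell \in [k]$, and a linear transformation $T^{(k+1)}\colon \mathbb{R}^{n_k} \to \mathbb{R}^{n_{k+1}}$, $x \mapsto A^{(k+1)} x$. It is said to compute or represent the function $f\colon \mathbb{R}^{n_0} \to \mathbb{R}^{n_{k+1}}$ given by
$$f = T^{(k+1)} \circ \sigma \circ T^{(k)} \circ \sigma \circ \dots \circ T^{(2)} \circ \sigma \circ T^{(1)}.$$
The matrices $A^{(\ell)} \in \mathbb{R}^{n_\ell \times n_{\ell-1}}$ are called the weights and the vectors $b^{(\ell)} \in \mathbb{R}^{n_\ell}$ are the biases of the $\ell$-th layer. The number $n_\ell \in \mathbb{N}$ is called the width of the $\ell$-th layer. The maximum width of all hidden layers $\max_{\ell \in [k]} n_\ell$ is called the width of the NN. Further, we say that the NN has depth $k+1$ and size $\sum_{\ell=1}^{k} n_\ell$.
Often, NNs are represented as layered, directed, acyclic graphs where each dimension of each layer (including input layer $\ell = 0$ and output layer $\ell = k+1$) is one vertex, weights are arc labels, and biases are node labels. Then, the vertices are called neurons.
For a given input $x = x^{(0)} \in \mathbb{R}^{n_0}$, let $y^{(\ell)} := T^{(\ell)}(x^{(\ell-1)}) \in \mathbb{R}^{n_\ell}$ be the activation vector and $x^{(\ell)} := \sigma(y^{(\ell)}) \in \mathbb{R}^{n_\ell}$ the output vector of the $\ell$-th layer. Further, let $y := y^{(k+1)} = f(x)$ be the output of the NN. We also say that the $i$-th component of each of these vectors is the activation or the output of the $i$-th neuron in the $\ell$-th layer.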
To illustrate the definition of NNs and how they compute functions, Figure 1 shows an NN with one hidden layer computing the maximum of two numbers.
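The following minimal sketch evaluates a one-hidden-layer ReLU NN that computes the maximum of two numbers. The weights are one standard choice, based on the identity $\max\{x_1, x_2\} = \max\{0, x_1 - x_2\} + \max\{0, x_2\} - \max\{0, -x_2\}$, and are an assumption of this illustration; they need not coincide with the network drawn in Figure 1. The helper names `relu`, `affine`, and `max_net` are hypothetical.

```python
def relu(v):
    """Component-wise rectifier: max(0, v_i) for every entry."""
    return [max(0.0, t) for t in v]


def affine(A, b, x):
    """Return A x + b for a matrix A given as a list of rows."""
    return [sum(a * t for a, t in zip(row, x)) + bi for row, bi in zip(A, b)]


# Hidden layer (illustrative weights, all biases zero):
# the three neurons see x1 - x2, x2, and -x2 before the ReLU.
A1 = [[1.0, -1.0], [0.0, 1.0], [0.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
# Linear output layer: relu(x1 - x2) + relu(x2) - relu(-x2).
A2 = [[1.0, 1.0, -1.0]]


def max_net(x1, x2):
    """Evaluate the 2-layer NN at the input (x1, x2)."""
    hidden = relu(affine(A1, b1, [x1, x2]))
    return affine(A2, [0.0], hidden)[0]
```

In the terminology above, this NN has depth 2, width 3, and size 3.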
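For $k \in \mathbb{N}$, we define
$$\mathrm{ReLU}_n(k) := \{ f\colon \mathbb{R}^n \to \mathbb{R} \mid f \text{ can be represented by a } (k+1)\text{-layer NN} \},$$
$$\mathrm{CPWL}_n := \{ f\colon \mathbb{R}^n \to \mathbb{R} \mid f \text{ is continuous and piecewise linear} \}.$$
By definition, a continuous function $f\colon \mathbb{R}^n \to \mathbb{R}$ is piecewise linear in case there is a finite set of polyhedra whose union is $\mathbb{R}^n$, and $f$ is affine linear over each such polyhedron.
In order to analyze $\mathrm{ReLU}_n(k)$, we use another function class defined as follows. We call a function $g$ a $p$-term max function if it can be expressed as the maximum of $p$ affine terms, that is, $g(x) = \max\{\ell_1(x), \dots, \ell_p(x)\}$ where $\ell_i\colon \mathbb{R}^n \to \mathbb{R}$ is affine linear for $i \in [p]$. Note that this also includes max functions with fewer than $p$ terms, as some functions $\ell_i$ may coincide. Based on that, we define
$$\mathrm{MAX}_n(p) := \{ f\colon \mathbb{R}^n \to \mathbb{R} \mid f \text{ is a linear combination of } p\text{-term max functions} \}.$$
Note that Wang and Sun [74] call $p$-term max functions $(p-1)$-order hinges and linear combinations of those $(p-1)$-order hinging hyperplanes.
If the input dimension $n$ is not important for the context, we sometimes drop the index and use $\mathrm{ReLU}(k) := \bigcup_{n \in \mathbb{N}} \mathrm{ReLU}_n(k)$ and $\mathrm{MAX}(p) := \bigcup_{n \in \mathbb{N}} \mathrm{MAX}_n(p)$ instead.
We will use the standard notations $\operatorname{conv} A$ and $\operatorname{cone} A$ for the convex and conic hulls of a set $A \subseteq \mathbb{R}^n$. For an in-depth treatment of polyhedra and (mixed-integer) optimization, we refer to the book by Schrijver [64].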
1.2 Representing Piecewise Linear Functions with ReLU Networks
It is not hard to see that every function expressed by a ReLU network is continuous and piecewise linear (CPWL) because it is composed of affine transformations and ReLU functions, which are both CPWL. Based on a result by Wang and Sun [74], Arora et al. [6] prove that the converse is true as well by showing that any CPWL function can be represented with logarithmic depth.
Theorem 1.1 (Arora et al. [6]).
If $n \in \mathbb{N}$ and $k^* := \lceil \log_2(n+1) \rceil$, then $\mathrm{CPWL}_n = \mathrm{ReLU}_n(k^*)$.
Since this result is the starting point for our paper, let us briefly sketch its proof. For this purpose, we start with a simple special case of a CPWL function: the maximum of $n$ numbers. Recall that one hidden layer suffices to compute the maximum of two numbers, see Figure 1. Now one can easily stack this operation: in order to compute the maximum of four numbers, we divide them into two pairs of two numbers each, compute the maximum of each pair, and then the maximum of the two results. This idea results in the NN depicted in Figure 2, which has two hidden layers.
Repeating this procedure, one can compute the maximum of eight numbers with three hidden layers, and, in general, the maximum of $2^k$ numbers with $k$ hidden layers. Phrasing this the other way around, we obtain that the maximum of $p$ numbers can be computed with $\lceil \log_2 p \rceil$ hidden layers. Since NNs can easily form affine combinations, this implies the following lemma.
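The stacking argument can be sketched as follows. The function `layered_max` is a hypothetical helper that simulates the pairing scheme and counts how many hidden layers it uses; it is an illustration under the assumption that each pairwise maximum is realized by the one-hidden-layer identity $\max\{a, b\} = \max\{0, a-b\} + \max\{0, b\} - \max\{0, -b\}$, not a construction of the actual weight matrices.

```python
def relu(t):
    return max(0.0, t)


def max2(a, b):
    """One hidden layer: max(a, b) = relu(a - b) + relu(b) - relu(-b)."""
    return relu(a - b) + relu(b) - relu(-b)


def layered_max(values):
    """Return (maximum, number of hidden layers used) by repeated pairing."""
    level = [float(v) for v in values]
    hidden_layers = 0
    while len(level) > 1:
        nxt = [max2(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            # An unpaired value is passed through unchanged; in an actual NN
            # this costs neurons (t = relu(t) - relu(-t)) but no extra layer.
            nxt.append(level[-1])
        level = nxt
        hidden_layers += 1
    return level[0], hidden_layers
```

For five inputs the loop runs three times, matching $\lceil \log_2 5 \rceil = 3$ hidden layers.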
Lemma 1.2 (Arora et al. [6]).
If $n, k \in \mathbb{N}$, then $\mathrm{MAX}_n(2^k) \subseteq \mathrm{ReLU}_n(k)$.
The question whether the depth of this construction is best possible is one of the central open questions we attack in this paper.
In fact, the maximum function is not just a nice toy example; it is, in some sense, the most difficult of all CPWL functions to represent for a ReLU NN. This is due to a result by Wang and Sun [74] stating that every CPWL function defined on $\mathbb{R}^n$ can be written as a linear combination of $(n+1)$-term max functions.
Theorem 1.3 (Wang and Sun [74]).
If $n \in \mathbb{N}$, then $\mathrm{CPWL}_n = \mathrm{MAX}_n(n+1)$.
The proof given by Wang and Sun [74] is technically involved and we do not go into details here. However, in Section 4 we provide an alternative proof yielding a slightly stronger result. This will be useful to bound the width of NNs representing arbitrary CPWL functions.
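Theorem 1.1 by Arora et al. [6] can now be deduced from combining Lemma 1.2 and Theorem 1.3: In fact, for $k^* = \lceil \log_2(n+1) \rceil$, one obtains
$$\mathrm{CPWL}_n = \mathrm{MAX}_n(n+1) \subseteq \mathrm{MAX}_n(2^{k^*}) \subseteq \mathrm{ReLU}_n(k^*) \subseteq \mathrm{CPWL}_n,$$
and thus equality in the whole chain of subset relations.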
1.3 Our Main Conjecture
We wish to understand whether the logarithmic depth bound in Theorem 1.1 by Arora et al. [6] is best possible or whether one can do better. We believe it is indeed best possible and pose the following conjecture to better understand the importance of depth in neural networks.
Conjecture 1.4.
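For every $n \in \mathbb{N}$, let $k^* := \lceil \log_2(n+1) \rceil$. Then it holds that
$$\mathrm{ReLU}_n(0) \subsetneq \mathrm{ReLU}_n(1) \subsetneq \dots \subsetneq \mathrm{ReLU}_n(k^*-1) \subsetneq \mathrm{ReLU}_n(k^*) = \mathrm{CPWL}_n. \tag{1}$$
Conjecture 1.4 claims that any additional layer up to $k^*$ hidden layers strictly increases the set of representable functions. This would imply that the construction by Arora et al. [6] is actually depth-minimal.
Observe that, in order to prove Conjecture 1.4, it is sufficient to find, for every $k^* \in \mathbb{N}$, one function $f \in \mathrm{ReLU}_n(k^*) \setminus \mathrm{ReLU}_n(k^*-1)$ with $n = 2^{k^*-1}$. This also implies all other strict inclusions $\mathrm{ReLU}_n(i-1) \subsetneq \mathrm{ReLU}_n(i)$ for $i < k^*$ because $\mathrm{ReLU}_n(i-1) = \mathrm{ReLU}_n(i)$ immediately implies that $\mathrm{ReLU}_n(i-1) = \mathrm{ReLU}_n(i')$ for all $i' \ge i-1$.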
In fact, thanks to Theorem 1.3 by Wang and Sun [74], there is a canonical candidate for such a function, allowing us to reformulate the conjecture as follows.
Conjecture 1.5.
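For $n = 2^k$, $k \in \mathbb{N}$, the function $f_n(x) = \max\{0, x_1, \dots, x_n\}$ cannot be represented with $k$ hidden layers, that is, $f_n \notin \mathrm{ReLU}_n(k)$.
Proposition 1.6.
Conjecture 1.4 and Conjecture 1.5 are equivalent.
Proof.
We argued above that Conjecture 1.5 implies Conjecture 1.4. For the other direction, we prove the contraposition, that is, assuming that Conjecture 1.5 is violated, we show that Conjecture 1.4 is violated as well. To this end, suppose there is a $k \in \mathbb{N}$, $n = 2^k$, such that $f_n$ is representable with $k$ hidden layers. We argue that under this hypothesis, any $(n+1)$-term max function can be represented with $k$ hidden layers. To see this, observe that
$$\max\{\ell_1(x), \dots, \ell_{n+1}(x)\} = \max\{0, \ell_1(x) - \ell_{n+1}(x), \dots, \ell_n(x) - \ell_{n+1}(x)\} + \ell_{n+1}(x).$$
Modifying the first-layer weights of the NN computing $f_n$ such that input $x_i$ is replaced by the affine expression $\ell_i(x) - \ell_{n+1}(x)$, one obtains a $k$-hidden-layer NN computing the function $\max\{0, \ell_1(x) - \ell_{n+1}(x), \dots, \ell_n(x) - \ell_{n+1}(x)\}$. Moreover, since affine functions, in particular also $\ell_{n+1}(x)$, can easily be represented by $k$-hidden-layer NNs, we obtain that any $(n+1)$-term maximum is in $\mathrm{ReLU}_n(k)$. Using Theorem 1.3 by Wang and Sun [74], it follows that $\mathrm{ReLU}_n(k) = \mathrm{CPWL}_n$. In particular, since $k^* = \lceil \log_2(n+1) \rceil = k+1$, we obtain that Conjecture 1.4 must be violated as well. ∎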
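The shift used in this proof, namely subtracting the last affine term inside the maximum and adding it back outside, can be checked numerically. The sketch below is only an illustration: affine terms are encoded as pairs `(coefficients, constant)`, and the names `evaluate`, `plain_max`, `shifted_max`, and `random_check` are hypothetical. Integer data is used so that the comparison is exact.

```python
import random


def evaluate(term, x):
    """Value of the affine term (coefficients, constant) at the point x."""
    coefficients, constant = term
    return sum(a * t for a, t in zip(coefficients, x)) + constant


def plain_max(terms, x):
    """max{l_1(x), ..., l_{n+1}(x)}."""
    return max(evaluate(term, x) for term in terms)


def shifted_max(terms, x):
    """max{0, l_1(x) - l_{n+1}(x), ..., l_n(x) - l_{n+1}(x)} + l_{n+1}(x)."""
    last = evaluate(terms[-1], x)
    return max([0] + [evaluate(term, x) - last for term in terms[:-1]]) + last


def random_check(n=4, trials=200, seed=0):
    """Compare both forms on random integer terms and integer points."""
    rng = random.Random(seed)
    for _ in range(trials):
        terms = [([rng.randint(-5, 5) for _ in range(n)], rng.randint(-5, 5))
                 for _ in range(n + 1)]
        x = [rng.randint(-10, 10) for _ in range(n)]
        if plain_max(terms, x) != shifted_max(terms, x):
            return False
    return True
```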
It is known that Conjecture 1.5 holds for $k = 1$ [56], that is, the CPWL function $\max\{0, x_1, x_2\}$ cannot be computed by a 2-layer NN. The reason for this is that the set of breakpoints of a CPWL function computed by a 2-layer NN is always a union of lines, while the set of breakpoints of $\max\{0, x_1, x_2\}$ is a union of three half-lines; compare Figure 3 and the detailed proof by Mukherjee and Basu [56]. Moreover, in subsequent work to the first version of this article, it was shown that the conjecture is true for all $k \in \mathbb{N}$ if one only allows integer weights in the neural network [31]. However, this proof does not easily generalize to arbitrary, real-valued weights. Thus, the conjecture remains open for all $k \ge 2$.
1.4 Contribution and Outline
In this paper, we present the following results as partial progress towards resolving this conjecture.
In Section 2, we resolve Conjecture 1.5 for $k = 2$, under a natural assumption on the breakpoints of the function represented by any intermediate neuron. Intuitively, the assumption states that no neuron introduces unexpected breakpoints compared to the final function we want to represent. We call such neural networks $H$-conforming, see Section 2 for a formal definition. We then provide a computer-based proof leveraging techniques from mixed-integer programming for the following theorem.
Theorem 1.7.
There does not exist an $H$-conforming 3-layer ReLU NN computing the function $\max\{0, x_1, x_2, x_3, x_4\}$.
In the light of Lemma 1.2, stating that $\mathrm{MAX}(2^k) \subseteq \mathrm{ReLU}(k)$ for all $k \in \mathbb{N}$, one might ask whether the converse is true as well, that is, whether the classes $\mathrm{MAX}(2^k)$ and $\mathrm{ReLU}(k)$ are actually equal. This would not only provide a neat characterization of $\mathrm{ReLU}(k)$, but also prove Conjecture 1.5 without any additional assumption since one can show that $\max\{0, x_1, \dots, x_{2^k}\}$ is not contained in $\mathrm{MAX}(2^k)$.
In fact, for $k = 1$, it is true that $\mathrm{ReLU}(1) = \mathrm{MAX}(2)$, that is, a function is computable with one hidden layer if and only if it is a linear combination of 2-term max functions. However, in Section 3, we show the following theorem.
Theorem 1.8.
For every $k \ge 2$, the set $\mathrm{ReLU}(k)$ is a strict superset of $\mathrm{MAX}(2^k)$.
To achieve this result, the key technical ingredient is the theory of polyhedral complexes associated with CPWL functions. This way, we provide important insights concerning the richness of the class $\mathrm{ReLU}(k)$. As a by-product, the results in Section 3 imply that $\mathrm{MAX}_n(n)$ is a strict subset of $\mathrm{CPWL}_n$, which was conjectured by Wang and Sun [74] in 2005, but has been open since then.
So far, we have focused on understanding the smallest depth needed to express CPWL functions using neural networks with ReLU activations. In Section 4, we complement these results by upper bounds on the sizes of the networks needed for expressing arbitrary CPWL functions. In particular, we show the following theorem.
Theorem 1.9.
Let be a CPWL function with affine pieces. Then can be represented by a ReLU NN with depth and width .
We arrive at this result by introducing a novel application of recently established interrelations between neural networks and tropical geometry.
Theorem 1.9 improves upon a previous bound by He et al. [35] because it is polynomial in if is regarded as fixed constant, while the bounds in [35] are exponential in . In subsequent work to the first version of our article, it was shown that the width of the network can be drastically decreased if one allows more depth (in the order of instead of ) [16].
Let us remark that there are different definitions of the number of pieces of a CPWL function in the literature, compare the discussions in [16, 35] about pieces versus linear components. Our bounds work with any of these definitions since they apply to the smallest possible way to define , called linear components in [16]: for our purposes, can be defined as the smallest number of affine functions such that, at each point, is equal to one of these affine functions. Since all other definitions of the number of pieces are at least that large, our bounds are valid for these definitions as well.
Finally, in Section 5, we provide an outlook on how these interactions between tropical geometry and NNs could possibly also be useful to provide a full, unconditional proof of Conjecture 1.4 by means of polytope theory. This yields another equivalent rephrasing of Conjecture 1.4 which is stated purely in the language of basic operations on polytopes and does not involve neural networks any more.
We conclude in Section 6 with a discussion of further open research questions.
1.5 Further Related Work
Depth versus size
Soon after the original universal approximation theorems [18, 40], concrete bounds were obtained on the number of neurons needed in the hidden layer to achieve a certain level of accuracy. The literature on this is vast and we refer to a small representative sample here [8, 9, 51, 60, 52, 53]. More recent research has focused on how deeper networks can have exponentially or super exponentially smaller size compared to shallower networks [72, 6, 21, 32, 33, 46, 57, 61, 63, 69, 70, 75]. See also [29] for another perspective on the relationship between expressivity and architecture, and the references therein.
Mixed-integer optimization and machine learning
Over the past decade, a growing body of work has emerged that explores the interplay between mixed-integer optimization and machine learning. On the one hand, researchers have attempted to improve mixed-integer optimization algorithms by exploiting novel techniques from machine learning [13, 24, 34, 43, 44, 45, 47, 3]; see also [10] for a recent survey. On the flip side, mixed-integer optimization techniques have been used to analyze function classes represented by neural networks [67, 4, 22, 66, 65]. In Section 2 below, we show another new use of mixed-integer optimization tools for understanding function classes represented by neural networks.
Design of training algorithms
We believe that a better understanding of the function classes represented exactly by a neural architecture also has benefits in terms of understanding the complexity of the training problem. For instance, in work by Arora et al. [6], an understanding of single layer ReLU networks enables the design of a globally optimal algorithm for solving the empirical risk minimization (ERM) problem, that runs in polynomial time in the number of data points in fixed dimension. See also [25, 26, 27, 19, 14, 28, 23, 1, 12, 11, 17, 42] for similar lines of work.
Neural Networks and Tropical Geometry
A recent stream of research involves the interplay between neural networks and tropical geometry. The piecewise linear functions computed by neural networks can be seen as (tropical quotients of) tropical polynomials. Linear regions of these functions correspond to vertices of so-called Newton polytopes associated with these tropical polynomials. Applications of this correspondence include bounding the number of linear regions of a neural network [76, 15, 54] and understanding decision boundaries [2]. In Section 4 we present a novel application of tropical concepts to understand neural networks. We refer to [50] for a recent survey of connections between machine learning and tropical geometry, as well as to the textbooks by Maclagan and Sturmfels [49] and Joswig [41] for in-depth introductions to tropical geometry and tropical combinatorics.
2 Conditional Lower Depth Bounds via Mixed-Integer Programming
In this section, we provide a computer-aided proof that, under a natural, yet unproven assumption, the function $\max\{0, x_1, x_2, x_3, x_4\}$ cannot be represented by a 3-layer NN. It is worth noting that, to the best of our knowledge, no CPWL function is known for which the non-existence of a 3-layer NN can be proven without additional assumptions. For ease of notation, we write $x_0 := 0$.
We first prove that we may restrict ourselves to NNs without biases. This holds true independent of our assumption, which we introduce afterwards.
Definition 2.1.
A function $f\colon \mathbb{R}^n \to \mathbb{R}^m$ is called positively homogeneous if it satisfies $f(\lambda x) = \lambda f(x)$ for all $\lambda \ge 0$.
Definition 2.2.
For an NN given by transformations $T^{(\ell)}(x) = A^{(\ell)} x + b^{(\ell)}$, we define the corresponding homogenized NN to be the NN given by $\tilde T^{(\ell)}(x) = A^{(\ell)} x$ with all biases set to zero.
Proposition 2.3.
If an NN computes a positively homogeneous function, then the corresponding homogenized NN computes the same function.
Proof.
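Let $f$ be the function computed by the original NN and $\tilde f$ the one computed by the homogenized NN. Further, for any $0 \le \tilde k \le k$, let
$$g^{(\tilde k)} = T^{(\tilde k+1)} \circ \sigma \circ T^{(\tilde k)} \circ \dots \circ \sigma \circ T^{(1)}$$
be the function computed by the sub-NN consisting of the first $\tilde k + 1$ layers and let $\tilde g^{(\tilde k)}$ be the function computed by the corresponding homogenized sub-NN. We first show by induction on $\tilde k$ that the norm of $g^{(\tilde k)}(x) - \tilde g^{(\tilde k)}(x)$ is bounded by a global constant that only depends on the parameters of the NN but not on $x$.
For $\tilde k = 0$, we have $\|g^{(0)}(x) - \tilde g^{(0)}(x)\| = \|b^{(1)}\| =: C_0$, settling the induction base. For the induction step, let $\tilde k \ge 1$ and assume that $\|g^{(\tilde k-1)}(x) - \tilde g^{(\tilde k-1)}(x)\| \le C_{\tilde k-1}$, where $C_{\tilde k-1}$ only depends on the parameters of the NN. Since a component-wise application of the ReLU activation function has Lipschitz constant 1, this implies $\|(\sigma \circ g^{(\tilde k-1)})(x) - (\sigma \circ \tilde g^{(\tilde k-1)})(x)\| \le C_{\tilde k-1}$. Using the spectral matrix norm $\|A\|$ of a matrix $A$, and setting $b^{(k+1)} := 0$ for the bias-free output layer, we obtain:
$$\|g^{(\tilde k)}(x) - \tilde g^{(\tilde k)}(x)\| = \big\|b^{(\tilde k+1)} + A^{(\tilde k+1)}\big((\sigma \circ g^{(\tilde k-1)})(x) - (\sigma \circ \tilde g^{(\tilde k-1)})(x)\big)\big\| \le \|b^{(\tilde k+1)}\| + \|A^{(\tilde k+1)}\| \cdot C_{\tilde k-1} =: C_{\tilde k}.$$
Since the right-hand side only depends on NN parameters, the induction is completed.
Finally, we show that $f = \tilde f$. For the sake of contradiction, suppose that there is an $x$ with $\|f(x) - \tilde f(x)\| = \delta > 0$. Let $x' := \frac{C_k + 1}{\delta}\, x$; then, by positive homogeneity of $f$ (by assumption) and $\tilde f$ (by construction and because the ReLU function is positively homogeneous), it follows that $\|f(x') - \tilde f(x')\| = C_k + 1 > C_k$, contradicting the property shown above. Thus, we have $f = \tilde f$. ∎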
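As a small illustration of Proposition 2.3, the sketch below evaluates a toy NN with nonzero biases that nevertheless computes the positively homogeneous function $x \mapsto x$, and compares it with its homogenized version. The example network, the helper `forward`, and the flag `use_biases` are assumptions made for this illustration only.

```python
def relu(v):
    return [max(0.0, t) for t in v]


def forward(layers, x, use_biases=True):
    """Evaluate an NN given as [(A_1, b_1), ..., (A_k, b_k), (A_out, None)].

    With use_biases=False all hidden-layer biases are replaced by zero,
    which yields the homogenized NN; the output layer is linear anyway.
    """
    *hidden, (A_out, _) = layers
    for A, b in hidden:
        pre = [sum(a * t for a, t in zip(row, x)) + (bi if use_biases else 0.0)
               for row, bi in zip(A, b)]
        x = relu(pre)
    return [sum(a * t for a, t in zip(row, x)) for row in A_out]


# Toy NN: relu(x + 1) - relu(-x - 1) - relu(0 * x + 1) = (x + 1) - 1 = x.
toy_layers = [
    ([[1.0], [-1.0], [0.0]], [1.0, -1.0, 1.0]),  # hidden layer with biases
    ([[1.0, -1.0, -1.0]], None),                 # linear output layer
]
```

Dropping the biases gives `relu(x) - relu(-x) - relu(0)`, which is again `x`, in line with the proposition.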
Since $\max\{0, x_1, x_2, x_3, x_4\}$ is positively homogeneous, Proposition 2.3 implies that, if there is a 3-layer NN computing this function, then there also is one that has no biases. Therefore, in the remainder of this section, we only consider NNs without biases and assume implicitly that all considered CPWL functions are positively homogeneous. In particular, any piece of such a CPWL function is linear and not only affine linear.
Observe that, for the function $\max\{0, x_1, x_2, x_3, x_4\}$, the only points of non-differentiability (a.k.a. breakpoints) are at places where at least two of the five numbers $x_0 = 0$, $x_1$, $x_2$, $x_3$, and $x_4$ are equal. Hence, if some neuron of an NN computing this function introduces breakpoints at other places, these breakpoints must be canceled out by other neurons. Therefore, we find it natural to work under the assumption that such breakpoints need not be introduced at all in the first place.
To make this assumption formal, let $H_{ij} = \{x \in \mathbb{R}^4 \mid x_i = x_j\}$, for $0 \le i < j \le 4$, be ten hyperplanes in $\mathbb{R}^4$ and $H = \bigcup_{0 \le i < j \le 4} H_{ij}$ be the corresponding hyperplane arrangement. This is the intersection of the so-called braid arrangement in five dimensions with the hyperplane $x_0 = 0$ [68]. The regions or cells of $H$ are defined to be the closures of the connected components of $\mathbb{R}^4 \setminus H$. It is easy to see that these regions are in one-to-one correspondence to the $5! = 120$ possible orderings of the five numbers $x_0 = 0$, $x_1$, $x_2$, $x_3$, and $x_4$. More precisely, for a permutation $\pi$ of the five indices $[4]_0 = \{0, 1, 2, 3, 4\}$, the corresponding region is the polyhedron
$$C_\pi := \{x \in \mathbb{R}^4 \mid x_{\pi(0)} \le x_{\pi(1)} \le x_{\pi(2)} \le x_{\pi(3)} \le x_{\pi(4)}\}.$$
Definition 2.4.
We say that a (positively homogeneous) CPWL function is $H$-conforming, if it is linear within any of these regions of $H$, that is, if it only has breakpoints where the relative ordering of the five values $x_0 = 0$, $x_1$, $x_2$, $x_3$, $x_4$ changes. Moreover, an NN is said to be $H$-conforming if the output of each neuron contained in the NN is $H$-conforming.
See Figure 4 for an illustration of the definition in the (simpler) two-dimensional case. Note that, by the definition, an NN is $H$-conforming if and only if, for all layers $\ell \in [k]$, the intermediate function $\sigma \circ T^{(\ell)} \circ \sigma \circ T^{(\ell-1)} \circ \dots \circ \sigma \circ T^{(1)}$ is $H$-conforming.
As argued above, it is plausible that considering $H$-conforming NNs is enough to prove Conjecture 1.4. In other words, we conjecture that, if there exists a 3-layer NN computing the function $\max\{0, x_1, x_2, x_3, x_4\}$, then there also exists one that is $H$-conforming. This motivates the following theorem, which we prove with computer assistance by means of mixed-integer programming.
Theorem 1.7.
There does not exist an $H$-conforming 3-layer ReLU NN computing the function $\max\{0, x_1, x_2, x_3, x_4\}$.
The remainder of this section is devoted to proving this theorem. The rough outline of the proof is as follows. We first study some geometric properties of the hyperplane arrangement $H$. This will show that each of the 120 cells of $H$ is a simplicial polyhedral cone spanned by 4 extreme rays. In total, there are 30 such rays (because rays are used multiple times to span different cones). This implies that each $H$-conforming function is uniquely determined by its values on the 30 rays and, therefore, the set of $H$-conforming functions of type $\mathbb{R}^4 \to \mathbb{R}$ is a 30-dimensional vector space. We then use linear algebra to show that the space of functions generated by $H$-conforming two-layer NNs is a 14-dimensional subspace. Moreover, with two hidden layers, at least 29 of the 30 dimensions can be generated and $\max\{0, x_1, x_2, x_3, x_4\}$ is not contained in this 29-dimensional subspace. So the remaining question is whether the 14 dimensions producible with the first hidden layer can be combined in such a way that after applying a ReLU activation in the second hidden layer, we do not end up within the 29-dimensional subspace. We model this question as a mixed-integer program (MIP). Solving the MIP yields that we always end up within the 29-dimensional subspace, implying that $\max\{0, x_1, x_2, x_3, x_4\}$ cannot be represented by an $H$-conforming 3-layer NN. This provides a computational proof of Theorem 1.7.
Let us start with investigating the structure of the hyperplane arrangement $H$. For readers familiar with the interplay between hyperplane arrangements and polytopes, it is worth noting that $H$ is dual to a combinatorial equivalent of the 4-dimensional permutahedron. Hence, what we are studying in the following are some combinatorial properties of the permutahedron.
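Recall that the regions of $H$ are given by the 120 polyhedra
$$C_\pi = \{x \in \mathbb{R}^4 \mid x_{\pi(0)} \le x_{\pi(1)} \le x_{\pi(2)} \le x_{\pi(3)} \le x_{\pi(4)}\}$$
for each permutation $\pi$ of $[4]_0$, where $x_0$ is used as a replacement for $0$. With this representation, one can see that $C_\pi$ is a pointed polyhedral cone (with the origin as its only vertex) spanned by the four half-lines (a.k.a. rays)
$$R_{\{\pi(0)\}} := \{x \in \mathbb{R}^4 \mid x_{\pi(0)} \le x_{\pi(1)} = x_{\pi(2)} = x_{\pi(3)} = x_{\pi(4)}\},$$
$$R_{\{\pi(0),\pi(1)\}} := \{x \in \mathbb{R}^4 \mid x_{\pi(0)} = x_{\pi(1)} \le x_{\pi(2)} = x_{\pi(3)} = x_{\pi(4)}\},$$
$$R_{\{\pi(0),\pi(1),\pi(2)\}} := \{x \in \mathbb{R}^4 \mid x_{\pi(0)} = x_{\pi(1)} = x_{\pi(2)} \le x_{\pi(3)} = x_{\pi(4)}\},$$
$$R_{\{\pi(0),\pi(1),\pi(2),\pi(3)\}} := \{x \in \mathbb{R}^4 \mid x_{\pi(0)} = x_{\pi(1)} = x_{\pi(2)} = x_{\pi(3)} \le x_{\pi(4)}\}.$$
Observe that these objects are indeed rays anchored at the origin because the three equalities define a one-dimensional subspace of $\mathbb{R}^4$ and the inequality cuts away one of the two directions.
With that notation, we see that each of the 120 cells of $H$ is a simplicial cone spanned by four out of the 30 rays $R_S$ with $\emptyset \subsetneq S \subsetneq [4]_0$. For each such set $S$, denote its complement by $\bar S := [4]_0 \setminus S$. Let us use a generating vector $r_S \in \mathbb{R}^4$ for each of these rays such that $R_S = \operatorname{cone} r_S$ as follows: If $0 \in S$, then $r_S := \mathbb{1}_{\bar S}$, otherwise $r_S := -\mathbb{1}_S$, where for each $S \subseteq [4]$, the vector $\mathbb{1}_S \in \mathbb{R}^4$ contains entries $1$ at precisely those index positions that are contained in $S$ and entries $0$ elsewhere. For example, $r_{\{0,2,3\}} = (1,0,0,1)$ and $r_{\{1,4\}} = (-1,0,0,-1)$. Then, the set $R$ containing conic generators of all the 30 rays of $H$ consists of the 30 vectors $r_S$, $\emptyset \subsetneq S \subsetneq [4]_0$.
Let $\mathcal{F}$ be the space of all $H$-conforming CPWL functions of type $\mathbb{R}^4 \to \mathbb{R}$. We show that $\mathcal{F}$ is a 30-dimensional vector space.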
Lemma 2.5.
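The map $f \mapsto (f(r))_{r \in R}$ that evaluates a function $f \in \mathcal{F}$ at the 30 rays in $R$ is an isomorphism between $\mathcal{F}$ and $\mathbb{R}^{30}$. In particular, $\mathcal{F}$ is a 30-dimensional vector space.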
Proof.
First note that $\mathcal{F}$ is closed under addition and scalar multiplication. Therefore, it is a subspace of the vector space of continuous functions of type $\mathbb{R}^4 \to \mathbb{R}$, and thus, in particular, a vector space. We show that the map $f \mapsto (f(r))_{r \in R}$ is in fact a vector space isomorphism. The map is obviously linear, so we only need to show that it is a bijection. In order to do so, remember that $\mathbb{R}^4$ is the union of the 120 simplicial cones $C_\pi$. In particular, given the function values on the extreme rays of these cones, there is a unique positively homogeneous, continuous continuation that is linear within each of the 120 cones. This implies that the considered map is a bijection between $\mathcal{F}$ and $\mathbb{R}^{30}$. ∎
The previous lemma also provides a canonical basis of the vector space $\mathcal{F}$: the one consisting of all CPWL functions attaining value $1$ at one ray $r \in R$ and value $0$ at all other rays. However, it turns out that for our purposes it is more convenient to work with a different basis. To this end, let $f_S(x) := \max_{i \in S} x_i$ for each $S \subseteq [4]_0$ with $S \notin \{\emptyset, \{0\}\}$. These 30 functions contain, among other functions, the four (linear) coordinate projections $f_{\{i\}}(x) = x_i$, $i \in [4]$, and the function $f_{[4]_0}(x) = \max\{0, x_1, x_2, x_3, x_4\}$.
Lemma 2.6.
The 30 functions $f_S$ with $S \subseteq [4]_0$, $S \notin \{\emptyset, \{0\}\}$, form a basis of $\mathcal{F}$.
Proof.
Evaluating the 30 functions $f_S$ at all 30 rays $r \in R$ yields 30 vectors in $\mathbb{R}^{30}$. It can be easily verified (e.g., using a computer) that these vectors form a basis of $\mathbb{R}^{30}$. Thus, due to the isomorphism of Lemma 2.5, the functions $f_S$ form a basis of $\mathcal{F}$. ∎
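A computer check of this kind could look like the sketch below; it is not the authors' verification code. It assumes the ray generators are $r_S = \mathbb{1}_{\bar S}$ if $0 \in S$ and $r_S = -\mathbb{1}_S$ otherwise, and that the candidate basis functions are $f_T(x) = \max_{i \in T} x_i$ with $x_0 = 0$. All helper names are hypothetical, and points are stored as 5-tuples whose leading entry is the constant $x_0 = 0$.

```python
from fractions import Fraction
from itertools import combinations

INDICES = range(5)  # index 0 stands for the constant x_0 = 0


def subsets(min_size, max_size):
    return [frozenset(c) for k in range(min_size, max_size + 1)
            for c in combinations(INDICES, k)]


RAY_SETS = subsets(1, 4)  # nonempty proper subsets S of {0, ..., 4}
BASIS_SETS = [T for T in subsets(1, 5) if T != frozenset({0})]


def ray_generator(S):
    """Assumed generator r_S, written as (x_0, x_1, ..., x_4) with x_0 = 0."""
    if 0 in S:
        return tuple(0 if i in S else 1 for i in INDICES)
    return tuple(-1 if i in S else 0 for i in INDICES)


def f(T, x):
    """Assumed basis function f_T(x) = max of x_i over i in T."""
    return max(x[i] for i in T)


def rank(rows):
    """Rank via exact Gaussian elimination over the rationals."""
    m = [[Fraction(v) for v in row] for row in rows]
    r = 0
    for col in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            factor = m[i][col] / m[r][col]
            m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
    return r


# Row T holds the values of f_T at all ray generators.
evaluation_matrix = [[f(T, ray_generator(S)) for S in RAY_SETS] for T in BASIS_SETS]
```

Exact rational arithmetic is used so that the rank is not affected by rounding.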
Next, we focus on particular subspaces of $\mathcal{F}$ generated by only some of the 30 functions $f_S$. We prove that they correspond to the spaces of functions computable by $H$-conforming 2- and 3-layer NNs, respectively.
To do so, let $\mathcal{B}(2)$ be the set of the 14 basis functions $f_S$ with $1 \le |S| \le 2$ and $S \ne \{0\}$. Let $\mathcal{F}(2)$ be the 14-dimensional subspace spanned by $\mathcal{B}(2)$. Similarly, let $\mathcal{B}(4)$ be the set of the 29 basis functions $f_S$ with $|S| \le 4$ (all but $f_{[4]_0}$). Let $\mathcal{F}(4)$ be the 29-dimensional subspace spanned by $\mathcal{B}(4)$.
Lemma 2.7.
The space $\mathcal{F}(2)$ consists of all functions computable by $H$-conforming 2-layer NNs.
Proof.
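Each function in $\mathcal{F}(2)$ is a linear combination of 2-term max functions by definition. Hence, by Lemma 1.2, it can be represented by a 2-layer NN.
Conversely, we show that any function representable by an $H$-conforming 2-layer NN is indeed contained in $\mathcal{F}(2)$. It suffices to show that the output of every neuron in the first (and only) hidden layer of an $H$-conforming ReLU NN is in $\mathcal{F}(2)$ because the output of a 2-layer NN is a linear combination of such outputs. Let $a \in \mathbb{R}^4$ be the first-layer weights of such a neuron, computing the function $g_a(x) := \max\{a^\top x, 0\}$, which has the hyperplane $\{x \in \mathbb{R}^4 \mid a^\top x = 0\}$ as breakpoints (or is constantly zero). Since the NN must be $H$-conforming, this must be one of the ten hyperplanes $x_i = x_j$, $0 \le i < j \le 4$. Thus, $g_a(x) = \max\{\lambda (x_i - x_j), 0\}$ for some $\lambda \in \mathbb{R}$. If $\lambda \ge 0$, it follows that $g_a(x) = \lambda (\max\{x_i, x_j\} - x_j)$, which is in $\mathcal{F}(2)$, and if $\lambda \le 0$, we obtain $g_a(x) = -\lambda (\max\{x_i, x_j\} - x_i)$, which is in $\mathcal{F}(2)$ as well. This concludes the proof. ∎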
For 3-layer NNs, an analogous statement can be made. However, only one direction can be easily seen.
Lemma 2.8.
Any function in $\mathcal{F}(4)$ can be represented by an $H$-conforming 3-layer NN.
Proof.
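As in the previous lemma, each function in $\mathcal{F}(4)$ is a linear combination of 4-term max functions by definition. Hence, by Lemma 1.2, it can be represented by a 3-layer NN. ∎
Our goal is to prove the converse as well: any $H$-conforming function represented by a 3-layer NN is in $\mathcal{F}(4)$. Since $\max\{0, x_1, x_2, x_3, x_4\}$ is the 30th basis function, which is linearly independent from $\mathcal{B}(4)$ and thus not contained in $\mathcal{F}(4)$, this implies Theorem 1.7. To achieve this goal, we first provide another characterization of $\mathcal{F}(4)$, which can be seen as an orthogonal direction to $\mathcal{F}(4)$ in $\mathcal{F}$. For a function $f \in \mathcal{F}$, let
$$\phi(f) := \sum_{\emptyset \subsetneq S \subsetneq [4]_0} (-1)^{|S|} f(r_S).$$
This defines a linear map $\phi$ from $\mathcal{F}$ to $\mathbb{R}$.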
Lemma 2.9.
A function $f \in \mathcal{F}$ is contained in $\mathcal{F}(4)$ if and only if $\phi(f) = 0$.
Proof.
Any $f \in \mathcal{F}$ can be represented as a unique linear combination of the 30 basis functions $f_S$ and is contained in $\mathcal{F}(4)$ if and only if the coefficient of $f_{[4]_0}$ is zero. One can easily check (with a computer) that $\phi$ maps all functions in $\mathcal{B}(4)$ to $0$, but not the 30th basis function $f_{[4]_0}$. Thus, $f$ is contained in $\mathcal{F}(4)$ if and only if it satisfies $\phi(f) = 0$. ∎
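The check mentioned in this proof can be sketched in a few lines. The sketch assumes that $\phi$ is the alternating sum $\phi(f) = \sum_{\emptyset \subsetneq S \subsetneq [4]_0} (-1)^{|S|} f(r_S)$, with generators $r_S = \mathbb{1}_{\bar S}$ if $0 \in S$ and $r_S = -\mathbb{1}_S$ otherwise; both the formula and the helper names are assumptions of this illustration rather than the authors' code.

```python
from itertools import combinations

IDX = range(5)  # index 0 plays the role of the constant x_0 = 0


def nonempty_subsets(max_size):
    return [frozenset(c) for k in range(1, max_size + 1)
            for c in combinations(IDX, k)]


def ray_generator(S):
    """Assumed generator r_S as a 5-tuple with leading entry x_0 = 0."""
    if 0 in S:
        return tuple(0 if i in S else 1 for i in IDX)
    return tuple(-1 if i in S else 0 for i in IDX)


def phi(func):
    """Assumed form: sum over nonempty proper S of (-1)^|S| * func(r_S)."""
    return sum((-1) ** len(S) * func(ray_generator(S))
               for S in nonempty_subsets(4))


def basis_function(T):
    """f_T(x) = max of x_i over i in T."""
    return lambda x: max(x[i] for i in T)


FULL = frozenset(IDX)
phi_values = {T: phi(basis_function(T))
              for T in nonempty_subsets(5) if T != frozenset({0})}
```

Under these assumptions, `phi_values` is zero for every set of size at most four and nonzero for the full index set.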
In order to make use of our assumption that the NN is $H$-conforming, we need the following insight about when the property of being $H$-conforming is preserved after applying a ReLU activation.
Lemma 2.10.
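Let $g \in \mathcal{F}$. The function $h = \sigma \circ g$ is $H$-conforming (and thus in $\mathcal{F}$ as well) if and only if there is no pair of sets $\emptyset \subsetneq S \subsetneq S' \subsetneq [4]_0$ with $g(r_S)$ and $g(r_{S'})$ being nonzero and having different signs.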
Proof.
The key observation to prove this lemma is the following: for two rays $R_S$ and $R_{S'}$, there exists a cell $C$ of the hyperplane arrangement $H$ for which both $R_S$ and $R_{S'}$ are extreme rays if and only if $S \subsetneq S'$ or $S' \subsetneq S$.
Hence, if there exists a pair of sets $\emptyset \subsetneq S \subsetneq S' \subsetneq [4]_0$ with $g(r_S)$ and $g(r_{S'})$ being nonzero and having different signs, then the function $g$ restricted to $C$ is a linear function with both strictly positive and strictly negative values. Therefore, after applying the ReLU activation, the resulting function $h$ has breakpoints within $C$ and is not $H$-conforming.
Conversely, if for each pair of sets $\emptyset \subsetneq S \subsetneq S' \subsetneq [4]_0$, both $g(r_S)$ and $g(r_{S'})$ are either nonpositive or nonnegative, then $g$ restricted to any cell $C$ of $H$ is either nonpositive or nonnegative everywhere. In the first case, $h$ restricted to that cell $C$ is the zero function, while in the second case, $h$ coincides with $g$ in $C$. In both cases, $h$ is linear within all cells and, thus, $H$-conforming. ∎
Having collected all these lemmas, we are finally able to construct an MIP whose solution proves that any function computed by an $H$-conforming 3-layer NN is in $\mathcal{F}(4)$. As in the proof of Lemma 2.7, it suffices to focus on the output of a single neuron in the second hidden layer. Let $h = \sigma \circ g$ be the output of such a neuron with $g$ being its input. Observe that, by construction, $g$ is a function computed by a 2-layer NN, and thus, by Lemma 2.7, a linear combination of the 14 functions in $\mathcal{B}(2)$. The MIP contains three types of variables, which we denote in bold to distinguish them from constants:
• 14 continuous variables $\mathbf{a}_T \in [-1, 1]$, one for each $f_T \in \mathcal{B}(2)$, being the coefficients of the linear combination of the basis of $\mathcal{F}(2)$ forming $g$, that is, $g = \sum_{f_T \in \mathcal{B}(2)} \mathbf{a}_T f_T$ (since multiplying $g$ and $h$ with a positive scalar does not alter the containment of $h$ in $\mathcal{F}(4)$, we may restrict the variables to $[-1, 1]$),
• 30 binary variables $\mathbf{z}_S \in \{0, 1\}$ for $\emptyset \subsetneq S \subsetneq [4]_0$, determining whether the considered neuron is strictly active at ray $R_S$, that is, whether $g(r_S) > 0$,
• 30 continuous variables $\mathbf{h}_S \in \mathbb{R}$ for $\emptyset \subsetneq S \subsetneq [4]_0$, representing the output of the considered neuron at all rays, that is, $\mathbf{h}_S = h(r_S)$.
To ensure that these variables interact as expected, we need two types of constraints:
-
•
For each of the rays , , the following constraints ensure that and output are correctly calculated from the variables , that is, if and only if is positive, and . Also compare the references given in Section 1.5 concerning MIP models for ReLU units. Note that the restriction of the coefficients to ensures that the absolute value of is always bounded by , allowing us to use as a replacement for :
(2) Observe that these constraints ensure that one of the following two cases occurs: If , then the first and third line imply and the second line implies that the incoming activation is in fact nonpositive. The fourth line is always satisfied in that case. Otherwise, if , then the second and fourth line imply that equals the incoming activation, and, in combination with the first line, this has to be nonnegative. The third line is always satisfied in that case. Hence, the set of constraints (2) correctly models the ReLU activation function.
• For each of the pairs of sets , the following constraints ensure that the property in Lemma 2.10 is satisfied. More precisely, if one of the variables or equals , then the ray of the other set has nonnegative activation, that is, or , respectively:
(3) Observe that these constraints prevent the two rays and from having nonzero activations with different signs. Conversely, if this is not the case, then we can always satisfy constraints (3) by setting only those variables to value where the activation of ray is strictly positive. (Note that, if the incoming activation is precisely zero, constraints (2) make it possible to choose both values or for .) Hence, these constraints are in fact appropriate to model -conformity.
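The case analysis given for the ReLU constraints (2) can be checked mechanically. The following is a minimal sketch, assuming a generic big-M formulation with four inequalities in the order discussed above. The names `feasible`, `y`, `z`, `act`, and the bound `M = 5` are illustrative assumptions and are not taken from the MIP of this paper.

```python
from fractions import Fraction


def feasible(y, z, act, M):
    """Check four big-M inequalities for one neuron at one ray.

    y: neuron output, z: binary activity indicator,
    act: incoming activation, M: assumed bound with |act| <= M.
    """
    return (
        y >= 0                       # line 1: output is nonnegative
        and y >= act                 # line 2: output dominates the activation
        and y <= M * z               # line 3: z = 0 forces y = 0
        and y <= act + M * (1 - z)   # line 4: z = 1 forces y <= act
    )


def relu(t):
    return max(t, 0)


M = Fraction(5)  # hypothetical bound on the absolute activation
activations = [Fraction(k, 4) for k in range(-20, 21)]  # values in [-5, 5]
candidates = [Fraction(k, 4) for k in range(-4, 25)]    # candidate outputs

for act in activations:
    for z in (0, 1):
        ys = [y for y in candidates if feasible(y, z, act, M)]
        # Every feasible output equals ReLU(act).
        assert all(y == relu(act) for y in ys)
    # Some choice of z admits the correct output.
    assert any(feasible(relu(act), z, act, M) for z in (0, 1))
```

Exact rational numbers are used so that the boundary case of a zero activation, where both values of `z` are feasible, is tested without rounding effects.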
In the light of Lemma 2.9, the objective function of our MIP is to maximize , that is, the expression
The MIP has a total of 30 binary and 44 continuous variables, as well as 420 inequality constraints. The next proposition formalizes how this MIP can be used to check whether a 3-layer NN function can exist outside .
Proposition 2.11.
There exists an -conforming 3-layer NN computing a function not contained in if and only if the objective value of the MIP defined above is strictly positive.
Proof.
For the first direction, assume that such an NN exists. Since its final output is a linear combination of the outputs of the neurons in the second hidden layer, one of these neurons must compute a function , with being the input to that neuron. By Lemma 2.9, it follows that . Moreover, we can even assume without loss of generality that , as we argue now. If this is not the case, multiply all first-layer weights of the NN by to obtain a new NN computing function instead of . Observing that for all , we obtain for all . Plugging this into the definition of and using that the cardinalities of and have different parity, we further obtain . Therefore, we can assume that was already positive in the first place.
Using Lemma 2.7, the function can be represented as a linear combination of the functions in . Let . Note that because otherwise would be the zero function. Let us define modified functions and from and as follows. Let , , and . Moreover, for all rays , let , as well as if , and otherwise.
It is easy to verify that the variables , , and defined that way satisfy (2). Moreover, since the NN is -conforming, they also satisfy (3). Finally, they also yield a strictly positive objective function value since .
For the reverse direction, assume that there exists an MIP solution consisting of , , and , satisfying (2) and (3), and having a strictly positive objective function value. Define the functions and . One concludes from (2) that for all rays . Lemma 2.7 implies that can be represented by a 2-layer NN. Thus, can be represented by a 3-layer NN. Moreover, constraints (3) guarantee that this NN is -conforming. Finally, since the MIP solution has strictly positive objective function value, we obtain , implying that . ∎
In order to use the MIP as part of a mathematical proof, we employed an MIP solver that uses exact rational arithmetic without numerical errors, namely the solver by the Parma Polyhedral Library (PPL) [7]. We called the solver from a SageMath (Version 9.0) [71] script on a machine with an Intel Core i7-8700 6-Core 64-bit CPU and 15.5 GB RAM, using the openSUSE Leap 15.2 Linux distribution. SageMath, which natively includes the PPL solver, is published under the GPLv3 license. After a total running time of almost 7 days (153 hours), we obtained optimal objective function value zero. This makes it possible to prove Theorem 1.7.
Proof of Theorem 1.7.
Since the MIP has optimal objective function value zero, Proposition 2.11 implies that any function computed by an -conforming -layer NN is contained in . In particular, it is not possible to compute the function with an -conforming -layer NN. ∎
We remark that the state-of-the-art MIP solver Gurobi (version 9.1.1) [30], which is commercial but offers free academic licenses, is able to solve the same MIP in less than a second, providing the same result. However, Gurobi does not employ exact arithmetic, so numerical errors cannot be excluded and its output cannot be used as part of a mathematical proof.
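As a small illustration of why floating-point results cannot certify an exact statement such as "the optimal value is zero", the following sketch compares binary floating-point arithmetic with exact rational arithmetic. It uses only Python's standard `fractions` module and is unrelated to the internals of PPL or Gurobi.

```python
from fractions import Fraction

# Binary floating point cannot represent 0.1, 0.2, or 0.3 exactly,
# so an identity that holds over the rationals fails numerically.
float_identity_holds = (0.1 + 0.2 == 0.3)

# Exact rational arithmetic reproduces the identity.
exact_identity_holds = (
    Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10)
)

assert not float_identity_holds
assert exact_identity_holds
```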
The SageMath code can be found on GitHub at
https://github.com/ChristophHertrich/relu-mip-depth-bound.
Additionally, the MIP can be found there as an .mps file, a standard format to represent MIPs. This allows one to use any solver of choice to reproduce our result.
3 Going Beyond Linear Combinations of Max Functions
In this section we prove the following result, showing that NNs with hidden layers can compute more functions than only linear combinations of -term max functions.
See Theorem 1.8.
In order to prove this theorem, for each number of hidden layers , we provide a specific function in . The challenging part is to show that the function is in fact not contained in .
Proposition 3.1.
For any , the function defined by
(4) |
is not contained in .
This means that cannot be written as a linear combination of -term max functions, which proves a conjecture by Wang and Sun [74] that , which has been open since 2005. Previously, it was only known that linear combinations of -term maxes are not sufficient to represent any CPWL function defined on , that is, . Lu [48] provides a short analytical argument for this fact.
Before we prove Proposition 3.1, we show that it implies Theorem 1.8.
Proof of Theorem 1.8.
For , let . By Proposition 3.1, function defined in (4) is not contained in . It remains to show that it can be represented using a ReLU NN with hidden layers. To see this, first observe that any of the terms , for , and can be expressed by a one-hidden-layer NN since all these are (linear combinations of) -term max functions. Since is the maximum of these terms, and since the maximum of numbers can be computed with hidden layers (Lemma 1.2), this implies that is in . ∎
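The depth argument in the proof above relies on computing a maximum by pairwise comparisons, one hidden layer per round. The following minimal sketch illustrates this idea using the standard identity max(a, b) = ReLU(a - b) + ReLU(b) - ReLU(-b). The function names are hypothetical, and the construction is only meant to illustrate the layer count, not to reproduce the proof of Lemma 1.2.

```python
import math


def relu(t):
    return max(t, 0)


def max2(a, b):
    # One hidden layer with three ReLU neurons:
    # max(a, b) = relu(a - b) + b, where b = relu(b) - relu(-b).
    return relu(a - b) + relu(b) - relu(-b)


def relu_max(values):
    """Return (maximum of values, number of hidden layers used)."""
    layer = list(values)
    depth = 0
    while len(layer) > 1:
        nxt = [max2(layer[i], layer[i + 1])
               for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2 == 1:
            # An unpaired value is carried through the layer, which a ReLU
            # network can do via t = relu(t) - relu(-t).
            nxt.append(layer[-1])
        layer = nxt
        depth += 1
    return layer[0], depth


values = [3, -1, 7, 2, 5]
result, depth = relu_max(values)
assert result == max(values)
assert depth == math.ceil(math.log2(len(values)))
```

Each round halves the number of remaining candidates (rounding up), so the number of rounds for a given number of inputs is the binary logarithm of that number, rounded up.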
In order to prove Proposition 3.1, we need the concept of polyhedral complexes. A polyhedral complex is a finite set of polyhedra such that each face of a polyhedron in is also in , and for two polyhedra , their intersection is a common face of and (possibly the empty face). Given a polyhedral complex in and an integer , we let denote the collection of all -dimensional polyhedra in .
For a convex CPWL function , we define its underlying polyhedral complex as follows: it is the unique polyhedral complex covering (i.e., each point in belongs to some polyhedron in ) whose -dimensional polyhedra coincide with the domains of the (maximal) affine pieces of . In particular, is affine linear within each , but not within any strict superset of a polyhedron in .
Exploiting properties of polyhedral complexes associated with CPWL functions, we prove the following proposition.
Proposition 3.2.
Let be a convex CPWL function and let be the underlying polyhedral complex. If there exists a hyperplane such that the set
is nonempty and contains no line, then cannot be expressed as a linear combination of -term maxima of affine linear functions.
Again, before we proceed to the proof of Proposition 3.2, we show that it implies Proposition 3.1.
Proof of Proposition 3.1.
Observe that (defined in (4)) has the alternate representation
as a maximum of terms. Let be its underlying polyhedral complex. Let the hyperplane be defined by .
Observe that any facet in is a polyhedron defined by two of the terms that are equal and at least as large as each of the remaining terms. Hence, the only facet that could possibly be contained in is
Note that is indeed an -dimensional facet in , because, for example, a small ball around intersected with is contained in .
Finally, we need to show that is pointed, that is, it contains no line. A well-known fact from polyhedral theory says that if there is any line in with direction , then must satisfy the defining inequalities with equality. However, only the zero vector does this. Hence, cannot contain a line.
Therefore, when applying Proposition 3.2 to with underlying polyhedral complex and hyperplane , we have , which is nonempty and contains no line. Hence, cannot be written as a linear combination of -term maxima. ∎
The remainder of this section is devoted to proving Proposition 3.2. In order to exploit properties of the underlying polyhedral complex of the considered CPWL functions, we will first introduce some terminology, notation, and results related to polyhedral complexes in for any .
Definition 3.3.
Given an abelian group , we define as the family of all functions of the form , where is a polyhedral complex that covers . We say that is the underlying polyhedral complex, or the polyhedral complex associated with .
Just to give an intuition of the reason for this definition, let us mention that later we will choose to be the set of affine linear maps with respect to the standard operation of sum of functions. Moreover, given a convex CPWL function with underlying polyhedral complex , we will consider the following function : for every , will be the affine linear map that coincides with over . It can be helpful, though not necessary, to keep this in mind when reading the next definitions and observations.
It is useful to observe that the functions in can also be described in a different way. Before explaining this, we need to define an ordering between the two elements of each pair of opposite halfspaces. More precisely, let be a hyperplane in and let be the two closed halfspaces delimited by . We choose an arbitrary rule to say that “precedes” , which we write as . (Footnote 1: In case one wants to see such a rule explicitly, this is a possible way: Fix an arbitrary . We can say that if and only if , where is the first vector in the standard basis of that does not lie on (i.e., and ). Note that this definition does not depend on the choice of .) We can then extend this ordering rule to those pairs of -dimensional polyhedra of a polyhedral complex in that share a facet. Specifically, given a polyhedral complex in , let be such that . Further, let be the unique hyperplane containing . We say that if the halfspace delimited by and containing precedes the halfspace delimited by and containing .
We can now explain the alternate description of the functions in , which is based on the following notion.
Definition 3.4.
Let , with associated polyhedral complex . The facet-function associated with is the function defined as follows: given , let be the two polyhedra in such that , where ; then we set .
Although it will not be used, we observe that knowing is sufficient to reconstruct up to an additive constant. This means that a function associated with the same polyhedral complex has the same facet-function if and only if there exists such that for every . (However, it is not true that every function is the facet-function of some function in .)
We now introduce a sum operation over .
Definition 3.5.
For functions with associated polyhedral complexes , the sum is the function in defined as follows:
• the polyhedral complex associated with is
• given , can be uniquely obtained as , where for every ; we then define
The term “sum” is justified by the fact that when (and thus have the same domain) we obtain the standard notion of the sum of functions.
The next result shows how to compute the facet-function of a sum of functions in .
Observation 3.6.
With the notation of Definition 3.5, let be the facet-functions associated with , and let be the facet-function associated with . Given , let be the set of indices such that contains a (unique) element with . Then
(5) |
Proof.
Let be the two polyhedra in such that , with . We have and for a unique choice of for every . Then
(6) |
Now fix . Since , . If , then and . Furthermore, because . If, on the contrary, , the fact that is a polyhedral complex implies that , and thus . Moreover, in this case : this is because , which implies that the relative interior of is contained in the relative interior of . With these observations, from (6) we obtain (5). ∎
Definition 3.7.
Fix , with associated polyhedral complex . Let be a hyperplane in , and let be the closed halfspaces delimited by . Define the polyhedral complex
The refinement of with respect to is the function with associated polyhedral complex defined as follows: given , , where is the unique polyhedron in that contains .
The next result shows how to compute the facet-function of a refinement.
Observation 3.8.
With the notation of Definition 3.7, let be the facet-function associated with . Then, the facet-function associated with is given by
for every .
Proof.
Let be the polyhedra in such that , with . Further, let be the unique polyhedra in that contain (respectively). It might happen that .
If there is containing , then the fact that is a polyhedral complex implies that . Note that and in this case. Thus .
Assume now that no element of contains . Then there exists such that and intersects the interior of . Note that in this case. Then and (or vice versa). It follows that . ∎
We now prove that the operations of sum and refinement commute: the refinement of a sum is the sum of the refinements.
Observation 3.9.
Let be functions with associated polyhedral complexes . Define . Let be a hyperplane in , and let be the closed halfspaces delimited by . Then .
Proof.
Define . It can be verified that and are defined on the same polyhedral complex, which we denote by . We now fix and show that .
Since , it is -dimensional and either contained in or . Since both cases are symmetric, let us focus on . This means that we can write it as , where for every . Then
where the first and third equations follow from the definition of refinement, while the second and fourth equations follow from the definition of the sum. ∎
The lineality space of a (nonempty) polyhedron is the null space of the constraint matrix . In other words, it is the set of vectors such that for every the whole line is a subset of . We say that the lineality space of is trivial, if it contains only the zero vector, and nontrivial otherwise.
Given a polyhedron , it is well-known that all nonempty faces of share the same lineality space. Therefore, given a polyhedral complex that covers , all the nonempty polyhedra in share the same lineality space . We will call the lineality space of .
Lemma 3.10.
Given an abelian group , pick , with associated polyhedral complexes . Assume that for every the lineality space of is nontrivial. Define , as the underlying polyhedral complex, and as the facet-function of . Then for every hyperplane , the set
is either empty or contains a line.
Proof.
The proof is by induction on . For , the assumptions imply that all are equal to , and each of these polyhedral complexes has as its only nonempty face. Since is empty, no hyperplane such that can exist.
Now fix . Assume by contradiction that there exists a hyperplane such that is nonempty and contains no line. Let be the refinement of with respect to , be the underlying polyhedral complex, and be the associated facet-function. Further, we define , which is a polyhedral complex that covers . Note that if is identified with then we can think of as a polyhedral complex that covers , and the restriction of to , which we denote by , can be seen as a function in . We will prove that does not satisfy the lemma, contradicting the inductive hypothesis.
Since , by Observation 3.9 we have . Note that for every the hyperplane is covered by the elements of . This implies that for every and there exists such that . Then, by Observation 3.6, .
Now, additionally suppose that is contained in , that is, . Let be such that the lineality space of is not a subset of the linear space parallel to . Then no element of contains . By Observation 3.8, . We then conclude that
where is the set of indices such that the lineality space of is a subset of the linear space parallel to . This means that
where is the restriction of to , with . Note that for every the lineality space of is clearly nontrivial, as it coincides with the lineality space of .
Now pick any . Note that if there exists such that , then . It then follows from Observation 3.8 that
In other words,
(7) |
Since (as contains no line), there exists a polyhedron such that and has a facet which does not belong to any other polyhedron in contained in . Then the facet-function associated with satisfies . Let be the -dimensional affine space containing . Then the set
is nonempty, as . Furthermore, we claim that contains no line. To see why this is true, take any such that and , and let be the two polyhedra in having as facet. Then , and thus at least one of these values (say ) is nonzero. Then, by (7), , and thus also . This shows that and therefore contains no line.
We have shown that does not satisfy the lemma. This contradicts the inductive assumption that the lemma holds in dimension . ∎
Finally, we can use this lemma to prove Proposition 3.2.
Proof of Proposition 3.2.
Assume for the sake of a contradiction that
where , and is an affine linear function for every and . Define for every , which is a CPWL function.
Fix any such that . Then is convex. Note that its epigraph
is a polyhedron in defined by inequalities, and thus has nontrivial lineality space. Furthermore, no line orthogonal to the -space is contained in . Since the underlying polyhedral complex of consists of the orthogonal projections of the faces of (excluding itself) onto the -space, this implies that also has nontrivial lineality space. (More precisely, the lineality space of is the projection of the lineality space of .)
If , then is concave. By arguing as above on the convex function , one obtains that the underlying polyhedral complex has again nontrivial lineality space. Thus this property holds for every .
The set of affine linear functions forms an abelian group (with respect to the standard operation of sum of functions), which we denote by . For every , let be the function in with underlying polyhedral complex defined as follows: for every , is the affine linear function that coincides with over . Define and let be the underlying polyhedral complex.
Note that for every , is precisely the affine linear function that coincides with within . However, may not coincide with , as there might exist sharing a facet such that ; when this happens, is affine linear over and therefore and are merged together in . Nonetheless, is a refinement of , i.e., for every there exist (for some ) such that . Moreover, . Denoting by the facet-function associated with , this implies for a facet that if and only if is not a subset of any facet .
Let be a hyperplane as in the statement of the proposition. The above discussion shows that
Using , we obtain a contradiction to Lemma 3.10. ∎
4 A Width Bound for Neural Networks with Small Depth
While the proof of Theorem 1.1 by Arora et al. [6] shows that
it does not provide any bound on the width of the NN required to represent any particular CPWL function. The purpose of this section is to prove that for fixed dimension , the required width for exact, depth-minimal representation of a CPWL function can be polynomially bounded in the number of affine pieces; specifically by . This improves previous bounds by He et al. [35] and is closely related to works that bound the number of linear pieces of an NN as a function of the size [55, 59, 61, 54]. It can also be seen as a counterpart, in the context of exact representations, to quantitative universal approximation theorems that bound the number of neurons required to achieve a certain approximation guarantee; see, e.g., [8, 9, 60, 52, 53].
4.1 The Convex Case
We first derive our result for the case of convex CPWL functions and then use this to also prove the general nonconvex case. Our width bound is a consequence of the following theorem about convex CPWL functions, for which we are going to provide a geometric proof later.
Theorem 4.1.
Let be a convex CPWL function with pieces defined on . Then can be written as
with coefficients .
For the convex case, this yields a stronger version of Theorem 1.3, stating that any (not necessarily convex) CPWL function can be written as a linear combination of -term maxima. Theorem 4.1 is stronger in the sense that it guarantees that all pieces of the -term maxima must be pieces of the original function. This makes it possible to bound the total number of these -term maxima and, therefore, the size of an NN representing , as we will see in the proof of the following theorem.
Theorem 4.2.
Let be a convex CPWL function with affine pieces. Then can be represented by a ReLU NN with depth and width .
Proof.
Using the representation of Theorem 4.1, we can construct an NN computing by computing all the -term max functions in parallel with the construction of Lemma 1.2 (similar to the proof by Arora et al. [6] to show Theorem 1.1). This results in an NN with the claimed depth. Moreover, the width is at most a constant times the number of these -term max functions. This number can be bounded in terms of the number of possible subsets with , which is at most . ∎
Before we present the proof of Theorem 4.1, we show how we can generalize its consequences to the nonconvex case.
4.2 The General (Nonconvex) Case
It is a well-known fact that every CPWL function can be expressed as a difference of two convex CPWL functions, see, e.g., [73, Theorem 1]. This allows us to derive the general case from the convex case. What we need, however, is to bound the number of affine pieces of the two convex CPWL functions in terms of the number of pieces of the original function. Therefore, we consider a specific decomposition for which such bounds can easily be achieved.
Proposition 4.3.
Let be a CPWL function with affine pieces. Then, one can write as where both and are convex CPWL functions with at most pieces.
Proof.
Suppose the affine pieces of are given by , . Define the function and let . Then, obviously, . It remains to show that both and are convex CPWL functions with at most pieces.
The convexity of is clear by definition. Consider the hyperplanes given by , . They divide into at most regions (compare [20, Theorem 1.3]) in each of which is affine. In particular, has at most pieces.
Next, we show that is convex. Intuitively, this holds because each possible breaking hyperplane of is made convex by adding . To make this formal, note that by the definition of convexity, it suffices to show that is convex along each affine line. For this purpose, consider an arbitrary line , , given by and . Let , , and . We need to show that is a convex function. Observe that , , and are clearly one-dimensional CPWL functions with the property . Hence, it suffices to show that is locally convex around each of its breakpoints. Let be an arbitrary breakpoint of . If is already locally convex around , then the same holds for as well since inherits convexity from . Now suppose that is a nonconvex breakpoint of . Then there exist two distinct pieces of , indexed by with , such that for all sufficiently close to . By construction, contains the summand . Thus, adding this summand to linearizes the nonconvex breakpoint of , while adding all the other summands preserves convexity. In total, is locally convex around , which finishes the proof that is a convex function.
Finally, observe that pieces of are always intersections of pieces of and , for which we have only possibilities. ∎
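The decomposition can be checked on a small one-dimensional example. The sketch below assumes, as the proof above suggests, that the first convex function is the sum of all pairwise maxima of the affine pieces and that the second is its difference with the original function. The example function min(|t|, 1) and all identifiers are illustrative choices, not taken from the paper.

```python
from fractions import Fraction
from itertools import combinations

# Affine pieces t -> a*t + b of the example f(t) = min(|t|, 1), as (a, b).
pieces = [(-1, 0), (1, 0), (0, 1)]


def f(t):
    return min(abs(t), 1)


def g(t):
    # Assumed form: sum of the maxima over all pairs of affine pieces.
    vals = [a * t + b for a, b in pieces]
    return sum(max(u, v) for u, v in combinations(vals, 2))


def h(t):
    return g(t) - f(t)


def is_convex_on_grid(func, grid):
    # Nonnegative second differences on an equally spaced grid. For a
    # CPWL function whose breakpoints lie on the grid, this is equivalent
    # to convexity on the covered interval.
    return all(
        func(grid[i - 1]) - 2 * func(grid[i]) + func(grid[i + 1]) >= 0
        for i in range(1, len(grid) - 1)
    )


grid = [Fraction(k, 8) for k in range(-40, 41)]  # breakpoints -1, 0, 1 included
assert not is_convex_on_grid(f, grid)  # f has nonconvex breakpoints at -1 and 1
assert is_convex_on_grid(g, grid)
assert is_convex_on_grid(h, grid)
assert all(f(t) == g(t) - h(t) for t in grid)
```

In this example the second function simplifies to max(2, 2|t|), so the nonconvex breakpoints of the original function at -1 and 1 become convex breakpoints, in line with the argument in the proof.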
Having this, we may conclude the following.
See Theorem 1.9.
Proof.
Consider the decomposition from Proposition 4.3. Using Theorem 4.2, we obtain that both and can be represented with the required depth and with width . Thus, the same holds true for . ∎
4.3 Extended Newton Polyhedra of Convex CPWL Functions
For our proof of Theorem 4.1, we use a correspondence of convex CPWL functions with certain polyhedra, which are known as (extended) Newton polyhedra in tropical geometry [49]. These relations between tropical geometry and neural networks have previously been applied to investigate expressivity of NNs; compare our references in Section 1.5.
In order to formalize this correspondence, let be the set of convex CPWL functions of type . For in , we define its so-called extended Newton polyhedron to be
where the “+” stands for Minkowski addition. We denote the set of all possible extended Newton polyhedra in as . That is, is the set of (unbounded) polyhedra in that emerge from a polytope by adding the negative of the -st unit vector as an extreme ray. Hence, a set is an element of if and only if can be written as
Conversely, for a polyhedron of this form, let be the function defined by .
There is an intuitive way of thinking about the extended Newton polyhedron of a convex CPWL function : it consists of all hyperplane coefficients such that for all . This also explains why we add the extreme ray : decreasing obviously maintains the property of being a lower bound on the function . Hence, if a point belongs to the extended Newton polyhedron , then also all points with should belong to it. Thus, should be contained in the recession cone of .
In fact, there is a one-to-one correspondence between elements of and , which is nicely compatible with some (functional and polyhedral) operations. This correspondence has been studied before in tropical geometry [49, 41], convex geometry [39], as well as neural network literature [76, 15, 2, 54]. (Footnote 2: is the negative of the epigraph of the convex conjugate of .) We summarize the key findings about this correspondence relevant to our work in the following proposition:
Proposition 4.4.
Let and . Then it holds that
(i) the functions and are well-defined, that is, their output is independent from the representation of the input by pieces or vertices, respectively,
(ii) and are bijections and inverse to each other,
(iii) ,
(iv) , where the on the right-hand side is Minkowski addition.
An algebraic way of phrasing this proposition is as follows: and are isomorphisms between the semirings and .
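The compatibility of the correspondence with maxima and sums can be illustrated numerically. In the sketch below, a convex CPWL function is stored as a finite list of coefficient pairs (a, b), each representing the affine piece a·x + b. Taking the union of two lists is assumed to model the convex hull of the union of the polyhedra, and pairwise addition of the pairs to model their Minkowski sum. All names and the sample data are hypothetical.

```python
from itertools import product


def evaluate(points, x):
    """Evaluate the maximum of a . x + b over all pairs (a, b) in points."""
    return max(sum(ai * xi for ai, xi in zip(a, x)) + b for a, b in points)


def minkowski_sum(P, Q):
    """Pairwise sums of the generating points of two polyhedra."""
    return [
        (tuple(u + v for u, v in zip(a1, a2)), b1 + b2)
        for (a1, b1), (a2, b2) in product(P, Q)
    ]


# Generating points (a, b) of two hypothetical convex CPWL functions on R^2.
P = [((1, 0), 0), ((0, 1), 2), ((-1, -1), 1)]
Q = [((0, 0), 0), ((2, -1), -3)]

samples = [(x1, x2) for x1 in range(-3, 4) for x2 in range(-3, 4)]
for x in samples:
    fP, fQ = evaluate(P, x), evaluate(Q, x)
    # Union of generating points corresponds to the pointwise maximum.
    assert evaluate(P + Q, x) == max(fP, fQ)
    # Minkowski sum corresponds to the pointwise sum.
    assert evaluate(minkowski_sum(P, Q), x) == fP + fQ
```

Only the generating points are needed here, because a linear functional that is bounded on such a polyhedron attains its maximum at one of them.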
4.4 Proof of Theorem 4.1
The rough idea to prove Theorem 4.1 is as follows. Suppose we have a -term max function with . By Proposition 4.4, corresponds to a polyhedron with at least vertices. Applying a classical result from discrete geometry known as Radon’s theorem allows us to carefully decompose into a “signed” Minkowski sum of polyhedra in whose vertices are subsets of at most out of the vertices of . (Footnote 3: Some polyhedra may occur with “negative” coefficients in that sum, meaning that they are actually added to instead of the other polyhedra. The corresponding CPWL functions will then have negative coefficients in the linear combination representing .) Translating this back into the world of CPWL functions by Proposition 4.4 yields that can be written as a linear combination of -term maxima with , where each of them involves a subset of the affine terms of . We can then obtain Theorem 4.1 by iterating until every occurring maximum expression involves at most terms.
We start with a proposition that will be useful for our proof of Theorem 4.1. Although its statement is well-known in the discrete geometry community, we include a proof for the sake of completeness. To show the proposition, we make use of Radon’s theorem (compare [20, Theorem 4.1]), stating that any set of at least points in can be partitioned into two nonempty subsets such that their convex hulls intersect.
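For readers who want to experiment with Radon's theorem, the following sketch computes a Radon partition for a small point set in exact rational arithmetic. It finds a nonzero affine dependence among the points by Gaussian elimination and splits the points according to the sign of their coefficients. The helper names are hypothetical, and the code is only an illustration of the theorem, not part of the proof of Proposition 4.5.

```python
from fractions import Fraction


def null_vector(rows, ncols):
    """Return a nonzero rational v with rows * v = 0 (needs rank < ncols)."""
    A = [[Fraction(x) for x in row] for row in rows]
    pivots = []  # pivot column of each reduced row
    r = 0
    for c in range(ncols):
        piv = next((i for i in range(r, len(A)) if A[i][c] != 0), None)
        if piv is None:
            continue
        A[r], A[piv] = A[piv], A[r]
        A[r] = [x / A[r][c] for x in A[r]]
        for i in range(len(A)):
            if i != r and A[i][c] != 0:
                factor = A[i][c]
                A[i] = [x - factor * y for x, y in zip(A[i], A[r])]
        pivots.append(c)
        r += 1
        if r == len(A):
            break
    free = next(c for c in range(ncols) if c not in pivots)
    v = [Fraction(0)] * ncols
    v[free] = Fraction(1)  # set one free variable to 1, the others to 0
    for row, c in zip(A, pivots):
        v[c] = -row[free]
    return v


def radon_partition(points):
    """Split at least n+2 points in R^n into two index sets with a common hull point."""
    n, m = len(points[0]), len(points)
    # Affine dependence: sum(lam_i * p_i) = 0 and sum(lam_i) = 0.
    rows = [[p[k] for p in points] for k in range(n)] + [[1] * m]
    lam = null_vector(rows, m)
    pos = [i for i in range(m) if lam[i] > 0]
    neg = [i for i in range(m) if lam[i] <= 0]
    total = sum(lam[i] for i in pos)
    common = tuple(
        sum(lam[i] * points[i][k] for i in pos) / total for k in range(n)
    )
    return pos, neg, common


pos, neg, common = radon_partition([(0, 0), (4, 0), (0, 4), (1, 1)])
assert pos == [3] and neg == [0, 1, 2] and common == (1, 1)
```

In the example, the fourth point lies inside the triangle spanned by the first three, so the point itself is the common point of the two convex hulls.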
Proposition 4.5.
Given vectors , , there exists a nonempty subset featuring the following property: there is no with and such that
(8) |
Proof.
Radon’s theorem applied to the at least vectors , , yields a nonempty subset and coefficients with such that . Suppose that without loss of generality (otherwise exchange the roles of and ).
The following proposition is a crucial step in order to show that any convex CPWL function with pieces can be expressed as an integer linear combination of convex CPWL functions with at most pieces.
Proposition 4.6.
Let be a convex CPWL function defined on with . Then there exists a subset such that
(9) |
Proof.
Consider the vectors , . Choose according to Proposition 4.5. We show that this choice of guarantees equation (9).
For , let and consider its extended Newton polyhedron . By Proposition 4.4, equation (9) is equivalent to
where the sums are Minkowski sums.
We show this equation by showing that for all vectors it holds that
(10) |
Let be an arbitrary vector. If , both sides of (10) are infinite. Hence, from now on, assume that . Then, both sides of (10) are finite since is the only extreme ray of all involved polyhedra.
Due to our choice of according to Proposition 4.5, there exists an index such that
(11) |
We define a bijection between the even and the odd subsets of as follows:
That is, changes the parity of by adding or removing . Considering the corresponding polyhedra and , this means that adds or removes the extreme point to or from . Due to (11) this does not change the optimal value of maximizing in -direction over the polyhedra, that is,
Hence, we may conclude
which proves (10). Thus, the claim follows. ∎
With the help of this result, we can now prove Theorem 4.1.
Proof of Theorem 4.1.
Let be a convex CPWL function defined on . Having a closer look at the statement of Proposition 4.6, observe that only one term at the left-hand side of (9) contains all affine combinations . Putting all other maximum terms on the other side, we may write as an integer linear combination of maxima of at most summands. Repeating this procedure until we have eliminated all maximum terms with more than summands yields the desired representation. ∎
4.5 Potential Approaches to Show Lower Bounds on the Width
In light of the upper width bounds shown in this section, a natural question to ask is whether meaningful lower bounds can also be achieved. This would mean constructing a family of CPWL functions with pieces defined on (with different values of and ), for which we can prove that a large width is required to represent these functions with NNs of depth .
A trivial and not very satisfying answer follows, e.g., from [61] or [67]: for fixed input dimension , they show that a function computed by an NN with hidden layers and width has at most pieces. For our setting, this means that an NN with logarithmic depth needs a width of at least to represent a function with pieces. This is, of course, very far away from our upper bounds.
Similar upper bounds on the number of pieces have been proven by many other authors and are often used to show depth-width trade-offs [55, 54, 59, 70, 6]. However, there is a good reason why all these results only give rise to very trivial lower bounds for our setting: the focus is always on functions with considerably many pieces, which then, consequently, need many neurons to be represented (with small depth). However, since the lower bounds we strive for depend on the number of pieces, we would need to construct a family of functions with comparably few pieces that still need a lot of neurons to be represented. In general, it seems to be a tough task to argue why such functions should exist.
A different approach could leverage methods from complexity theory, in particular from circuit complexity. Neural networks are basically arithmetic circuits with very special operations allowed. In fact, they can be seen as a tropical variant of arithmetic circuits. Showing circuit lower bounds is a notoriously difficult task in complexity theory, but maybe some conditional result (based on common conjectures similar to P ≠ NP) could be established.
We think that the question whether our bounds are tight, or whether at least some non-trivial lower bounds on the width for NNs with logarithmic depth can be shown, is an exciting question for further research.
5 Understanding Expressivity via Newton Polytopes
In Section 2, we presented a mixed-integer programming approach towards proving that deep NNs can strictly represent more functions than shallow ones. However, even if we could prove that it is indeed enough to consider -conforming NNs, this approach would not generalize to deeper networks due to computational limitations. Therefore, different ideas are needed to prove 1.4 in its full generality. In this section, we point out that Newton polytopes of convex CPWL functions (similar to what we used in the previous section) could also be a way of proving 1.4. Using a homogenized version of Proposition 4.4, we provide an equivalent formulation of 1.4 that is completely phrased in the language of discrete geometry.
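Recall that, by Proposition 2.3, we may restrict ourselves to NNs without biases. In particular, all CPWL functions represented by such NNs, or by parts of them, are positively homogeneous. For the associated extended Newton polyhedra (compare Proposition 4.4), this has the following consequence: all vertices $(a, b) \in \mathbb{R}^n \times \mathbb{R}$ lie in the hyperplane $b = 0$, that is, their $(n+1)$-st coordinate is $0$. Therefore, the extended Newton polyhedron of a positively homogeneous, convex CPWL function $f(x) = \max\{a_1^\top x, \dots, a_p^\top x\}$ is completely characterized by the so-called Newton polytope, that is, the polytope $\operatorname{conv}(\{a_1, \dots, a_p\}) \subseteq \mathbb{R}^n$.
To make this formal, let $\overline{\mathrm{CCPWL}}_n$ be the set of all positively homogeneous, convex CPWL functions of type $\mathbb{R}^n \to \mathbb{R}$ and let $\overline{\mathrm{Newt}}_n$ be the set of all convex polytopes in $\mathbb{R}^n$. Moreover, for $f(x) = \max\{a_1^\top x, \dots, a_p^\top x\}$ in $\overline{\mathrm{CCPWL}}_n$, let
$$\overline{\mathcal{N}}(f) := \operatorname{conv}(\{a_1, \dots, a_p\}) \in \overline{\mathrm{Newt}}_n$$
be the associated Newton polytope of $f$ and for $P = \operatorname{conv}(\{a_1, \dots, a_p\}) \in \overline{\mathrm{Newt}}_n$ let
$$\overline{\mathcal{F}}(P)(x) := \max\{a_1^\top x, \dots, a_p^\top x\}$$
be the so-called associated support function [38] of $P$ in $\overline{\mathrm{CCPWL}}_n$. With this notation, we obtain the following variant of Proposition 4.4.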
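Proposition 5.1.
Let $f_1, f_2 \in \overline{\mathrm{CCPWL}}_n$ and $P_1, P_2 \in \overline{\mathrm{Newt}}_n$. Then it holds that
(i) the functions $\overline{\mathcal{N}}$ and $\overline{\mathcal{F}}$ are well-defined, that is, their output is independent of the representation of the input by pieces or vertices, respectively,
(ii) $\overline{\mathcal{N}}$ and $\overline{\mathcal{F}}$ are bijections and inverse to each other,
(iii) $\overline{\mathcal{N}}(\max\{f_1, f_2\}) = \operatorname{conv}(\overline{\mathcal{N}}(f_1) \cup \overline{\mathcal{N}}(f_2))$,
(iv) $\overline{\mathcal{N}}(f_1 + f_2) = \overline{\mathcal{N}}(f_1) + \overline{\mathcal{N}}(f_2)$, where the $+$ on the right-hand side is Minkowski addition.
In other words, $\overline{\mathcal{N}}$ and $\overline{\mathcal{F}}$ are isomorphisms between the semirings $(\overline{\mathrm{CCPWL}}_n, \max, +)$ and $(\overline{\mathrm{Newt}}_n, \operatorname{conv}, +)$.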
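The correspondence between functions and polytopes stated above can be checked numerically on a small instance. The following is a minimal sketch, not part of the original argument: the point sets `P` and `Q` and the helper names `support`, `hull_of_union`, and `minkowski_sum` are hypothetical and chosen only for illustration. Polytopes are represented by finite point sets whose convex hull they are, so the convex hull of a union is simply the union of the point sets.

```python
import random

def support(points, x):
    """Support function of conv(points) at x: the max of <a, x> over all points a."""
    return max(sum(ai * xi for ai, xi in zip(a, x)) for a in points)

def hull_of_union(P, Q):
    """Point set whose convex hull equals conv(conv(P) ∪ conv(Q))."""
    return list(P) + list(Q)

def minkowski_sum(P, Q):
    """Point set whose convex hull equals the Minkowski sum conv(P) + conv(Q)."""
    return [tuple(p + q for p, q in zip(a, b)) for a in P for b in Q]

# Two hypothetical polytopes in the plane, given by generating points.
P = [(0, 0), (1, 0), (0, 1)]
Q = [(2, 1), (-1, 3)]

rng = random.Random(0)
samples = [(rng.uniform(-5, 5), rng.uniform(-5, 5)) for _ in range(1000)]

# Maximum of support functions <-> convex hull of the union.
max_ok = all(
    abs(support(hull_of_union(P, Q), x) - max(support(P, x), support(Q, x))) < 1e-9
    for x in samples
)
# Sum of support functions <-> Minkowski sum.
sum_ok = all(
    abs(support(minkowski_sum(P, Q), x) - (support(P, x) + support(Q, x))) < 1e-9
    for x in samples
)
print(max_ok, sum_ok)
```

Sampling random directions is of course only a sanity check on one example and does not replace the proof.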
Next, we study which polytopes can appear as Newton polytopes of convex CPWL functions computed by NNs with a certain depth; compare Zhang et al. [76].
Before we apply the first ReLU activation, any function computed by an NN is linear. Thus, the corresponding Newton polytope is a single point. Starting from that, let us investigate a neuron in the first hidden layer. Here, the ReLU activation function computes a maximum of a linear function and $0$. Therefore, the Newton polytope of the resulting function is the convex hull of two points, that is, a line segment. After the first hidden layer, arbitrarily many functions of this type can be added up. For the corresponding Newton polytopes, this means that we take the Minkowski sum of line segments, resulting in a so-called zonotope.
Now, this construction can be repeated layerwise, making use of Proposition 5.1: in each hidden layer, we can compute the maximum of two functions computed by the previous layers, which translates to obtaining the new Newton polytope as a convex hull of the union of the two original Newton polytopes. In addition, the linear combinations between layers translate to scaling and taking Minkowski sums of Newton polytopes.
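To make the first step of this construction concrete, the following sketch builds the zonotope belonging to a one-hidden-layer network without biases whose output weights are all $+1$, and compares its support function with the network output. The weight vectors and the function names `one_hidden_layer` and `zonotope_points` are assumptions made for this illustration only. Each neuron contributes the segment from the origin to its weight vector, so the zonotope is the convex hull of all subset sums of the weight vectors.

```python
from itertools import product

def one_hidden_layer(weights, x):
    """Output of sum_i max(0, <w_i, x>): one hidden layer, no biases, output weights +1."""
    return sum(max(0, sum(wj * xj for wj, xj in zip(w, x))) for w in weights)

def zonotope_points(weights):
    """All subset sums of the weight vectors; their convex hull is the
    Minkowski sum of the segments conv({0, w_i})."""
    dim = len(weights[0])
    sums = set()
    for choice in product((0, 1), repeat=len(weights)):
        sums.add(tuple(sum(c * w[j] for c, w in zip(choice, weights)) for j in range(dim)))
    return sorted(sums)

def support(points, x):
    """Support function of conv(points) at x."""
    return max(sum(aj * xj for aj, xj in zip(a, x)) for a in points)

# Hypothetical first-layer weights of three neurons in the plane.
weights = [(1, 0), (0, 1), (1, 1)]
Z = zonotope_points(weights)

# Integer inputs keep all arithmetic exact.
grid = list(product(range(-3, 4), repeat=2))
agrees = all(one_hidden_layer(weights, x) == support(Z, x) for x in grid)
print(len(Z), agrees)
```

Enumerating subset sums is exponential in the number of neurons; it is used here only because it mirrors the definition directly.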
This intuition motivates the following definition. Let $\mathcal{P}_n^{(0)}$ be the set of all polytopes in $\mathbb{R}^n$ that consist only of a single point. Then, for each $k \geq 1$, we recursively define
$$\mathcal{P}_n^{(k)} := \left\{ \sum_{i=1}^{p} \operatorname{conv}(P_i \cup Q_i) \;\middle|\; P_i, Q_i \in \mathcal{P}_n^{(k-1)},\ p \in \mathbb{N} \right\},$$
where the sum is a Minkowski sum of polytopes. A first, but not precisely accurate interpretation is as follows: the set $\mathcal{P}_n^{(k)}$ contains the Newton polytopes of positively homogeneous, convex CPWL functions representable with a $k$-hidden-layer NN. See Figure 5 for an illustration.
Unfortunately, this interpretation is not accurate for the following reason: our NNs are allowed to have negative weights, which cannot be fully captured by Minkowski sums as introduced above. Therefore, it might be possible that a $k$-hidden-layer NN can compute a convex function with Newton polytope not in $\mathcal{P}_n^{(k)}$. Luckily, one can remedy this shortcoming, and even extend the interpretation to the non-convex case, by representing the computed function as a difference of two convex functions.
Theorem 5.2.
A positively homogeneous (not necessarily convex) CPWL function can be computed by a $k$-hidden-layer NN if and only if it can be written as the difference of two positively homogeneous, convex CPWL functions with Newton polytopes in $\mathcal{P}_n^{(k)}$.
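Proof.
We use induction on $k$. For $k = 0$, the statement is clear since it holds precisely for linear functions. For the induction step, suppose that, for some $k \geq 1$, the equivalence is valid up to $k-1$ hidden layers. We prove that it is also valid for $k$ hidden layers.
We need to show two directions. For the first direction, assume that $f$ is an arbitrary, positively homogeneous CPWL function that can be written as $f = g - h$ with $\overline{\mathcal{N}}(g), \overline{\mathcal{N}}(h) \in \mathcal{P}_n^{(k)}$. We need to show that a $k$-hidden-layer NN can compute $f$. We show that this is even true for $g$ and $h$, and hence, also for $f$. By definition of $\mathcal{P}_n^{(k)}$, there exist a finite number $p \in \mathbb{N}$ and polytopes $P_i, Q_i \in \mathcal{P}_n^{(k-1)}$, $i \in [p]$, such that $\overline{\mathcal{N}}(g) = \sum_{i=1}^{p} \operatorname{conv}(P_i \cup Q_i)$. By Proposition 5.1, we have $g = \sum_{i=1}^{p} \max\{\overline{\mathcal{F}}(P_i), \overline{\mathcal{F}}(Q_i)\}$. By induction, $\overline{\mathcal{F}}(P_i)$ and $\overline{\mathcal{F}}(Q_i)$ can be computed by NNs with $k-1$ hidden layers. Since the maximum terms can be computed with a single hidden layer, in total a $k$-th hidden layer is sufficient to compute $g$. An analogous argument applies to $h$. Thus, $f$ is computable with $k$ hidden layers, completing the first direction.
For the other direction, suppose that $f$ is an arbitrary, positively homogeneous CPWL function that can be computed by a $k$-hidden-layer NN. Let us separately consider the $n_k$ neurons in the $k$-th hidden layer of the NN. Let $a_i$, $i \in [n_k]$, be the weight of the connection from the $i$-th neuron in that layer to the output. Without loss of generality, we have $a_i \in \{\pm 1\}$, because otherwise we can normalize it and multiply the weights of the incoming connections to the $i$-th neuron with $|a_i|$ instead. Moreover, let us assume that, by potential reordering, there is some $m \leq n_k$ such that $a_i = 1$ for $i \leq m$ and $a_i = -1$ for $i > m$. With these assumptions, we can write
$$f = \sum_{i=1}^{m} \max\{0, f_i\} - \sum_{i=m+1}^{n_k} \max\{0, f_i\}, \qquad (12)$$
where each $f_i$ is computable by a $(k-1)$-hidden-layer NN, namely the sub-NN computing the input to the $i$-th neuron in the $k$-th hidden layer.
By induction, we obtain $f_i = g_i - h_i$ for some positively homogeneous, convex functions $g_i, h_i$ with $\overline{\mathcal{N}}(g_i), \overline{\mathcal{N}}(h_i) \in \mathcal{P}_n^{(k-1)}$. We then have
$$\max\{0, f_i\} = \max\{g_i, h_i\} - h_i. \qquad (13)$$
We define
$$g := \sum_{i=1}^{m} \max\{g_i, h_i\} + \sum_{i=m+1}^{n_k} h_i$$
and
$$h := \sum_{i=1}^{m} h_i + \sum_{i=m+1}^{n_k} \max\{g_i, h_i\}.$$
Note that $g$ and $h$ are convex by construction as a sum of convex functions and that (12) and (13) imply $f = g - h$. Moreover, by Proposition 5.1,
$$\overline{\mathcal{N}}(g) = \sum_{i=1}^{m} \operatorname{conv}(\overline{\mathcal{N}}(g_i) \cup \overline{\mathcal{N}}(h_i)) + \sum_{i=m+1}^{n_k} \operatorname{conv}(\overline{\mathcal{N}}(h_i) \cup \overline{\mathcal{N}}(h_i)) \in \mathcal{P}_n^{(k)}$$
and
$$\overline{\mathcal{N}}(h) = \sum_{i=1}^{m} \operatorname{conv}(\overline{\mathcal{N}}(h_i) \cup \overline{\mathcal{N}}(h_i)) + \sum_{i=m+1}^{n_k} \operatorname{conv}(\overline{\mathcal{N}}(g_i) \cup \overline{\mathcal{N}}(h_i)) \in \mathcal{P}_n^{(k)}.$$
Hence, $f$ can be represented as desired, completing also the other direction. ∎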
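The second direction of the proof is constructive, and the construction can be traced on a toy example. The sketch below is only an illustration under stated assumptions: the list `neurons` is hypothetical data, where each entry holds an output weight in $\{+1, -1\}$ together with generating point sets for the Newton polytopes of two convex functions whose difference is the input to that neuron. The code assembles point sets for the Newton polytopes of the two convex parts of the network output and checks that their support functions differ by exactly the network output.

```python
import random

def support(points, x):
    """Support function of conv(points) at x: the max of <a, x> over all points a."""
    return max(sum(ai * xi for ai, xi in zip(a, x)) for a in points)

def minkowski_sum(P, Q):
    """Point set whose convex hull is the Minkowski sum conv(P) + conv(Q)."""
    return [tuple(p + q for p, q in zip(a, b)) for a in P for b in Q]

# Hypothetical last hidden layer: (output weight, points for g_i, points for h_i),
# where the input to neuron i is f_i = g_i - h_i.
neurons = [
    (+1, [(1, 0), (0, 1)], [(0, 0)]),
    (+1, [(2, -1)], [(0, 0), (1, 1)]),
    (-1, [(0, 0), (-1, 2)], [(1, 0)]),
]

def network_output(x):
    """f(x) = sum_i a_i * max(0, g_i(x) - h_i(x))."""
    return sum(a * max(0.0, support(G, x) - support(H, x)) for a, G, H in neurons)

def newton_polytopes(neurons):
    """Point sets for the Newton polytopes of g and h with f = g - h."""
    dim = len(neurons[0][1][0])
    N_g, N_h = [(0,) * dim], [(0,) * dim]
    for a, G, H in neurons:
        hull = list(G) + list(H)  # Newton polytope of max{g_i, h_i}
        if a > 0:   # +max{0, f_i} = max{g_i, h_i} - h_i
            N_g, N_h = minkowski_sum(N_g, hull), minkowski_sum(N_h, H)
        else:       # -max{0, f_i} = h_i - max{g_i, h_i}
            N_g, N_h = minkowski_sum(N_g, H), minkowski_sum(N_h, hull)
    return N_g, N_h

N_g, N_h = newton_polytopes(neurons)
rng = random.Random(1)
samples = [(rng.uniform(-3, 3), rng.uniform(-3, 3)) for _ in range(1000)]
max_error = max(
    abs(network_output(x) - (support(N_g, x) - support(N_h, x))) for x in samples
)
print(max_error < 1e-9)
```

The point sets returned here generate the polytopes but are not reduced to vertices; that is sufficient for evaluating support functions.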
The power of Theorem 5.2 lies in the fact that it provides a purely geometric characterization of the class of functions computable by $k$-hidden-layer NNs. The classes $\mathcal{P}_n^{(k)}$ of polytopes are solely defined by the two simple geometric operations Minkowski sum and convex hull of the union. Therefore, understanding this class of functions is equivalent to understanding what polytopes one can generate by iterative application of these geometric operations.
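In particular, we can give yet another equivalent reformulation of our main conjecture. To this end, let the simplex $\Delta_n := \operatorname{conv}(\{0, e_1, \dots, e_n\}) \subseteq \mathbb{R}^n$ denote the Newton polytope of the function $f_n = \max\{0, x_1, \dots, x_n\}$ for each $n \in \mathbb{N}$.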
Conjecture 5.3.
For every $k \in \mathbb{N}$, $n = 2^k$, there does not exist a pair of polytopes $P, Q \in \mathcal{P}_n^{(k)}$ with $P = Q + \Delta_n$ (Minkowski sum).
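Proof.
We show that Conjecture 5.3 is indeed equivalent to Conjecture 1.4. By Proposition 1.6, it suffices to show equivalence between Conjecture 5.3 and Conjecture 1.5. By Theorem 5.2, the function $f_n = \max\{0, x_1, \dots, x_n\}$ can be represented with $k$ hidden layers if and only if there are functions $g$ and $h$ with Newton polytopes in $\mathcal{P}_n^{(k)}$ satisfying $g - h = f_n$, that is, $g = h + f_n$. By Proposition 5.1, this happens if and only if there are polytopes $P, Q \in \mathcal{P}_n^{(k)}$ with $P = Q + \Delta_n$. ∎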
It is particularly interesting to look at special cases with small $k$. For $k = 1$, the set $\mathcal{P}_n^{(1)}$ is the set of all zonotopes. Hence, the (known) statement that $\max\{0, x_1, x_2\}$ cannot be computed with one hidden layer [56] is equivalent to the fact that the Minkowski sum of a zonotope and a triangle can never be a zonotope.
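The planar statement about zonotopes and triangles can be illustrated on a single instance. The sketch below relies on the standard fact that a convex polygon is a zonotope if and only if it is centrally symmetric. The generators, the triangle, and all function names are hypothetical choices for this example; the code checks one instance and does not establish the general claim.

```python
from itertools import product

def cross(o, a, b):
    """Twice the signed area of the triangle (o, a, b)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Vertices of the convex hull in counterclockwise order (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def is_centrally_symmetric(vertices):
    """A point reflection swaps the lexicographically smallest and largest vertex,
    so the only candidate center is their midpoint."""
    vs = set(vertices)
    lo, hi = min(vs), max(vs)
    twice_center = (lo[0] + hi[0], lo[1] + hi[1])
    return all((twice_center[0] - v[0], twice_center[1] - v[1]) in vs for v in vs)

def minkowski_sum(P, Q):
    return [(a[0] + b[0], a[1] + b[1]) for a in P for b in Q]

# Hypothetical zonotope: Minkowski sum of the segments [0, g] for three generators g.
generators = [(1, 0), (0, 1), (1, 1)]
zonotope = [
    (sum(c * g[0] for c, g in zip(choice, generators)),
     sum(c * g[1] for c, g in zip(choice, generators)))
    for choice in product((0, 1), repeat=len(generators))
]
triangle = [(0, 0), (1, 0), (0, 1)]  # Newton polytope of max{0, x1, x2}

Z = convex_hull(zonotope)
S = convex_hull(minkowski_sum(zonotope, triangle))
print(is_centrally_symmetric(Z), is_centrally_symmetric(S))
```

In this instance the zonotope is a centrally symmetric hexagon, while adding the triangle yields a polygon with seven vertices, which cannot be centrally symmetric.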
The first open case is the case $k = 2$. An unconditional proof that two hidden layers do not suffice to compute the maximum of five numbers is highly desired. In the regime of Newton polytopes, this means understanding the class $\mathcal{P}_n^{(2)}$. It consists of finite Minkowski sums of polytopes that arise as the convex hull of the union of two zonotopes. Hence, the major open question here is to classify this set of polytopes.
Finally, let us remark that there exists a generalization of the concept of polytopes, known as virtual polytopes [58], that makes it possible to assign a Newton polytope also to non-convex CPWL functions. This makes use of the fact that every (non-convex) CPWL function is a difference of two convex ones. Consequently, a virtual polytope is a formal Minkowski difference of two ordinary polytopes. Using this concept, Theorem 5.2 and Conjecture 5.3 can be phrased in a simpler way, replacing the pair of polytopes with a single virtual polytope.
6 Future Research
The most obvious and, at the same time, most exciting open research question is to prove or disprove Conjecture 1.4, or equivalently Conjecture 1.5 or Conjecture 5.3. The first step could be to prove that it is indeed enough to consider $H$-conforming NNs. This is intuitive because every breakpoint introduced at any place outside the hyperplanes of the underlying arrangement needs to be canceled out later. Therefore, it is natural to assume that these breakpoints do not have to be introduced in the first place. However, this intuition does not seem to be enough for a formal proof because it could occur that additional breakpoints in intermediate steps, which are canceled out later, also influence the behavior of the function at other places where we allow breakpoints in the end.
Another step towards resolving our conjecture may be to find an alternative proof of Theorem 1.7 that does not use $H$-conforming NNs. This might also be beneficial for generalizing our techniques to more hidden layers, since, while theoretically possible, a direct generalization of the MIP approach is infeasible due to computational limitations. For example, it might be particularly promising to use a tropical approach as described in Section 5 and apply methods from polytope theory to prove Conjecture 5.3.
In light of our results from Section 3, it would be desirable to provide a complete characterization of the functions representable with a given number $k$ of hidden layers. Another potential research goal is improving our upper bounds on the width from Section 4 and/or proving matching lower bounds as discussed in Section 4.5.
Some more interesting research directions are the following:
- establishing or strengthening our results for special classes of NNs like recurrent neural networks (RNNs) or convolutional neural networks (CNNs),
- using exact representation results to show more drastic depth-width trade-offs compared to existing results in the literature,
- understanding how the class of representable functions changes when a polynomial upper bound is imposed on the width of the NN; see related work by Vardi et al. [72].
References
- [1] M. Abrahamsen, L. Kleist, and T. Miltzow. Training neural networks is ER-complete. Advances in Neural Information Processing Systems (NeurIPS), 34, 2021.
- [2] M. Alfarra, A. Bibi, H. Hammoud, M. Gaafar, and B. Ghanem. On the decision boundaries of neural networks: A tropical geometry perspective. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
- [3] A. M. Alvarez, Q. Louveaux, and L. Wehenkel. A machine learning-based approximation of strong branching. INFORMS Journal on Computing, 29(1):185–195, 2017.
- [4] R. Anderson, J. Huchette, W. Ma, C. Tjandraatmadja, and J. P. Vielma. Strong mixed-integer programming formulations for trained neural networks. Mathematical Programming, pages 1–37, 2020.
- [5] M. Anthony and P. L. Bartlett. Neural network learning: Theoretical foundations. Cambridge University Press, 1999.
- [6] R. Arora, A. Basu, P. Mianjy, and A. Mukherjee. Understanding deep neural networks with rectified linear units. In International Conference on Learning Representations, 2018.
- [7] R. Bagnara, P. M. Hill, and E. Zaffanella. The Parma Polyhedra Library: Toward a complete set of numerical abstractions for the analysis and verification of hardware and software systems. Science of Computer Programming, 72(1–2):3–21, 2008.
- [8] A. R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information theory, 39(3):930–945, 1993.
- [9] A. R. Barron. Approximation and estimation bounds for artificial neural networks. Machine learning, 14(1):115–133, 1994.
- [10] Y. Bengio, A. Lodi, and A. Prouvost. Machine learning for combinatorial optimization: a methodological tour d’horizon. European Journal of Operational Research, 2020.
- [11] D. Bertschinger, C. Hertrich, P. Jungeblut, T. Miltzow, and S. Weber. Training fully connected neural networks is ER-complete. arXiv:2204.01368, 2022.
- [12] D. Bienstock, G. Muñoz, and S. Pokutta. Principled deep neural network training through linear programming. arXiv:1810.03218, 2018.
- [13] P. Bonami, A. Lodi, and G. Zarpellon. Learning a classification of mixed-integer quadratic programming problems. In International Conference on the Integration of Constraint Programming, Artificial Intelligence, and Operations Research, pages 595–604. Springer, 2018.
- [14] D. Boob, S. S. Dey, and G. Lan. Complexity of training relu neural network. Discrete Optimization, 44, 2022.
- [15] V. Charisopoulos and P. Maragos. A tropical approach to neural networks with piecewise linear activations. arXiv preprint arXiv:1805.08749, 2018.
- [16] K.-L. Chen, H. Garudadri, and B. D. Rao. Improved bounds on neural complexity for representing piecewise linear functions. In Advances in Neural Information Processing Systems, 2022.
- [17] S. Chen, A. R. Klivans, and R. Meka. Learning Deep ReLU Networks Is Fixed-Parameter Tractable. In N. K. Vishnoi, editor, 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pages 696–707, 2022.
- [18] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4):303–314, 1989.
- [19] S. S. Dey, G. Wang, and Y. Xie. Approximation algorithms for training one-node relu neural networks. IEEE Transactions on Signal Processing, 68:6696–6706, 2020.
- [20] H. Edelsbrunner. Algorithms in Combinatorial Geometry. Springer Science & Business Media, 1987.
- [21] R. Eldan and O. Shamir. The power of depth for feedforward neural networks. In Conference on Learning Theory, pages 907–940, 2016.
- [22] M. Fischetti and J. Jo. Deep neural networks as 0-1 mixed integer linear programs: A feasibility study. arXiv preprint arXiv:1712.06174, 2017.
- [23] V. Froese, C. Hertrich, and R. Niedermeier. The computational complexity of ReLU network training parameterized by data dimensionality. Journal of Artificial Intelligence Research, 74:1775–1790, 2022.
- [24] M. Gasse, D. Chételat, N. Ferroni, L. Charlin, and A. Lodi. Exact combinatorial optimization with graph convolutional neural networks. Advances in neural information processing systems, 32, 2019.
- [25] S. Goel, V. Kanade, A. Klivans, and J. Thaler. Reliably learning the relu in polynomial time. In Conference on Learning Theory, pages 1004–1042. PMLR, 2017.
- [26] S. Goel, A. Klivans, and R. Meka. Learning one convolutional layer with overlapping patches. In International Conference on Machine Learning, pages 1783–1791. PMLR, 2018.
- [27] S. Goel and A. R. Klivans. Learning neural networks with two nonlinear layers in polynomial time. In Conference on Learning Theory, pages 1470–1499. PMLR, 2019.
- [28] S. Goel, A. R. Klivans, P. Manurangsi, and D. Reichman. Tight hardness results for training depth-2 ReLU networks. In 12th Innovations in Theoretical Computer Science Conference (ITCS ’21), volume 185 of LIPIcs, pages 22:1–22:14. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021.
- [29] R. Gribonval, G. Kutyniok, M. Nielsen, and F. Voigtlaender. Approximation spaces of deep neural networks. Constructive Approximation, pages 1–109, 2021.
- [30] Gurobi Optimization, LLC. Gurobi optimizer reference manual, 2021.
- [31] C. A. Haase, C. Hertrich, and G. Loho. Lower bounds on the depth of integral ReLU neural networks via lattice polytopes. In The Eleventh International Conference on Learning Representations, 2023.
- [32] B. Hanin. Universal function approximation by deep neural nets with bounded width and ReLU activations. Mathematics, 7(10):992, 2019.
- [33] B. Hanin and M. Sellke. Approximating continuous functions by ReLU nets of minimal width. arXiv:1710.11278, 2017.
- [34] H. He, H. Daume III, and J. M. Eisner. Learning to search in branch and bound algorithms. Advances in neural information processing systems, 27:3293–3301, 2014.
- [35] J. He, L. Li, J. Xu, and C. Zheng. Relu deep neural networks and linear finite elements. Journal of Computational Mathematics, 38(3):502–527, 2020.
- [36] C. Hertrich and L. Sering. ReLU neural networks of polynomial size for exact maximum flow computation. In International Conference on Integer Programming and Combinatorial Optimization, 2023.
- [37] C. Hertrich and M. Skutella. Provably good solutions to the knapsack problem via neural networks of bounded size. In AAAI Conference on Artificial Intelligence, 2021.
- [38] J.-B. Hiriart-Urruty and C. Lemaréchal. Convex analysis and minimization algorithms I, volume 305 of Grundlehren der mathematischen Wissenschaften. Springer-Verlag, Berlin, 1993.
- [39] J.-B. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms II, volume 306 of Grundlehren der mathematischen Wissenschaften. Springer-Verlag, Berlin, 1993.
- [40] K. Hornik. Approximation capabilities of multilayer feedforward networks. Neural networks, 4(2):251–257, 1991.
- [41] M. Joswig. Essentials of tropical combinatorics. Graduate Studies in Mathematics. American Mathematical Society, Providence, RI, 2022. To appear.
- [42] S. Khalife and A. Basu. Neural networks with linear threshold activations: structure and algorithms. In International Conference on Integer Programming and Combinatorial Optimization, pages 347–360. Springer, 2022.
- [43] E. Khalil, P. Le Bodic, L. Song, G. Nemhauser, and B. Dilkina. Learning to branch in mixed integer programming. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.
- [44] E. B. Khalil, B. Dilkina, G. L. Nemhauser, S. Ahmed, and Y. Shao. Learning to run heuristics in tree search. In IJCAI, pages 659–666, 2017.
- [45] M. Kruber, M. E. Lübbecke, and A. Parmentier. Learning when to use a decomposition. In International Conference on AI and OR Techniques in Constraint Programming for Combinatorial Optimization Problems, pages 202–210. Springer, 2017.
- [46] S. Liang and R. Srikant. Why deep neural networks for function approximation? In International Conference on Learning Representations, 2017.
- [47] A. Lodi and G. Zarpellon. On learning and branching: a survey. TOP, 25(2):207–236, 2017.
- [48] Z. Lu. A note on the representation power of GHHs. arXiv:2101.11286, 2021.
- [49] D. Maclagan and B. Sturmfels. Introduction to tropical geometry, volume 161 of Graduate Studies in Mathematics. American Mathematical Soc., 2015.
- [50] P. Maragos, V. Charisopoulos, and E. Theodosis. Tropical geometry and machine learning. Proceedings of the IEEE, 109(5):728–755, 2021.
- [51] H. Mhaskar. Approximation of real functions using neural networks. In Proc. Intl. Conf. Comp. Math., New Delhi, India, World Scientific Press, pages 267–278. World Scientific, 1993.
- [52] H. N. Mhaskar. Neural networks for optimal approximation of smooth and analytic functions. Neural computation, 8(1):164–177, 1996.
- [53] H. N. Mhaskar and C. A. Micchelli. Degree of approximation by neural and translation networks with a single hidden layer. Advances in applied mathematics, 16(2):151–183, 1995.
- [54] G. Montúfar, Y. Ren, and L. Zhang. Sharp bounds for the number of regions of maxout networks and vertices of minkowski sums. SIAM Journal on Applied Algebra and Geometry, 6(4):618–649, 2022.
- [55] G. F. Montúfar, R. Pascanu, K. Cho, and Y. Bengio. On the number of linear regions of deep neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2924–2932. 2014.
- [56] A. Mukherjee and A. Basu. Lower bounds over boolean inputs for deep neural networks with ReLU gates. arXiv:1711.03073, 2017.
- [57] Q. Nguyen, M. C. Mukkamala, and M. Hein. Neural networks should be wide enough to learn disconnected decision regions. In International Conference on Machine Learning, pages 3737–3746, 2018.
- [58] G. Y. Panina and I. Streĭnu. Virtual polytopes. Uspekhi Mat. Nauk, 70(6(426)):139–202, 2015.
- [59] R. Pascanu, G. Montúfar, and Y. Bengio. On the number of inference regions of deep feed forward networks with piece-wise linear activations. In International Conference on Learning Representations, 2014.
- [60] A. Pinkus. Approximation theory of the mlp model. Acta Numerica 1999: Volume 8, 8:143–195, 1999.
- [61] M. Raghu, B. Poole, J. Kleinberg, S. Ganguli, and J. S. Dickstein. On the expressive power of deep neural networks. In International Conference on Machine Learning, pages 2847–2854, 2017.
- [62] F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological review, 65(6):386, 1958.
- [63] I. Safran and O. Shamir. Depth-width tradeoffs in approximating natural functions with neural networks. In International Conference on Machine Learning, pages 2979–2987, 2017.
- [64] A. Schrijver. Theory of Linear and Integer Programming. John Wiley and Sons, New York, 1986.
- [65] T. Serra, A. Kumar, and S. Ramalingam. Lossless compression of deep neural networks. In International Conference on Integration of Constraint Programming, Artificial Intelligence, and Operations Research, pages 417–430. Springer, 2020.
- [66] T. Serra and S. Ramalingam. Empirical bounds on linear regions of deep rectifier networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 5628–5635, 2020.
- [67] T. Serra, C. Tjandraatmadja, and S. Ramalingam. Bounding and counting linear regions of deep neural networks. In International Conference on Machine Learning, pages 4565–4573, 2018.
- [68] R. P. Stanley. An introduction to hyperplane arrangements. In Lecture notes, IAS/Park City Mathematics Institute, 2004.
- [69] M. Telgarsky. Representation benefits of deep feedforward networks. arXiv:1509.08101, 2015.
- [70] M. Telgarsky. Benefits of depth in neural networks. In Conference on Learning Theory, pages 1517–1539, 2016.
- [71] The Sage Developers. SageMath, the Sage Mathematics Software System (Version 9.0), 2020. https://www.sagemath.org.
- [72] G. Vardi, D. Reichman, T. Pitassi, and O. Shamir. Size and depth separation in approximating benign functions with neural networks. In Conference on Learning Theory, pages 4195–4223. PMLR, 2021.
- [73] S. Wang. General constructive representations for continuous piecewise-linear functions. IEEE Transactions on Circuits and Systems I: Regular Papers, 51(9):1889–1896, 2004.
- [74] S. Wang and X. Sun. Generalization of hinging hyperplanes. IEEE Transactions on Information Theory, 51(12):4425–4431, 2005.
- [75] D. Yarotsky. Error bounds for approximations with deep relu networks. Neural Networks, 94:103–114, 2017.
- [76] L. Zhang, G. Naitzat, and L.-H. Lim. Tropical geometry of deep neural networks. In International Conference on Machine Learning, pages 5819–5827, 2018.