Metaheuristic optimization deals with optimization problems using metaheuristic algorithms. Optimization is essentially everywhere, from engineering design to economics and from holiday planning to Internet routing. As money, resources and time are always limited, the optimal utility of these available resources is crucially important.
Most real-world optimizations are highly nonlinear and multimodal, under various complex constraints. Different objectives are often conflicting. Even for a single objective, sometimes, optimal solutions may not exist at all. In general, finding an optimal solution or even sub-optimal solutions is not an easy task. This article aims to introduce the fundamentals of metaheuristic optimization, as well as some popular metaheuristic algorithms.
In the simplest sense, an optimization problem can be considered as a minimization or maximization problem. For example, the function \(f(x)=x^2\) has a minimum \(f_{\min}=0\) at \(x=0\) in the whole domain \(-\infty < x < \infty\ .\) In general, if a function is simple enough, we can use the condition \(f'(x)=0\) to determine the potential locations of optima, and use the second derivative \(f''(x)\) to verify whether the solution is a maximum or a minimum. However, for nonlinear, multimodal, multivariate functions, this is not an easy task. In addition, some functions may have discontinuities, which makes derivative information difficult to obtain. This may pose various challenges to many traditional methods such as hill-climbing.
In general, an optimization problem can be written as \[\tag{1} \textrm{minimize } \ f_1(\mathbf{x}), ..., f_i(\mathbf{x}), ..., f_I(\mathbf{x}), \quad \mathbf{x}=(x_1, ..., x_d), \]
subject to \[ h_j(\mathbf{x})=0, (j=1,2, ..., J) \] \[ g_k(\mathbf{x}) \le 0, (k=1,2,...,K), \] where \(f_1,..., f_I\) are the objectives, while \(h_j\) and \(g_k\) are the equality and inequality constraints, respectively. In the case when \(I=1\ ,\) it is called single-objective optimization. When \(I \ge 2\ ,\) it becomes a multiobjective problem whose solution strategy is different from those for a single objective. This article mainly concerns single-objective optimization problems. In general, all the functions \(f_i, h_j\) and \(g_k\) are nonlinear. In the special case when all these functions are linear, the optimization problem becomes a linear programming problem which can be solved using the standard simplex method (Dantzig 1963). Metaheuristic optimization concerns more generalized, nonlinear optimization problems. It is worth pointing out that the above minimization problem can also be formulated as a maximization problem if \(f_i\) is replaced with \(-f_i\ .\) Similarly, the inequality constraints \(g_k\) can also be written the other way around if \(g_k\) is replaced by \(-g_k\ .\)
Obviously, the simplest case of optimization is unconstrained function optimization. For example, multivariate and multimodal test functions are often used for validating new optimization algorithms, and a good example is Ackley's function (see Fig. 1), which has a global minimum \(f_{\min}=0\) at (0,0).
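For concreteness, Ackley's function can be coded in a few lines. The minimal sketch below assumes the commonly quoted parameter values \(a=20\ ,\) \(b=0.2\) and \(c=2\pi\) (not listed in this article), and works for any number of dimensions.

```python
import numpy as np

def ackley(x, a=20.0, b=0.2, c=2.0 * np.pi):
    """Ackley's test function; global minimum f = 0 at the origin."""
    x = np.asarray(x, dtype=float)
    d = x.size
    sum_sq = np.sum(x ** 2)
    sum_cos = np.sum(np.cos(c * x))
    return (-a * np.exp(-b * np.sqrt(sum_sq / d))
            - np.exp(sum_cos / d) + a + np.e)

print(ackley([0.0, 0.0]))   # approximately 0 (up to floating-point error)
print(ackley([1.0, 1.0]))   # a larger positive value away from the optimum
```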
To solve the optimization problem (1), efficient search or optimization algorithms are needed. There are many optimization algorithms which can be classified in many ways, depending on the focus and characteristics.
If the derivative or gradient of a function is the focus, optimization can be classified into gradient-based algorithms and derivative-free or gradient-free algorithms. Gradient-based algorithms such as hill-climbing use derivative information, and they are often very efficient. Derivative-free algorithms do not use any derivative information but the values of the function itself. Some functions may have discontinuities or it may be expensive to calculate derivatives accurately, and thus derivative-free algorithms such as Nelder-Mead downhill simplex become very useful.
From a different perspective, optimization algorithms can be classified into trajectory-based and population-based. A trajectory-based algorithm typically uses a single agent or one solution at a time, which will trace out a path as the iterations continue. Hill-climbing is trajectory-based, and it links the starting point with the final point via a piecewise zigzag path. Another important example is simulated annealing, which is a widely used metaheuristic algorithm. On the other hand, population-based algorithms such as particle swarm optimization (PSO) use multiple agents which interact and trace out multiple paths (Kennedy and Eberhart, 1995).
Optimization algorithms can also be classified as deterministic or stochastic. If an algorithm works in a mechanical deterministic manner without any random nature, it is called deterministic. For such an algorithm, it will reach the same final solution if we start with the same initial point. Hill-climbing and downhill simplex are good examples of deterministic algorithms. On the other hand, if there is some randomness in the algorithm, the algorithm will usually reach a different point every time the algorithm is executed, even though the same initial point is used. Genetic algorithms and PSO are good examples of stochastic algorithms.
Search capability can also be a basis for algorithm classification. In this case, algorithms can be divided into local and global search algorithms. Local search algorithms typically converge towards a local optimum, not necessarily (often not) the global optimum, and such an algorithm is often deterministic and has no ability to escape from local optima. Simple hill-climbing is such an example. On the other hand, for global optimization, local search algorithms are not suitable, and global search algorithms should be used. Modern metaheuristic algorithms in most cases tend to be suitable for global optimization, though not always successful or efficient. A simple strategy such as hill-climbing with random restarts can turn a local search algorithm into an algorithm with global search capability. In essence, randomization is an efficient component for global search algorithms.
Obviously, an algorithm may not fit exactly into any single category. It can be of a so-called mixed type, or hybrid, which combines deterministic components with randomness, or combines one algorithm with another so as to produce a more efficient algorithm.
Algorithms with stochastic components were often referred to as heuristic in the past, though the recent literature tends to refer to them as metaheuristics. We will follow Glover's convention and call all modern nature-inspired algorithms metaheuristics (Glover 1986, Glover and Kochenberger 2003). Loosely speaking, heuristic means to find or to discover by trial and error. Here meta- means beyond or higher level, and metaheuristics generally perform better than simple heuristics. The word "metaheuristic" was coined by Fred Glover in his seminal paper (Glover 1986), and a metaheuristic can be considered as a "master strategy that guides and modifies other heuristics to produce solutions beyond those that are normally generated in a quest for local optimality" (Glover and Laguna 1997). In addition, all metaheuristic algorithms use a certain tradeoff of randomization and local search. Quality solutions to difficult optimization problems can be found in a reasonable amount of time, but there is no guarantee that optimal solutions can be reached. It is hoped that these algorithms work most of the time, but not all the time. Almost all metaheuristic algorithms tend to be suitable for global optimization. An excellent review was carried out by Voss (2001).
Two major components of any metaheuristic algorithm are intensification and diversification, or exploitation and exploration (Blum and Roli, 2003). Diversification means generating diverse solutions so as to explore the search space on a global scale, while intensification means focusing the search in a local region, knowing that a current good solution has been found in this region. A good balance between intensification and diversification should be found during the selection of the best solutions to improve the rate of algorithm convergence. The selection of the best ensures that solutions will converge to the optimum, while diversification via randomization allows the search to escape from local optima and, at the same time, increases the diversity of solutions. A good combination of these two major components will usually ensure that global optimality is achievable.
Throughout human history, many problem-solving processes have tended to be heuristic; however, heuristics as a scientific method for optimization is a modern phenomenon. From the 1940s to the 1960s, heuristic methods were used in various applications, but the first landmark came with the advent of evolutionary algorithms. In 1963 Ingo Rechenberg and Hans-Paul Schwefel, then both at the Technical University of Berlin, developed evolution strategies, while L. J. Fogel et al. developed evolutionary programming in 1966. Genetic algorithms were developed by J. Holland in the 1960s and 1970s, though his seminal book on genetic algorithms was published in 1975 (Holland 1975).
The 1980s and 1990s were the most exciting time for metaheuristic algorithms. One big step was the development of simulated annealing (SA) in 1983, an optimization technique pioneered by S. Kirkpatrick et al. and inspired by the annealing process of metals. Another important step was the development of artificial immune systems by Farmer et al. in 1986.
The use of memory in metaheuristics was pioneered by Glover in Tabu search in the 1980s, though his seminal book on Tabu Search was published later in 1997 (Glover and Laguna 1997). Search moves are recorded in a Tabu list, and future moves should try to avoid revisiting previous solutions.
In 1992, Marco Dorigo finished his PhD thesis on optimization and natural algorithms (Dorigo 1992), in which he described his innovative work on ant colony optimization (ACO). This search technique was inspired by the swarm intelligence of social ants using pheromone as a chemical messenger. Then, in 1992, John R. Koza of Stanford University published a book on genetic programming which laid the foundation of a whole new area of machine learning, revolutionizing computer programming. Slightly later in 1995, another significant advance was the development of particle swarm optimization by James Kennedy and Russell C. Eberhart. Around 1996 and later in 1997, R. Storn and K. Price developed their vector-based evolutionary algorithm, called differential evolution (DE). This algorithm proved to be more efficient than genetic algorithms in many applications.
At the turn of the 21st century, things became even more exciting. Firstly, Zong Woo Geem et al. developed the harmony search (HS) algorithm in 2001, which is a music-inspired algorithm. Around 2002, a bacteria foraging algorithm was developed by Passino.
In 2004, S. Nakrani and C. Tovey proposed the honey bee algorithm and its application for optimizing Internet hosting centers, which was followed by the development of a novel bee algorithm by D. T. Pham et al. in 2005 and the artificial bee colony (ABC) by D. Karaboga in 2005. In 2008, the author of this article developed the firefly algorithm (FA). In 2009, Xin-She Yang and Suash Deb introduced an efficient cuckoo search (CS) algorithm (Yang and Deb, 2009; Yang and Deb, 2010), and it has been demonstrated that CS is far more effective than most existing metaheuristic algorithms including particle swarm optimization.
Many metaheuristic algorithms exist in literature and some algorithms have been discussed in detail in other Scholarpedia articles such as swarm intelligence, ant colony optimization and particle swarm optimization. This article will briefly introduce the most popular metaheuristic algorithms for optimization.
Simulated annealing is based on the metal annealing process (Kirkpatrick et al., 1983). Unlike gradient-based methods and other deterministic search methods, simulated annealing has the main advantage of being able to avoid becoming trapped in local optima.
Metaphorically speaking, this is equivalent to dropping some bouncing balls over a landscape, and as the balls bounce and lose energy, they settle down in some local minima. If the balls are allowed to bounce long enough and to lose energy slowly enough, some of the balls will eventually fall into the globally lowest locations, hence the global minimum will be reached. Essentially, simulated annealing is a search along a Markov chain, which converges under appropriate conditions.
In simulated annealing, the actual search moves trace a piecewise path. With each move, an acceptance probability is evaluated, which not only accepts changes that improve the objective function (for a minimization problem, a lower objective value), but also keeps some changes that do not improve the objective (a larger objective value). The acceptance probability \(p\) is given by \[\tag{2} p=\exp\Big[-\frac{\Delta E}{k_B T}\Big], \]
where \(k_B\) is Boltzmann's constant, \(T\) is the temperature for controlling the annealing process and \(\Delta E\) is the change in energy. This transition probability is based on the Boltzmann distribution in statistical mechanics. The change in the objective function, \(\Delta f\ ,\) can be related to \(\Delta E\) in the following way \[ \Delta E = \gamma \Delta f, \] where \(\gamma\) is a real constant (typically, \(\gamma=1\) for simplicity). From (2), it is clear that \(p \rightarrow 0\) as \(T \rightarrow 0\ .\) That is, the system virtually becomes a hill-climbing method at very low temperatures. The way in which the temperature is varied essentially controls how the algorithm behaves and how efficient it is.
There are many ways to control the cooling rate or the temperature decrease. Two commonly used annealing schedules (or cooling schedules) are linear and geometric. For a linear cooling schedule, we have \[ T=T_0-\beta t, \] where \(T_0\) is the initial temperature and \(t\) is a pseudo time that replaces the iterations. \(\beta\) is the cooling rate and should be chosen in such a way that \(T \rightarrow T_f\) when \(t \rightarrow t_{f}\) (or the maximum number \(N\) of iterations). This usually gives \(\beta=(T_0-T_f)/t_f\ .\)
On the other hand, a geometric cooling schedule essentially decreases the temperature by a cooling factor \(0<\alpha<1\) so that \(T\) is replaced by \(\alpha T\) or \[ T(t)=T_0 \alpha^t, \quad t=1,2,...,t_f. \] The advantage of the second method is that \(T \rightarrow 0\) when \(t \rightarrow \infty\ ,\) and thus there is no need to specify the maximum number of iterations. For this reason, the geometric cooling schedule is more widely used. The cooling process should be slow enough to allow the system to stabilize. In practice, \(\alpha \in [0.7, 0.99]\) is commonly used.
In addition, for a given temperature, multiple evaluations of the objective function are needed. If there are too few evaluations, there is a danger that the system will not stabilize and subsequently will not converge to its global optimality. Conversely, too many evaluations are time-consuming and the system will usually converge too slowly, as the number of iterations to achieve stability might be exponential with the problem size. Therefore, there is a fine balance between the number of evaluations and solution quality. We can either perform many evaluations at a few temperature levels or do only a few evaluations at many temperature levels. There are two major ways to set the number of iterations: fixed or variable. The first uses a fixed number of iterations at each temperature, while the second intends to increase the number of iterations at lower temperatures to fully explore local minima.
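The following is a minimal Python sketch of simulated annealing with a geometric cooling schedule, using Gaussian random moves and the Boltzmann-type acceptance probability of equation (2) with \(k_B=\gamma=1\ .\) The step size, the cooling factor and the other parameter values are illustrative assumptions rather than recommended settings.

```python
import numpy as np

def simulated_annealing(f, x0, T0=1.0, alpha=0.95, n_temps=100,
                        moves_per_temp=50, step=0.5, seed=0):
    """Minimal simulated annealing with a geometric cooling schedule."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    fx = f(x)
    best_x, best_f = x.copy(), fx
    T = T0
    for _ in range(n_temps):
        for _ in range(moves_per_temp):
            x_new = x + step * rng.standard_normal(x.size)   # random move
            f_new = f(x_new)
            delta = f_new - fx
            # accept improvements; accept worse moves with probability exp(-delta/T)
            if delta <= 0 or rng.random() < np.exp(-delta / T):
                x, fx = x_new, f_new
                if fx < best_f:
                    best_x, best_f = x.copy(), fx
        T *= alpha          # geometric cooling: T is replaced by alpha*T
    return best_x, best_f

# Example: minimize the sphere function f(x) = x1^2 + x2^2
xb, fb = simulated_annealing(lambda x: np.sum(x ** 2), x0=[2.0, -3.0])
print(xb, fb)
```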
Genetic algorithms (GAs) are probably the most popular evolutionary algorithms, with a diverse range of applications. A large number of well-known optimization problems have been solved by genetic algorithms. In addition, genetic algorithms are population-based, and many modern evolutionary algorithms are directly based on, or have strong similarities to, genetic algorithms.
Genetic algorithms, developed by John Holland and his collaborators in the 1960s and 1970s, are a model or abstraction of biological evolution based on Charles Darwin's theory of natural selection. Holland was the first to use crossover, recombination, mutation and selection in the study of adaptive and artificial systems. These genetic operators are the essential components of genetic algorithms as a problem-solving strategy. Since then, many variants of genetic algorithms have been developed and applied to a wide range of optimization problems, from graph colouring to pattern recognition, from discrete systems (such as the travelling salesman problem) to continuous systems (e.g., the efficient design of airfoils in aerospace engineering), and from financial markets to multiobjective engineering optimization.
The essence of genetic algorithms involves the encoding of solutions as arrays of bits or character strings (chromosomes), the manipulation of these strings by genetic operators and a selection based on their fitness to find a solution to a given problem. This is often done through the following procedure: 1) definition of an encoding scheme; 2) definition of a fitness function or selection criterion; 3) creation of a population of chromosomes; 4) evaluation of the fitness of every chromosome in the population; 5) creation of a new population by performing fitness-proportionate selection, crossover and mutation; 6) replacement of the old population by the new one. Steps 4), 5) and 6) are then repeated for a number of generations. At the end, the best chromosome is decoded to obtain a solution to the problem.
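As an illustration of steps 1) to 6), the sketch below is a minimal binary-coded genetic algorithm with fitness-proportionate (roulette-wheel) selection, one-point crossover and bit-flip mutation. The toy fitness function (counting 1-bits) and the parameter values are assumptions chosen only to keep the example short.

```python
import numpy as np

def genetic_algorithm(fitness, n_bits=20, pop_size=30, p_c=0.9, p_m=0.02,
                      n_generations=100, seed=0):
    """Minimal binary-coded GA with roulette-wheel selection,
    one-point crossover and bit-flip mutation."""
    rng = np.random.default_rng(seed)
    pop = rng.integers(0, 2, size=(pop_size, n_bits))        # step 3: population
    for _ in range(n_generations):
        fit = np.array([fitness(ind) for ind in pop])        # step 4: evaluation
        probs = fit / fit.sum()                              # fitness-proportionate
        new_pop = []
        while len(new_pop) < pop_size:
            i, j = rng.choice(pop_size, size=2, p=probs)     # step 5: selection
            p1, p2 = pop[i].copy(), pop[j].copy()
            if rng.random() < p_c:                           # one-point crossover
                cut = rng.integers(1, n_bits)
                p1[cut:], p2[cut:] = p2[cut:].copy(), p1[cut:].copy()
            for child in (p1, p2):
                flip = rng.random(n_bits) < p_m              # bit-flip mutation
                child[flip] = 1 - child[flip]
                new_pop.append(child)
        pop = np.array(new_pop[:pop_size])                   # step 6: replacement
    fit = np.array([fitness(ind) for ind in pop])
    return pop[np.argmax(fit)], fit.max()

# Toy example: maximize the number of 1-bits in the chromosome
best, best_fit = genetic_algorithm(lambda bits: bits.sum() + 1)  # +1 keeps fitness positive
print(best, best_fit)
```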
Each iteration, which leads to a new population, is called a generation. Fixed-length character strings are used in most genetic algorithms at each generation, although there is substantial research on variable-length strings and coding structures. The coding of the objective function is usually in the form of binary arrays, or real-valued arrays in adaptive genetic algorithms. An important issue is the formulation or choice of an appropriate fitness function that determines the selection criterion in a particular problem. For the minimization of a function using genetic algorithms, one simple way of constructing a fitness function is to use the simplest form \(F=A-y\) with \(A\) being a large constant (though \(A=0\) will do if the fitness does not need to be non-negative) and \(y=f(\mathbf{x})\ .\) Thus the objective is to maximize the fitness function and subsequently to minimize the objective function \(f(\mathbf{x})\ .\) However, there are many different ways of defining a fitness function. For example, we can assign an individual fitness relative to the whole population \[ F(x_i)=\frac{f(\xi_i)}{\sum_{i=1}^N f(\xi_i)}, \] where \(\xi_i\) is the phenotypic value of individual \(i\ ,\) and \(N\) is the population size. The form of the fitness function should make sure that chromosomes with higher fitness are selected more often than those with lower fitness. Poor fitness functions may result in incorrect or meaningless solutions.
Another important issue is the choice of various parameter values. The crossover probability \(p_c\) is usually very high, typically in the interval \([0.7, 1.0]\ .\) On the other hand, the mutation probability \(p_m\) is usually small (typically, in the interval \([0.001, 0.05]\)). If \(p_c\) is too small, then crossover is applied sparsely, which is not desirable. If the mutation probability is too high, the algorithm could still `jump around' even if the optimal solution is close.
Selection of the fittest is carried out according to the fitness of the chromosomes. Sometimes, in order to make sure that the best chromosomes remain in the population, they are transferred to the next generation without much change, which is called elitism. A proper criterion for selecting the best chromosomes is also important, because it determines how chromosomes with higher fitness are preserved and transferred to the next generation. This is often carried out in association with a certain form of elitism. The basic form is to select the best chromosome (in each generation) which will be carried over to the new generation without being modified by the genetic operators. This ensures that a good solution is attained more quickly.
Other issues include multiple sites for mutation and the use of various population sizes. Mutation at a single site is not very efficient, so mutation at multiple sites typically speeds up the evolution of the search. On the other hand, too many mutants will make it difficult for the system to converge, or may even lead the system towards wrong solutions. In real ecological systems, if the mutation rate is too high under high selection pressure, then the whole population might become extinct.
In addition, the choice of the right population size is also very important. If the population size is too small, there will not be enough evolution, and there is a risk for the whole population to converge prematurely. In the real world, ecological theory suggests that a species with a small population is in real danger of extinction. In a small population, if a chromosome with a fitness substantially larger than the fitness of the other chromosomes in the population appears too early, it may produce enough offspring to overwhelm the whole (small) population. This will eventually drive the system to a local optimum (not the global optimum). On the other hand, if the population is too large, more evaluations of the objective function are needed, which will require extensive computing time.
As a simple example, an initial population is generated (see Fig. 2), and the final locations of the population aggregate towards the optimal solution (see Fig. 3).
Differential evolution (DE) was developed by R. Storn and K. Price in their seminal papers of 1996 and 1997. It is a vector-based evolutionary algorithm, and can be considered as a further development of genetic algorithms. Unlike genetic algorithms, differential evolution carries out operations over each component (or each dimension of the solution). Almost everything is done in terms of vectors, and DE can be viewed as a self-organizing search, directed towards the optimum.
For a \(d\)-dimensional optimization problem with \(d\) parameters, a population of \(n\) solution vectors \(\mathbf{x}_i \; (i=1,2,...,n)\) is initially generated. For each solution \(\mathbf{x}_i\) at any generation \(t\ ,\) we use the conventional notation \[ \mathbf{x}_i^t=(x_{1,i}^t, x_{2,i}^t, ..., x_{d,i}^t). \] This vector can be considered as a chromosome.
Differential evolution consists of three main steps: mutation, crossover and selection.
Mutation is done as follows. For each vector \(\mathbf{x}_i\) at any time or generation \(t\ ,\) we first randomly choose three distinct vectors \(\mathbf{x}_p\ ,\) \(\mathbf{x}_q\) and \(\mathbf{x}_r\ ,\) and then generate a so-called donor vector by the following mutation scheme \[\tag{3} \mathbf{v}_i^{t+1} =\mathbf{x}_p^t + F (\mathbf{x}_q^t-\mathbf{x}_r^t), \]
where \(F \in [0,2]\) is a parameter, often referred to as the differential weight. The minimum population size is \(n \ge 4\ .\) In principle, \(F \in [0,2]\ ,\) but in practice, a scheme with \(F \in [0,1]\) is more efficient and stable. The perturbation \(\mathbf{w}=F (\mathbf{x}_q-\mathbf{x}_r)\) to the vector \(\mathbf{x}_p\) is used to generate a donor vector \(\mathbf{v}_i\ ,\) and this perturbation is directed and self-organized.
The crossover is controlled by a crossover probability \(C_r \in [0,1]\ .\) The actual crossover can be carried out in two ways: binomial and exponential. The binomial scheme performs crossover on each of the \(d\) components or variables/parameters. By generating a uniformly distributed random number \(r_j \in [0,1]\) for each component, the \(j\)th component of \(\mathbf{x}_i\) is set as follows \[ x_{j,i}^{t+1} = v_{j,i}^{t+1} \quad \textrm{if } \quad r_j \le C_r, \quad (j=1,2,...,d), \] otherwise, it remains unchanged. In this way, it is decided randomly whether or not to replace each component with the corresponding component of the donor vector.
The basic DE/Rand/1/Bin scheme is given by equation (3), but at least 10 different schemes have been formulated. The interested reader is referred to Price et al. (2005) for details.
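A minimal sketch of the DE/Rand/1/Bin scheme is given below: mutation follows equation (3), crossover is binomial with rate \(C_r\ ,\) and selection is greedy (the trial vector replaces \(\mathbf{x}_i\) only if it is not worse). Clipping to the search bounds and forcing at least one component to come from the donor vector are common implementation choices assumed here.

```python
import numpy as np

def differential_evolution(f, bounds, n_pop=20, F=0.8, C_r=0.9,
                           n_generations=200, seed=0):
    """Minimal DE/Rand/1/Bin: mutation (3), binomial crossover, greedy selection."""
    rng = np.random.default_rng(seed)
    bounds = np.asarray(bounds, dtype=float)
    d = len(bounds)
    pop = rng.uniform(bounds[:, 0], bounds[:, 1], size=(n_pop, d))
    cost = np.array([f(x) for x in pop])
    for _ in range(n_generations):
        for i in range(n_pop):
            # choose three distinct vectors, all different from i
            p, q, r = rng.choice([k for k in range(n_pop) if k != i],
                                 size=3, replace=False)
            donor = pop[p] + F * (pop[q] - pop[r])            # mutation, eq. (3)
            trial = pop[i].copy()
            mask = rng.random(d) <= C_r                       # binomial crossover
            mask[rng.integers(d)] = True                      # change at least one component
            trial[mask] = donor[mask]
            trial = np.clip(trial, bounds[:, 0], bounds[:, 1])
            f_trial = f(trial)
            if f_trial <= cost[i]:                            # greedy selection
                pop[i], cost[i] = trial, f_trial
    best = np.argmin(cost)
    return pop[best], cost[best]

# Example: minimize the sphere function in 3 dimensions
xb, fb = differential_evolution(lambda x: np.sum(x ** 2), bounds=[(-5, 5)] * 3)
print(xb, fb)
```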
Ant colony optimization was pioneered by Marco Dorigo in 1992 and is based on the foraging behaviour of social ants. Many insects such as ants use pheromone as a chemical messenger. Ants are social insects and live together in organized colonies consisting of approximately 2 to 25 million individuals. When foraging, a swarm of ants or mobile agents interact or communicate in their local environment. Each ant lays scent chemicals or pheromone to communicate with others. Each ant is also able to follow the route marked with pheromone laid by other ants. When an ant finds a food source, it will mark it with pheromone and also mark the trail to and from it.
However, the pheromone concentration \(\phi\) decays or evaporates at a constant rate \(\gamma\ .\) That is, \[ \phi(t)=\phi_0 \exp [-\gamma t], \] where \(\phi_0\) is the initial concentration at \(t=0\ .\) Here the evaporation is important, as it ensures the possibility of convergence and self-organization.
From the initially random foraging routes, the pheromone concentration varies, and the ants follow the routes with higher pheromone concentration. In turn, the pheromone is enhanced by the increasing number of ants. As more and more ants follow the same route, it becomes the favored path. Thus, some favorite routes emerge, often the shortest or most efficient ones. This is actually a positive feedback mechanism. As the system evolves, it converges to a self-organized state, which is the essence of any ant algorithm.
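Ant colony optimization is usually applied to combinatorial problems such as the travelling salesman problem. The sketch below is a minimal, generic variant (not Dorigo's original Ant System in every detail): tours are constructed probabilistically from pheromone and inverse-distance information, pheromone then evaporates at rate \(\rho\) and is deposited in proportion to tour quality. All parameter values and the tiny test instance are illustrative assumptions.

```python
import numpy as np

def ant_colony_tsp(dist, n_ants=20, n_iterations=100, alpha=1.0, beta=2.0,
                   rho=0.5, Q=1.0, seed=0):
    """Minimal ACO for a symmetric TSP: probabilistic tour construction,
    pheromone evaporation and pheromone deposition."""
    rng = np.random.default_rng(seed)
    n = len(dist)
    pheromone = np.ones((n, n))
    eta = 1.0 / (dist + np.eye(n))           # heuristic desirability (avoid divide-by-zero)
    best_tour, best_len = None, np.inf
    for _ in range(n_iterations):
        tours = []
        for _ in range(n_ants):
            tour = [rng.integers(n)]
            while len(tour) < n:
                i = tour[-1]
                unvisited = [j for j in range(n) if j not in tour]
                weights = np.array([pheromone[i, j] ** alpha * eta[i, j] ** beta
                                    for j in unvisited])
                tour.append(rng.choice(unvisited, p=weights / weights.sum()))
            tours.append(tour)
        pheromone *= (1.0 - rho)             # evaporation
        for tour in tours:
            length = sum(dist[tour[k], tour[(k + 1) % n]] for k in range(n))
            if length < best_len:
                best_tour, best_len = tour, length
            for k in range(n):               # deposit pheromone along the tour
                a, b = tour[k], tour[(k + 1) % n]
                pheromone[a, b] += Q / length
                pheromone[b, a] += Q / length
    return best_tour, best_len

# Tiny example: 5 cities on a line
pts = np.array([[0, 0], [1, 0], [2, 0], [3, 0], [4, 0]], dtype=float)
dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
print(ant_colony_tsp(dist))
```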
Bee algorithms are another class of metaheuristic algorithms, inspired by the foraging behaviour of bees. A few variants exist in the literature, including honeybee algorithm, artificial bee colony, bee algorithm, virtual bee algorithm, and honeybee mating algorithms.
Honey bees live in a colony and they forage and store honey in their constructed colony. Honey bees can communicate by pheromone and `waggle dance'. For example, an alarmed bee may release a chemical message (pheromone) to stimulate an attack response in other bees. Furthermore, when bees find a good food source and bring some nectar back to the hive, they will communicate the location of the food source by performing the so-called waggle dance as a signaling system. Such signaling dances vary from species to species; however, they all aim at recruiting more bees by using directional dancing with varying strength so as to communicate the direction and distance of the food source.
For multiple food sources such as flower patches, studies show that a bee colony seems to be able to allocate forager bees among different flower patches so as to maximize their total nectar intake (Moritz and Southwick 1992).
It seems that the Honey Bee Algorithm (HBA) was first formulated around 2004 by Craig A. Tovey at Georgia Tech, in collaboration with Sunil Nakrani, then at Oxford University, as a method to allocate computers among different clients and web-hosting servers. Later in 2004 and in early 2005, Xin-She Yang at Cambridge University developed a Virtual Bee Algorithm (VBA) to solve continuous optimization problems. At about the same time, Pham et al. (2005) developed the bee algorithms. Slightly later in 2005, Haddad, Afshar and their colleagues presented a honey-bee mating optimization (HBMO) algorithm, which was subsequently applied to reservoir modelling and clustering. Around the same time, D. Karaboga in Turkey developed an Artificial Bee Colony (ABC) algorithm for numerical function optimization.
Ant and bee algorithms are more suitable for discrete and combinatorial optimization and have been applied in a wide range of applications.
Particle swarm optimization (PSO) was developed by Kennedy and Eberhart in 1995, based on swarm behaviour observed in nature such as fish and bird schooling. Since then, PSO has generated a lot of attention, and now forms an exciting, ever-expanding research subject in the field of swarm intelligence. PSO has been applied to almost every area in optimization, computational intelligence, and design/scheduling applications. There are at least two dozen PSO variants, as well as hybrid algorithms obtained by combining PSO with other existing algorithms, which are also increasingly popular.
PSO searches the space of an objective function by adjusting the trajectories of individual agents, called particles. Each particle traces a piecewise path which can be modelled as a time-dependent positional vector. The movement of a swarming particle consists of two major components: a stochastic component and a deterministic component. Each particle is attracted toward the position of the current global best \(\mathbf{g}^*\) and its own best known location \(\mathbf{x}_i^*\ ,\) while exhibiting at the same time a tendency to move randomly.
When a particle finds a location that is better than any location it has found previously, it updates this location as the new current best for particle \(i\ .\) There is thus a current best for each particle at any time \(t\) during the iterations. The aim is to find the global best among all the current best solutions until the objective no longer improves or until a certain number of iterations is reached.
Let \(\mathbf{x}_i\) and \(\mathbf{v}_i\) be the position vector and velocity, respectively, of particle \(i\ .\) The new velocity vector is determined by the following formula \[\tag{4} \mathbf{v}_i^{t+1}= \mathbf{v}_i^t + \alpha \mathbf{\epsilon}_1 [\mathbf{g}^*-\mathbf{x}_i^t] + \beta \mathbf{\epsilon}_2 [\mathbf{x}_i^*-\mathbf{x}_i^t]. \]
where \(\mathbf{\epsilon}_1\) and \(\mathbf{\epsilon}_2\) are two random vectors whose entries take values between 0 and 1. The parameters \(\alpha\) and \(\beta\) are the learning parameters or acceleration constants, which typically take values of about \(\alpha \approx \beta \approx 2\ .\)
The initial locations of all particles should be distributed relatively uniformly so that they can sample over most regions, which is especially important for multimodal problems. The initial velocity of a particle can be set to zero, that is, \(\mathbf{v}_i^{t=0}\!\!=0\ .\) The new position can then be updated by the formula \[ \mathbf{x}_i^{t+1}=\mathbf{x}_i^t+\mathbf{v}_i^{t+1}. \] Although \(\mathbf{v}_i\) can take any value, its magnitude is usually bounded by some maximum value \(v_{\max}\ .\)
There are many variants which extend the standard PSO algorithm, and the most noticeable improvement is probably to use an inertia function \(\theta (t)\) so that \(\mathbf{v}_i^t\) is replaced by \(\theta(t) \mathbf{v}_i^t\) where \(\theta\) takes a value between 0 and 1. In the simplest case, the inertia function can be taken as a constant, typically \(\theta \in [0.5, 0.9]\ .\) This is equivalent to introducing a virtual mass to stabilize the motion of the particles, and thus the algorithm is expected to converge more quickly.
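A minimal PSO sketch with a constant inertia weight \(\theta\) is given below; it implements equation (4), together with the position update and the zero initial velocities mentioned above. The parameter values and the clipping of positions to the search bounds are assumptions made for illustration.

```python
import numpy as np

def pso(f, bounds, n_particles=25, n_iterations=200,
        alpha=2.0, beta=2.0, theta=0.7, seed=0):
    """Minimal PSO with a constant inertia weight theta, following eq. (4)."""
    rng = np.random.default_rng(seed)
    bounds = np.asarray(bounds, dtype=float)
    d = len(bounds)
    x = rng.uniform(bounds[:, 0], bounds[:, 1], size=(n_particles, d))
    v = np.zeros((n_particles, d))                 # initial velocities set to zero
    p_best = x.copy()                              # each particle's best known location
    p_best_f = np.array([f(xi) for xi in x])
    g_best = p_best[np.argmin(p_best_f)].copy()    # current global best
    for _ in range(n_iterations):
        eps1 = rng.random((n_particles, d))
        eps2 = rng.random((n_particles, d))
        v = (theta * v
             + alpha * eps1 * (g_best - x)
             + beta * eps2 * (p_best - x))
        x = np.clip(x + v, bounds[:, 0], bounds[:, 1])
        fx = np.array([f(xi) for xi in x])
        improved = fx < p_best_f
        p_best[improved], p_best_f[improved] = x[improved], fx[improved]
        g_best = p_best[np.argmin(p_best_f)].copy()
    return g_best, p_best_f.min()

# Example: minimize the sphere function in 2 dimensions
xb, fb = pso(lambda x: np.sum(x ** 2), bounds=[(-5, 5), (-5, 5)])
print(xb, fb)
```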
As the iterations proceed, the particle system swarms and may converge towards a global optimum. For details, please refer to the Scholarpedia article on particle swarm optimization. As a simple example, an animation of particles is shown in Fig. 4 where all particles move towards the global optimum.
Tabu search was developed by Fred Glover in the 1970s, but his seminal book was published much later, in 1997. Tabu search explicitly uses memory, and the search history is a major component of the method. As most algorithms are memoryless or use only the results of the last one or two steps, it is initially difficult to see the advantage of using the search history. The subtleties of memory and history could introduce too many degrees of freedom, and a mathematical analysis of the algorithm's behaviour becomes intractable. However, Tabu search remains one of the most successful and widely used metaheuristics in optimization.
In essence, Tabu search can be considered as an intensive local search, and the appropriate use of search history avoids revisiting local solutions by recording recently tried solutions in tabu lists. Over a large number of iterations, these tabu lists could save a significant amount of computing time, leading to improvements in search efficiency. For example, studies show that the use of tabu lists with integer programming can reduce the computing effort by at least two orders of magnitude for a given problem, compared with standard integer programming (Glover 1986, Glover and Laguna 1997). Many hybrid algorithms have been developed by combining Tabu search with other metaheuristics.
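Tabu search is usually presented for combinatorial problems, but the idea of a bounded memory of recently visited solutions can be sketched on a continuous problem as well. The sketch below is illustrative only (not Glover's original formulation): candidate neighbours are generated randomly, rounded coordinates of visited points are stored in a fixed-length tabu list, and the best non-tabu neighbour is accepted even if it is worse than the current solution.

```python
import numpy as np
from collections import deque

def tabu_search(f, x0, step=0.1, tabu_len=15, n_iterations=200, seed=0):
    """Minimal tabu-style search: move to the best non-tabu neighbour and
    record visited points in a bounded tabu list."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    best_x, best_f = x.copy(), f(x)
    tabu = deque(maxlen=tabu_len)                  # fixed-length memory of recent moves
    tabu.append(tuple(np.round(x, 3)))
    for _ in range(n_iterations):
        # generate candidate neighbours of the current solution
        neighbours = [x + step * rng.standard_normal(x.size) for _ in range(10)]
        neighbours = [n for n in neighbours if tuple(np.round(n, 3)) not in tabu]
        if not neighbours:
            continue
        # move to the best neighbour even if it is worse (helps escape local optima)
        x = min(neighbours, key=f)
        tabu.append(tuple(np.round(x, 3)))
        fx = f(x)
        if fx < best_f:
            best_x, best_f = x.copy(), fx
    return best_x, best_f

xb, fb = tabu_search(lambda x: np.sum(x ** 2), x0=[3.0, -2.0])
print(xb, fb)
```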
Harmony Search (HS) is a relatively new heuristic optimization algorithm first developed by Z. W. Geem et al. in 2001. Harmony search is related to the improvisation process of a musician. When a musician is improvising, he or she has three possible choices: (1) play any famous piece of music (a series of pitches in harmony) exactly from his or her memory; (2) play something similar to a known piece (thus adjusting the pitch slightly); or (3) compose new or random notes. If we formalize these three options for optimization, we have three corresponding components: usage of harmony memory, pitch adjustment, and randomization.
From a Markov chain point of view, pitch adjustment is a random walk which generates a new solution from the current solution \(\mathbf{x}_{ {\rm old} }\) by \[\tag{5} \mathbf{x}_{ {\rm new}}^{t+1}=\mathbf{x}^t_{{\rm old} } + b_p \mathbf{e}_i^t , \]
where \(\mathbf{e}_i^t\) is a random number drawn from a uniform distribution \([-1,1]\) and \(b_p\) is the bandwidth, which controls the local range of pitch adjustments.
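A minimal harmony search sketch is shown below, combining the three components: harmony memory consideration (with an acceptance rate HMCR), pitch adjustment as in equation (5) (with a rate PAR and bandwidth \(b_p\)), and randomization. The rate and bandwidth values are typical choices assumed here for illustration.

```python
import numpy as np

def harmony_search(f, bounds, hm_size=20, hmcr=0.9, par=0.3, bw=0.2,
                   n_iterations=2000, seed=0):
    """Minimal harmony search: memory consideration, pitch adjustment (5),
    and randomization."""
    rng = np.random.default_rng(seed)
    bounds = np.asarray(bounds, dtype=float)
    d = len(bounds)
    hm = rng.uniform(bounds[:, 0], bounds[:, 1], size=(hm_size, d))   # harmony memory
    cost = np.array([f(x) for x in hm])
    for _ in range(n_iterations):
        new = np.empty(d)
        for j in range(d):
            if rng.random() < hmcr:                       # use harmony memory
                new[j] = hm[rng.integers(hm_size), j]
                if rng.random() < par:                    # pitch adjustment, eq. (5)
                    new[j] += bw * rng.uniform(-1.0, 1.0)
            else:                                         # compose a random note
                new[j] = rng.uniform(bounds[j, 0], bounds[j, 1])
        new = np.clip(new, bounds[:, 0], bounds[:, 1])
        f_new = f(new)
        worst = np.argmax(cost)
        if f_new < cost[worst]:                           # replace the worst harmony
            hm[worst], cost[worst] = new, f_new
    best = np.argmin(cost)
    return hm[best], cost[best]

xb, fb = harmony_search(lambda x: np.sum(x ** 2), bounds=[(-5, 5), (-5, 5)])
print(xb, fb)
```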
The Firefly Algorithm (FA) was developed by Xin-She Yang (Yang 2008) and is based on the flashing patterns and behaviour of fireflies. In essence, FA uses the following three idealized rules: (1) fireflies are unisex, so one firefly will be attracted to other fireflies regardless of their sex; (2) attractiveness is proportional to brightness, and both decrease as the distance between two fireflies increases, so for any two flashing fireflies the less bright one will move towards the brighter one, and if there is no brighter one, a firefly will move randomly; (3) the brightness of a firefly is determined by the landscape of the objective function.
As a firefly's attractiveness is proportional to the light intensity seen by adjacent fireflies, we can now define the variation of attractiveness \(\beta\) with the distance \(r\) by \[\tag{6} \beta = \beta_0 e^{-\gamma r^2}, \]
where \(\beta_0\) is the attractiveness at \(r=0\ .\) The movement of a firefly \(i\ ,\) attracted to another more attractive (brighter) firefly \(j\ ,\) is determined by \[\tag{7} \mathbf{x}_i^{t+1} =\mathbf{x}_i^t + \beta_0 e^{-\gamma r^2_{ij}} (\mathbf{x}_j^t-\mathbf{x}_i^t) + \alpha \; \mathbf{e}_i^t, \]
where the second term is due to the attraction. The third term is a randomization with \(\alpha\) being the randomization parameter, and \(\mathbf{e}_i^t\) is a vector of random numbers drawn from a Gaussian distribution or uniform distribution at time \(t\ .\) If \(\beta_0=0\ ,\) it becomes a simple random walk. Furthermore, the randomization \(\mathbf{e}_i^t\) can easily be extended to other distributions such as Lévy flights. A demonstrative example is shown in Fig. 5.
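The firefly moves of equations (6) and (7) can be sketched as follows for a minimization problem, where a lower objective value corresponds to a brighter firefly. The uniform randomization term and the parameter values are illustrative assumptions.

```python
import numpy as np

def firefly_algorithm(f, bounds, n_fireflies=25, beta0=1.0, gamma=1.0,
                      alpha=0.2, n_iterations=100, seed=0):
    """Minimal firefly algorithm: attractiveness (6) and the move (7),
    for minimization (lower f means brighter firefly)."""
    rng = np.random.default_rng(seed)
    bounds = np.asarray(bounds, dtype=float)
    d = len(bounds)
    x = rng.uniform(bounds[:, 0], bounds[:, 1], size=(n_fireflies, d))
    light = np.array([f(xi) for xi in x])
    for _ in range(n_iterations):
        for i in range(n_fireflies):
            for j in range(n_fireflies):
                if light[j] < light[i]:                   # j is brighter, so i moves towards j
                    r2 = np.sum((x[i] - x[j]) ** 2)
                    beta = beta0 * np.exp(-gamma * r2)    # attractiveness, eq. (6)
                    x[i] = (x[i] + beta * (x[j] - x[i])
                            + alpha * rng.uniform(-0.5, 0.5, d))   # move, eq. (7)
                    x[i] = np.clip(x[i], bounds[:, 0], bounds[:, 1])
                    light[i] = f(x[i])
    best = np.argmin(light)
    return x[best], light[best]

xb, fb = firefly_algorithm(lambda x: np.sum(x ** 2), bounds=[(-5, 5), (-5, 5)])
print(xb, fb)
```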
Cuckoo search (CS) is one of the latest nature-inspired metaheuristic algorithms, developed by Xin-She Yang and Suash Deb in 2009. CS is based on the brood parasitism of some cuckoo species (Yang and Deb 2009). In addition, this algorithm is enhanced by the so-called Lévy flights, rather than by simple isotropic random walks (Pavlyukevich 2007). Recent studies show that CS is potentially far more efficient than PSO and genetic algorithms (Yang and Deb 2010).
Cuckoos are fascinating birds, not only because of the beautiful sounds they make, but also because of their aggressive reproduction strategy. Some species, such as the ani and the guira cuckoos, lay their eggs in communal nests, though they may remove others' eggs to increase the hatching probability of their own eggs. Quite a number of species engage in obligate brood parasitism by laying their eggs in the nests of other host birds (often other species).
When generating new solutions \(\mathbf{x}^{(t+1)}\) for, say, a cuckoo \(i\ ,\) a Lévy flight is performed \[ \mathbf{x}^{(t+1)}_i=\mathbf{x}_i^{(t)} + \alpha L(s,\lambda), \] where \(\alpha>0\) is the step size which should be scaled to the problem of interest. The above equation is essentially the stochastic equation for a random walk. In general, a random walk is a Markov chain whose next status/location only depends on the current location (the first term in the above equation) and the transition probability (the second term). Here the random walk via Lévy flight is more efficient in exploring the search space, as its step length is much longer in the long run.
The Lévy flight essentially provides a random walk whose random step length \(s\) is drawn from a Lévy distribution \[ L(s,\lambda) \sim s^{-\lambda}, \quad (1 < \lambda \le 3), \] which has an infinite variance (and, for \(\lambda \le 2\ ,\) an infinite mean). Here the steps essentially form a random walk process with a power-law step-length distribution and a heavy tail. Some of the new solutions should be generated by a Lévy walk around the best solution obtained so far, in order to speed up the local search. However, a substantial fraction of the new solutions should be generated by far-field randomization, that is, their locations should be far enough from the current best solution to make sure that the system does not get trapped in a local optimum.
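A minimal cuckoo search sketch is given below. Lévy-flight steps are generated with Mantegna's algorithm (an implementation choice assumed here, not prescribed by the article), each new solution is compared with a randomly chosen nest, and a fraction \(p_a\) of the worst nests is abandoned and replaced by far-field random solutions. The parameter values are illustrative.

```python
import numpy as np
from math import gamma, pi, sin

def levy_step(d, lam=1.5, rng=None):
    """d-dimensional Levy-flight step via Mantegna's algorithm (one common choice)."""
    if rng is None:
        rng = np.random.default_rng()
    sigma = (gamma(1 + lam) * sin(pi * lam / 2)
             / (gamma((1 + lam) / 2) * lam * 2 ** ((lam - 1) / 2))) ** (1 / lam)
    u = rng.normal(0.0, sigma, d)
    v = rng.normal(0.0, 1.0, d)
    return u / np.abs(v) ** (1.0 / lam)

def cuckoo_search(f, bounds, n_nests=20, pa=0.25, alpha=0.01,
                  n_iterations=500, seed=0):
    """Minimal cuckoo search: Levy-flight moves plus abandonment of
    a fraction pa of the worst nests."""
    rng = np.random.default_rng(seed)
    bounds = np.asarray(bounds, dtype=float)
    d = len(bounds)
    nests = rng.uniform(bounds[:, 0], bounds[:, 1], size=(n_nests, d))
    cost = np.array([f(x) for x in nests])
    best = np.argmin(cost)
    for _ in range(n_iterations):
        for i in range(n_nests):
            # Levy flight around the current solution, biased by the best nest
            new = nests[i] + alpha * levy_step(d, rng=rng) * (nests[i] - nests[best])
            new = np.clip(new, bounds[:, 0], bounds[:, 1])
            f_new = f(new)
            j = rng.integers(n_nests)              # compare with a randomly chosen nest
            if f_new < cost[j]:
                nests[j], cost[j] = new, f_new
        # abandon a fraction pa of the worst nests (far-field randomization)
        n_abandon = int(pa * n_nests)
        if n_abandon > 0:
            worst = np.argsort(cost)[-n_abandon:]
            nests[worst] = rng.uniform(bounds[:, 0], bounds[:, 1], size=(n_abandon, d))
            cost[worst] = [f(x) for x in nests[worst]]
        best = np.argmin(cost)
    return nests[best], cost[best]

xb, fb = cuckoo_search(lambda x: np.sum((x - 1) ** 2), bounds=[(-5, 5), (-5, 5)])
print(xb, fb)
```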
For simplicity, a demonstrative animation is shown here (Cuckoo Search: an animation of solutions moving towards the global optimum at (1,1)).
There are many other metaheuristics that have not been covered in this article. For example, the so-called artificial immune systems are inspired by the characteristics of the immune system of mammals and use memory and learning as a novel approach to problem solving. The idea was first proposed by Farmer et al. in 1986, with some important work on immune networks by Bersini and Varela in 1990. These adaptive systems are very promising. Many variants have been developed over the last two decades, including the clonal selection algorithm, negative selection algorithm, immune networks and others.
The memetic algorithm, proposed by P. Moscato in 1989, is a multi-generation, co-evolution and self-generation algorithm, and it can be considered as a hyper-heuristic algorithm rather than a metaheuristic.
Another popular method is the cross-entropy method developed by Rubinstein in 1997. It is a generalized Monte Carlo method, based on rare event simulations. This algorithm consists of two phases: generation of random samples and parameter updates, with the aim of minimizing cross entropy.
Another important algorithm is the bacterial foraging optimization, developed by K. M. Passino around 2002, inspired by the social foraging behaviour of bacteria such as Escherichia coli.
Obviously, more and more metaheuristic algorithms will appear in the future. Interested readers can follow the latest literature, other Scholarpedia articles and research journals.
Let us use an example to illustrate how a metaheuristic works. The design of a compressional and tensional spring involves three design variables: wire diameter \(x_1\ ,\) coil diameter \(x_2\ ,\) and the length of the coil \(x_3\ .\) This optimization problem can be written as \[ \textrm{minimize }\ f(\mathbf{x})=x_1^2 x_2 (2 + x_3), \] subject to the following constraints \[ g_1(\mathbf{x})=1-\frac{x_2^3 x_3}{71785 x_1^4} \le 0, \] \[ g_2(\mathbf{x})=\frac{4 x_2^2 -x_1 x_2}{12566 (x_1^3 x_2 -x_1^4)} + \frac{1}{5108 x_1^2} -1 \le 0, \] \[ g_3(\mathbf{x}) =1- \frac{140.45 x_1}{x_2^3 x_3} \le 0, \] \[ g_4(\mathbf{x})=\frac{x_1+ x_2}{1.5}-1 \le 0. \] The bounds on the variables are \[ 0.05 \le x_1 \le 2.0, \quad 0.25 \le x_2 \le 1.3, \quad 2.0 \le x_3 \le 15.0. \]
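The objective and constraints of this spring design problem can be coded directly, and constraints are often handled with a penalty approach when using a metaheuristic. The sketch below uses a simple static penalty with an assumed coefficient \(\mu=10^6\ ;\) the resulting penalized function could then be passed to any of the algorithm sketches above (for example, the differential evolution one, with the variable bounds given above).

```python
import numpy as np

def spring_objective(x):
    """Spring weight f(x) = x1^2 * x2 * (2 + x3)."""
    x1, x2, x3 = x
    return x1 ** 2 * x2 * (2.0 + x3)

def spring_constraints(x):
    """Return the four inequality constraints g_k(x), required to satisfy g_k(x) <= 0."""
    x1, x2, x3 = x
    g1 = 1.0 - x2 ** 3 * x3 / (71785.0 * x1 ** 4)
    g2 = ((4.0 * x2 ** 2 - x1 * x2) / (12566.0 * (x1 ** 3 * x2 - x1 ** 4))
          + 1.0 / (5108.0 * x1 ** 2) - 1.0)
    g3 = 1.0 - 140.45 * x1 / (x2 ** 3 * x3)
    g4 = (x1 + x2) / 1.5 - 1.0
    return np.array([g1, g2, g3, g4])

def penalized(x, mu=1e6):
    """Static penalty: add mu * (violation)^2 for each violated constraint."""
    g = spring_constraints(x)
    return spring_objective(x) + mu * np.sum(np.maximum(g, 0.0) ** 2)

# Evaluate the reported best solution; penalized(x) is what a metaheuristic would minimize
x_star = np.array([0.051690, 0.356750, 11.28126])
print(spring_objective(x_star))     # approximately 0.012665
print(spring_constraints(x_star))   # g1 and g2 are active (close to zero); g3 and g4 are negative
```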
For a trajectory-based metaheuristic algorithm such as simulated annealing, an initial guess, say, \(\mathbf{x}_0=(1.0, 1.0, 14.0)\ ,\) is used. Then, the next move is generated and accepted or rejected depending on whether it improves the current solution, possibly with a probability as in equation (2). For a population-based metaheuristic algorithm such as PSO, a set of \(n\) vectors is generated initially. Then, the values of the objective function are compared and the current best solution is found. Iterations proceed until a certain stopping criterion is met.
The following best solution can be found easily \[ \mathbf{x}_*=(0.051690, 0.356750, 11.28126), \quad f(\mathbf{x}_*)=0.012665. \]
Metaheuristics have been used in many applications such as engineering design optimization (Glover and Kochenberger 2003, Talbi 2008, Yang 2010). It is an area of active research, and there is no doubt that more metaheuristic algorithms and new applications will emerge in the future.