Introduction to Equation Discovery - Comparing Symbolic Regression Methods#
Upto now we have seen that the climate models we developed are using physical equations that are based on our understanding of the physical processes that govern the climate. However, these equations are often complex and difficult to solve, and they can only be used to model the climate at a coarse resolution.
Equation discovery is a different approach that uses machine learning to automatically discover equations that can be used to model the climate. Specifically, equation discovery in climate modeling can be used to:
Automate the process of developing subgrid parameterizations. Subgrid parameterizations are mathematical equations that are used to represent the effects of processes that are too small to be resolved by climate models. These equations are often complex and difficult to develop, but they can be automatically discovered using equation discovery.
Identify relationships between variables that are important for understanding climate change. Machine learning algorithms can learn from large datasets of climate data to identify relationships between variables that are important for understanding climate change.
Develop models that are more interpretable and that can be used to make better decisions about how to mitigate and adapt to climate change. Machine learning algorithms can generate equations that are easier to understand than traditional physical equations. This can help scientists and policymakers to understand how climate models work and to make better decisions about how to mitigate and adapt to climate change.
In this notebook, we’ll review some common techniques for Symbolic Regression (SR), a family of methods of discovering (simple) equations that relate inputs to outputs.
In particular, we’ll cover the following methods:
Symbolic regression based on Genetic Programming
Symbolic regression based on Manually Constructed Library + Sparse Regression
Brief overview of other techniques
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pysindy as ps
from sklearn.linear_model import Lasso, LinearRegression
from sympy import Number, latex
from rvm import RVR
Data#
When explaining the methods, we’ll consider the following bivariate system:
Although the equation is relatively simple, discovering it from \(x_0\) and \(x_1\) directly is a bit tough, since the \(\sin\) term requires going at least three levels deep in term combinations, and also involves an inner multiplication by a constant (which not all symbolic regression frameworks support).
def toy_function(x):
x0 = x[:, 0]
x1 = x[:, 1]
return x0**2 - 0.5 * x1**2 + np.sin(0.5 * x0 * x1)
def contour(
f, lines=None, linewidths=6, alpha=0.9, linestyles="solid", label=None, **kw
):
grid = np.linspace(-4, 4, 150)
xyg = np.array(np.meshgrid(grid, grid))
X = xyg.reshape(2, -1).T
zg = f(X)
xg, yg = xyg
vlim = np.abs(zg).max()
plt.contourf(xg, yg, zg.reshape(xg.shape), 300, vmin=-vlim, vmax=vlim, cmap="bwr")
plt.colorbar(label=label)
x = np.random.normal(size=(1000, 2))
y = toy_function(x)
plt.figure(dpi=150)
contour(toy_function, label="Function value")
plt.title("Toy domain for illustrating\nsymbolic regression", fontsize=16)
plt.xlabel("$x_0$", fontsize=16)
plt.ylabel("$x_1$", fontsize=16)
plt.show()
Genetic Programming#
Genetic programming is a classic technique in AI, and interest in the idea dates back at least to Alan Turing (who proposes the idea in the same paper that introduces the imitation game, or Turing test).
The general idea is to have a:
“Population” of programs (in our case, possible symbolic expressions, usually represented as trees)
Some notion of individual “fitness” (in our case, how well they model the data, as well as their simplicity)
Some means of applying random “mutations” (e.g. adding a term, deleting a term)
“Crossover” and “subtree” mutations combine features from multiple candidate equations, imitating the process of genetic recombination from sexual reproduction. “Hoist” and “point” mutations modify individual equations, imitating the process of genetic mutation from, e.g., random errors in DNA replication.
Specific mutations aside, programs can have a limited lifespan, and those with high “fitness” produce more “offspring” (possibly-mutated versions of themselves). However, everything is probabilistic, so programs with lower fitness can still persist. This is critically important, because it allows genetic programming to overcome local minima during optimization.
Note that although genetic programming can be a robust way to solve extremely difficult non-convex optimization problems (as perhaps evidenced by life itself), it also tends to be slow and resource-intensive.
gplearn#
One nicely-documented library for “pure” genetic programming-based symbolic regression is gplearn, which provides a scikit-learn style API and a number of configuration options:
Run gplearn#
import gplearn
import gplearn.genetic
gplearn_sr = gplearn.genetic.SymbolicRegressor(
population_size=5000,
generations=50,
p_crossover=0.7,
p_subtree_mutation=0.1,
p_hoist_mutation=0.05,
p_point_mutation=0.1,
max_samples=0.9,
verbose=1,
parsimony_coefficient=0.01,
function_set=("add", "mul", "sin"),
)
gplearn_sr.fit(x, y)
| Population Average | Best Individual |
---- ------------------------- ------------------------------------------ ----------
Gen Length Fitness Length Fitness OOB Fitness Time Left
0 18.01 1.49483 6 0.484762 0.54154 2.25m
1 7.38 1.01063 10 0.330553 0.387356 1.80m
2 6.30 0.905155 10 0.329078 0.400628 1.65m
3 4.40 0.820884 10 0.31151 0.558737 1.56m
4 6.87 0.829169 15 0.270369 0.284187 1.58m
5 9.72 0.807564 16 0.249315 0.246955 1.64m
6 10.08 0.791316 16 0.0855869 0.0861863 1.66m
7 10.28 0.787084 16 0.084788 0.0933768 1.57m
8 11.10 0.786517 18 0.0540637 0.0529955 1.56m
9 11.25 0.774751 18 0.0413987 0.0351855 1.55m
10 12.60 0.717222 19 0.0283082 0.0520556 1.46m
11 15.38 0.653021 19 0.0306467 0.0310092 1.50m
12 16.65 0.625027 19 0.0310127 0.0277152 1.54m
13 16.98 0.739308 19 0.0286762 0.0487439 1.45m
14 16.88 0.639717 18 0.0275947 0.0533915 1.38m
15 16.76 0.631066 18 0.0288712 0.0419033 1.42m
16 16.70 0.641338 18 0.0274046 0.0551026 1.33m
17 16.56 0.625244 18 0.0277453 0.0520362 1.27m
18 16.67 0.640259 17 0.0293917 0.077689 1.26m
19 16.14 0.643409 17 0.0293976 0.0776353 1.18m
20 15.92 0.652471 16 0.0290779 0.105859 1.15m
21 15.75 0.640067 16 0.0290357 0.106239 1.12m
22 15.73 0.643234 16 0.0284907 0.0641858 1.10m
23 15.46 0.638325 16 0.0253916 0.0920775 1.03m
24 15.31 0.628344 16 0.0274558 0.0734996 59.08s
25 15.14 0.628068 16 0.0270337 0.0772986 59.13s
26 15.06 0.647963 16 0.0258607 0.0878558 55.25s
27 14.97 0.635775 16 0.0267475 0.0798743 52.86s
28 14.86 0.632398 15 0.0264942 0.0623323 52.37s
29 15.02 0.632915 15 0.0266271 0.0611365 47.54s
30 14.97 0.62965 15 0.025187 0.0771915 47.10s
31 14.84 0.625841 15 0.0250013 0.0757683 42.45s
32 14.84 0.629502 15 0.0247215 0.0782866 39.74s
33 14.76 0.635612 15 0.024076 0.0871909 36.60s
34 14.79 0.635136 15 0.0248787 0.0768716 35.41s
35 14.62 0.633268 15 0.0233109 0.0909819 32.00s
36 14.81 0.634576 15 0.023101 0.0856682 29.86s
37 14.91 0.632228 15 0.0212863 0.109203 28.34s
38 14.81 0.639432 15 0.0248355 0.0772609 25.22s
39 14.92 0.641709 15 0.0240545 0.0842895 22.85s
40 14.89 0.63383 15 0.0225797 0.0975635 21.45s
41 14.91 0.637992 15 0.0245694 0.0796555 18.20s
42 14.78 0.628461 15 0.0233737 0.0832134 16.09s
43 14.87 0.64086 15 0.0230527 0.0861029 14.30s
44 14.75 0.639295 15 0.0214133 0.100857 11.45s
45 14.96 0.63278 15 0.0226716 0.089533 9.20s
46 14.84 0.63221 15 0.0226284 0.0899211 7.12s
47 14.92 0.637863 15 0.0229121 0.0873682 4.61s
48 14.91 0.633585 15 0.0226576 0.0896582 2.30s
49 14.93 0.64919 15 0.0227347 0.0889648 0.00s
add(mul(0.477, mul(X0, X1)), add(mul(mul(X1, -0.501), X1), mul(X0, X0)))In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
add(mul(0.477, mul(X0, X1)), add(mul(mul(X1, -0.501), X1), mul(X0, X0)))
Interpret results#
print(gplearn_sr._program)
add(mul(0.477, mul(X0, X1)), add(mul(mul(X1, -0.501), X1), mul(X0, X0)))
This looks very close to the right answer, though the constants are slightly off.
[const for const in gplearn_sr._program.program if isinstance(const, float)]
[0.476603722906374, -0.5013678019794263]
This is likely because gplearn mutations which add or update constants always pick values by drawing random uniform values (within pre-specified ranges, by default -1 to 1). In my limited experience so far, this is one of the major inefficiencies.
PySR#
PySR is a different library for symbolic regression which, though still based on genetic programming, includes simulated annealing and gradient-free optimization to set the values of constants as well as performance improvements (although accessible via a Python API, the main library is written in Julia for speed). This seems to make it a bit more reliable and precise.
Run PySR#
import pysr
equations = pysr.PySRRegressor(
niterations=5,
binary_operators=["+", "*"], # operators that can combine two terms
unary_operators=["sin"], # operators that modify a single term
)
equations.fit(x, y)
Compiling Julia backend...
Started!
PySRRegressor.equations_ = [ pick score equation \ 0 0.000000 0.45290574 1 0.512922 (x0 * x0) 2 0.146916 ((x0 * x0) + -0.48262903) 3 0.000610 (((x0 * 1.0215279) * x0) + -0.5027918) 4 0.033717 (((x0 + sin(x1)) * x0) + -0.4817575) 5 1.547075 ((x0 * x0) + (x1 * (x1 * -0.5127887))) 6 0.000768 ((x0 * x0) + ((-0.49527335 * x1) * (x1 + -0.02... 7 0.460640 ((x0 * x0) + (((-0.49527335 * x1) + sin(x0)) *... 8 1.816956 (((x0 + (x1 * 0.3972897)) * x0) + (x1 * (x1 * ... 9 0.227155 ((x0 * (x0 + sin(x1 * 0.477271))) + ((-0.49527... 10 >>>> 0.239588 ((x0 * (x0 + sin(sin(x1 * 0.477271)))) + ((-0.... 11 0.062997 ((x0 * (x0 + sin(sin(sin(x1 * 0.477271))))) + ... 12 0.004188 ((x0 * (x0 + sin(sin(sin(sin((x1 + x1) * 0.257... loss complexity 0 2.552396 1 1 0.915018 3 2 0.682056 5 3 0.681224 7 4 0.658638 8 5 0.140204 9 6 0.139989 11 7 0.088316 12 8 0.014353 13 9 0.011437 14 10 0.009000 15 11 0.008451 16 12 0.008345 19 ]In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
PySRRegressor.equations_ = [ pick score equation \ 0 0.000000 0.45290574 1 0.512922 (x0 * x0) 2 0.146916 ((x0 * x0) + -0.48262903) 3 0.000610 (((x0 * 1.0215279) * x0) + -0.5027918) 4 0.033717 (((x0 + sin(x1)) * x0) + -0.4817575) 5 1.547075 ((x0 * x0) + (x1 * (x1 * -0.5127887))) 6 0.000768 ((x0 * x0) + ((-0.49527335 * x1) * (x1 + -0.02... 7 0.460640 ((x0 * x0) + (((-0.49527335 * x1) + sin(x0)) *... 8 1.816956 (((x0 + (x1 * 0.3972897)) * x0) + (x1 * (x1 * ... 9 0.227155 ((x0 * (x0 + sin(x1 * 0.477271))) + ((-0.49527... 10 >>>> 0.239588 ((x0 * (x0 + sin(sin(x1 * 0.477271)))) + ((-0.... 11 0.062997 ((x0 * (x0 + sin(sin(sin(x1 * 0.477271))))) + ... 12 0.004188 ((x0 * (x0 + sin(sin(sin(sin((x1 + x1) * 0.257... loss complexity 0 2.552396 1 1 0.915018 3 2 0.682056 5 3 0.681224 7 4 0.658638 8 5 0.140204 9 6 0.139989 11 7 0.088316 12 8 0.014353 13 9 0.011437 14 10 0.009000 15 11 0.008451 16 12 0.008345 19 ]
Interpret results#
def round_expr(expr, num_digits=4):
return expr.xreplace({n: round(n, num_digits) for n in expr.atoms(Number)})
round_expr(equations.sympy())
It looks like PySR is able to not only discover the correct equation, but also format it for us nicely.
In addition, it saves a Pareto frontier of possible expressions of varying complexities, which had the lowest error of any other equations of equal or lesser complexity (with a configurable trade-off rule):
equations.equations_
complexity | loss | score | equation | sympy_format | lambda_format | |
---|---|---|---|---|---|---|
0 | 1 | 2.552396 | 0.000000 | 0.45290574 | 0.452905740000000 | PySRFunction(X=>0.452905740000000) |
1 | 3 | 0.915018 | 0.512922 | (x0 * x0) | x0**2 | PySRFunction(X=>x0**2) |
2 | 5 | 0.682056 | 0.146916 | ((x0 * x0) + -0.48262903) | x0**2 - 0.48262903 | PySRFunction(X=>x0**2 - 0.48262903) |
3 | 7 | 0.681224 | 0.000610 | (((x0 * 1.0215279) * x0) + -0.5027918) | 1.0215279*x0**2 - 0.5027918 | PySRFunction(X=>1.0215279*x0**2 - 0.5027918) |
4 | 8 | 0.658638 | 0.033717 | (((x0 + sin(x1)) * x0) + -0.4817575) | x0*(x0 + sin(x1)) - 0.4817575 | PySRFunction(X=>x0*(x0 + sin(x1)) - 0.4817575) |
5 | 9 | 0.140204 | 1.547075 | ((x0 * x0) + (x1 * (x1 * -0.5127887))) | x0**2 - 0.5127887*x1**2 | PySRFunction(X=>x0**2 - 0.5127887*x1**2) |
6 | 11 | 0.139989 | 0.000768 | ((x0 * x0) + ((-0.49527335 * x1) * (x1 + -0.02... | x0**2 - 0.49527335*x1*(x1 - 0.020218268) | PySRFunction(X=>x0**2 - 0.49527335*x1*(x1 - 0.... |
7 | 12 | 0.088316 | 0.460640 | ((x0 * x0) + (((-0.49527335 * x1) + sin(x0)) *... | x0**2 + x1*(-0.49527335*x1 + sin(x0)) | PySRFunction(X=>x0**2 + x1*(-0.49527335*x1 + s... |
8 | 13 | 0.014353 | 1.816956 | (((x0 + (x1 * 0.3972897)) * x0) + (x1 * (x1 * ... | x0*(x0 + 0.3972897*x1) - 0.4747184*x1**2 | PySRFunction(X=>x0*(x0 + 0.3972897*x1) - 0.474... |
9 | 14 | 0.011437 | 0.227155 | ((x0 * (x0 + sin(x1 * 0.477271))) + ((-0.49527... | x0*(x0 + sin(0.477271*x1)) - 0.49527335*x1**2 | PySRFunction(X=>x0*(x0 + sin(0.477271*x1)) - 0... |
10 | 15 | 0.009000 | 0.239588 | ((x0 * (x0 + sin(sin(x1 * 0.477271)))) + ((-0.... | x0*(x0 + sin(sin(0.477271*x1))) - 0.49527335*x... | PySRFunction(X=>x0*(x0 + sin(sin(0.477271*x1))... |
11 | 16 | 0.008451 | 0.062997 | ((x0 * (x0 + sin(sin(sin(x1 * 0.477271))))) + ... | x0*(x0 + sin(sin(sin(0.477271*x1)))) - 0.49527... | PySRFunction(X=>x0*(x0 + sin(sin(sin(0.477271*... |
12 | 19 | 0.008345 | 0.004188 | ((x0 * (x0 + sin(sin(sin(sin((x1 + x1) * 0.257... | x0*(x0 + sin(sin(sin(sin(0.5156346*x1))))) - 0... | PySRFunction(X=>x0*(x0 + sin(sin(sin(sin(0.515... |
equations.equations_.loss
0 2.552396
1 0.915018
2 0.682056
3 0.681224
4 0.658638
5 0.140204
6 0.139989
7 0.088316
8 0.014353
9 0.011437
10 0.009000
11 0.008451
12 0.008345
Name: loss, dtype: float64
plt.figure(figsize=(12, 3), dpi=150)
plt.bar(
np.arange(len(equations.equations_)),
equations.equations_.loss,
width=0.33,
color="blue",
)
plt.yscale("log")
plt.ylabel("Mean squared error", fontsize=14, color="blue")
plt.xticks(
range(len(equations.equations_)),
[f"${latex(round_expr(v,7))}$" for v in equations.equations_.sympy_format],
rotation=30,
ha="right",
fontsize=12,
)
plt.title("PySR Pareto frontier", fontsize=16)
plt.xlabel("Expression (more complex $\\to$)", fontsize=14)
ax2 = plt.twinx()
ax2.bar(
np.arange(len(equations.equations_)) + 0.33,
equations.equations_.score,
width=0.33,
color="orange",
)
ax2.set_ylabel("PySR score", color="orange", fontsize=14)
plt.show()
Manually Constructed Library + Sparse Regression#
Another approach to performing symbolic regression is to
Hand-construct a library of basis functions (e.g. by recursively combining all terms with various operations)
Run sparse linear regression to extract a small set of basis functions
This approach works well for problems where any constants in the underlying formula can be extracted out to the outermost level (so they can be determined by linear regression). It also works well when we have domain knowledge about which terms could potentially appear in the final sum.
It does not work well in cases where constants can’t be pulled outside certain operations (e.g. in this case, with the sine operator, though powers and derivatives are generally fine). This method can also struggle when basis function values are highly correlated, though it depends on the sparse regression method.
Let’s create an augmentation function which takes an array of features and adds additional basis functions by combining them with binary multiplication and unary sine operations:
def augment(feats, names):
existing = set(names)
new_feats = []
new_names = []
for i in range(feats.shape[1]):
name = f"sin({names[i]})"
if name not in existing:
new_feats.append(np.sin(feats[:, i]))
new_names.append(name)
for j in range(i, feats.shape[1]):
name = f"mul({names[i]}, {names[j]})"
if name not in existing:
new_feats.append(feats[:, i] * feats[:, j])
new_names.append(name)
return np.hstack((feats, np.array(new_feats).T)), names + new_names
def repeatedly_augment(feats, names=None, depth=2):
if names is None:
names = [f"x{i}" for i in range(len(feats))]
for _ in range(depth):
feats, names = augment(feats, names)
return feats, names
feats, names = repeatedly_augment(np.array(x), ["x0", "x1"], depth=2)
Applying this twice brings us from 2 features up to 37:
feats.shape
(1000, 37)
We can see that the complexity of this approach starts to blow up exponentially, though, as we increase the maximum depth of an expression:
augment(feats, names)[0].shape
(1000, 742)
augment(*augment(feats, names))[0].shape
(1000, 276397)
If we need arbitrary polynomials of up to 4th order, we end up with orders of magnitude more features than samples, which will cause problems.
Note that since this method lacks the ability to apply operations with constants within other operations, \(\sin(0.5 x_0 x_1)\) isn’t present in any of these possible sets, though \(\sin(x_0 x_1)\) is.
Let’s now see how different methods for (sparse) linear regression perform on this task:
def learned_expr(coef, names, cutoff=1e-2):
coef_idx = np.argwhere(np.abs(coef) > cutoff)[:, 0]
sort_order = reversed(np.argsort(np.abs(coef[coef_idx])))
return " + ".join(
[f"{coef[coef_idx[i]]:.4f}*{names[coef_idx[i]]}" for i in sort_order]
)
Linear regression#
We can start with completely unregularized linear regression, which will find the linear combination of basis terms that minimizes mean squared error (to high precision, due to the convexity of the problem and an analytic way of solving it).
linreg = LinearRegression()
linreg.fit(feats, y)
learned_expr(linreg.coef_, names)
'1.0412*mul(x0, x0) + -0.8116*mul(sin(x0), sin(x1)) + 0.7118*mul(x0, sin(x1)) + 0.6321*mul(x1, sin(x0)) + -0.5047*mul(x1, x1) + 0.1910*x0 + -0.1659*sin(x0) + -0.0918*x1 + 0.0848*sin(x1) + -0.0723*mul(x0, x1) + -0.0513*mul(x0, sin(x0)) + 0.0340*sin(mul(x0, x1)) + -0.0267*sin(sin(x0)) + -0.0218*mul(mul(x0, x0), mul(x0, x1)) + -0.0195*mul(x0, mul(x0, x0)) + -0.0150*mul(sin(x0), mul(x0, x0)) + -0.0114*mul(mul(x0, x1), mul(x1, x1))'
Unfortunately, this doesn’t seem to give us a very sparse solution in this case.
L1-regularized linear regression (LASSO)#
A common way to make linear regression return more sparse solutions is to minimize a linear combination of the mean squared error and the sum of the magnitudes of the weights. This is called applying an “L1 penalty” since it is the L1-norm of the weight vector, and is also referred to as the “Least Absolute Shrinkage and Selection Operator”, or LASSO. From a Bayesian perspective, this corresponds to doing Bayesian linear regression with a Laplace or double-exponential prior on the weights.
Note that although this penalty does tend to encourage sparsity, it’s also meant as a regularizer, which can help with the presence of noise.
lasso = Lasso(alpha=0.01)
lasso.fit(feats, y)
learned_expr(lasso.coef_, names)
'0.9781*mul(x0, x0) + -0.4912*mul(x1, x1) + 0.1555*mul(x0, x1) + 0.1469*mul(x0, sin(x1)) + 0.1277*mul(x1, sin(x0)) + 0.1017*sin(mul(x0, x1)) + 0.0136*mul(mul(x0, x0), mul(x1, x1))'
Here, LASSO ends up doing a pretty good job finding the first two terms of the ground-truth model (\(x_0^2 - 0.5x_1^2\)), with an approximation of the remaining \(\sin(0.5 x_0 x_1)\) term as \( 0.2(x_0\sin(x_1)+x_1\sin(x_0)) + 0.1(\sin(x_0 x_1) + x_0x_1)\):
plt.figure(figsize=(4, 4), dpi=150)
x0 = x[:, 0]
x1 = x[:, 1]
sin = np.sin
plt.scatter(
sin(0.5 * x0 * x1),
0.2 * (x0 * sin(x1) + x1 * sin(x0)) + 0.1 * (sin(x0 * x1) + x0 * x1),
)
plt.plot([-1, 1], [-1, 1], color="red")
plt.xlabel(r"$\sin(0.5 x_0 x_1)$", fontsize=12)
plt.ylabel(
r"$0.2(x_0\sin(x_1)+x_1\sin(x_0))$" + "\n" + r"$+ 0.1(\sin(x_0 x_1) + x_0x_1)$",
fontsize=12,
)
plt.show()
Relevance vector machines (RVM)#
Relevance vector machines, used by Zanna and Bolton 2020 for equation discovery, provide another way of learning sparse linear models in a Bayesian manner, though with a different prior that prioritizes sparsity over shrinkage when the data permits it. RVMs can also be used with a kernel trick to learn nonlinear decision boundaries which are sparse in a kernelized feature space, though in this case, we use a linear kernel on top of our manually computed library features.
rvm = RVR(verbose=False)
rvm.fit(feats, y, names)
learned_expr(rvm.m_, rvm.labels)
'1.0157*mul(x0, x0) + -0.8171*mul(sin(x0), sin(x1)) + 0.7149*mul(x0, sin(x1)) + 0.6434*mul(x1, sin(x0)) + -0.4903*mul(x1, x1) + -0.0806*mul(x0, x1) + 0.0337*sin(mul(x0, x1)) + -0.0210*mul(mul(x0, x0), mul(x0, x1)) + -0.0178*mul(x1, sin(x1)) + 0.0124*mul(sin(x1), sin(x1)) + -0.0117*mul(mul(x0, x1), mul(x1, x1)) + -0.0113*mul(x0, sin(x0)) + -0.0107*mul(sin(x0), sin(x0))'
RVM does discover the \(x_0^2\) and \(-0.5x_1^2\) terms almost exactly, but ends up with an even more complex approximate expression for the missing \(\sin(0.5x_0x_1)\) term.
Sequentially Thresholded Least Squares (STLSQ) from pysindy#
The PySINDy library provides a number of efficient Python implementations of sparse regression algorithms. Their default regression method is sequentially thresholded least squares (STLSQ), which (as its name might suggest) repeatedly runs L2-regularized linear regression with penalty strength alpha
and prunes features with weight magnitudes below a given threshold
.
stlsq = ps.optimizers.stlsq.STLSQ(alpha=0.0001, threshold=0.1)
stlsq.fit(feats, y)
learned_expr(stlsq.coef_[0], names)
'1.0006*mul(x0, x0) + -0.4997*mul(x1, x1) + 0.3492*mul(x0, sin(x1)) + 0.3487*mul(x1, sin(x0)) + -0.1385*mul(sin(x0), sin(x1))'
This method ends up finding a sparser solution than Lasso, though I did tweak the parameters a bit to make that happen :)
It’s worth checking out PySINDy’s full suite of sparse regression methods for other approaches!
Overall, all of these methods were able to find the \(x_0^2\) and \(-0.5x_1^2\) terms, but all of them also needed to find approximations for the remaining term because it wasn’t in the feature library. Meaning the expressions are both less accurate and more complex.
Tweaking the Feature Library#
Now, we can potentially resolve some of these issues by pre-multiplying \(x_1\) by 0.5 before the augmentation process (which we’ll call \(x_{1h}\)).
feats2 = np.array(x)
feats2[:, 1] = feats2[:, 1] * 0.5
names2 = ["x0", "x1h"]
depth = 2
for _ in range(depth):
feats2, names2 = augment(feats2, names2)
In this case, the correct model should be expressed as
which is now a linear combination of our basis terms. Let’s see if these regression techniques can find it:
linreg = LinearRegression()
linreg.fit(feats2, y)
learned_expr(linreg.coef_, names2)
'-2.0000*mul(x1h, x1h) + 1.0000*sin(mul(x0, x1h)) + 1.0000*mul(x0, x0)'
lasso = Lasso(alpha=0.01)
lasso.fit(feats2, y)
learned_expr(lasso.coef_, names2)
'-1.8387*mul(x1h, x1h) + 0.9846*mul(x0, x0) + 0.8594*sin(mul(x0, x1h)) + -0.0415*mul(mul(x1h, x1h), mul(x1h, x1h)) + 0.0332*mul(x0, x1h) + 0.0110*mul(mul(x0, x0), mul(x0, x1h))'
rvm = RVR(verbose=False)
rvm.fit(feats2, y, names2)
learned_expr(rvm.m_, rvm.labels)
'-2.0000*mul(x1h, x1h) + 1.0000*sin(mul(x0, x1h)) + 1.0000*mul(x0, x0)'
stlsq = ps.optimizers.stlsq.STLSQ(alpha=0.0001, threshold=0.1)
stlsq.fit(feats2, y)
learned_expr(stlsq.coef_[0], names2)
'-2.0000*mul(x1h, x1h) + 1.0000*mul(x0, x0) + 1.0000*sin(mul(x0, x1h))'
In this case, every method except Lasso finds the true model exactly (which held true across many choices of regularization parameter).
Handling noise#
What if we add noise to the data?
noise = np.random.normal(size=feats2.shape) * 0.01
linreg = LinearRegression()
linreg.fit(feats2 + noise, y)
learned_expr(linreg.coef_, names2)
'0.8985*mul(x0, x0) + -0.8420*mul(x1h, x1h) + 0.8242*sin(mul(x0, x1h)) + -0.6962*mul(x1h, sin(x1h)) + -0.2801*mul(sin(x1h), sin(x1h)) + -0.2216*mul(mul(x1h, x1h), mul(x1h, x1h)) + 0.2033*mul(x0, sin(x0)) + -0.1835*sin(mul(x1h, x1h)) + 0.1605*mul(x1h, sin(x0)) + -0.1359*mul(sin(x0), sin(x0)) + -0.1199*mul(x1h, mul(x0, x0)) + 0.1141*mul(sin(x1h), mul(x1h, x1h)) + -0.1095*x1h + 0.0943*mul(x0, x1h) + -0.0878*mul(sin(x0), sin(x1h)) + -0.0755*mul(mul(x0, x1h), sin(x1h)) + 0.0680*x0 + 0.0644*mul(mul(x0, x0), sin(x1h)) + 0.0612*sin(sin(x1h)) + 0.0579*mul(x0, mul(x0, x1h)) + -0.0555*mul(x1h, mul(x1h, x1h)) + -0.0512*mul(mul(x0, x1h), mul(x1h, x1h)) + -0.0496*sin(sin(x0)) + 0.0372*sin(x1h) + 0.0371*mul(mul(x0, x0), mul(x1h, x1h)) + -0.0369*mul(mul(x0, x1h), mul(x0, x1h)) + 0.0354*mul(x1h, mul(x0, x1h)) + 0.0297*mul(x0, sin(x1h)) + 0.0193*mul(sin(x0), mul(x1h, x1h)) + -0.0186*sin(x0) + 0.0140*sin(mul(x0, x0)) + -0.0110*mul(sin(x0), mul(x0, x0)) + 0.0110*mul(x0, mul(x1h, x1h))'
lasso = Lasso(alpha=0.01)
lasso.fit(feats2 + noise, y)
learned_expr(lasso.coef_, names2)
'-1.8389*mul(x1h, x1h) + 0.9849*mul(x0, x0) + 0.8410*sin(mul(x0, x1h)) + 0.0467*mul(x0, x1h) + -0.0417*mul(mul(x1h, x1h), mul(x1h, x1h)) + 0.0101*mul(mul(x0, x0), mul(x0, x1h))'
rvm = RVR(verbose=False)
rvm.fit(feats2 + noise, y, names2)
learned_expr(rvm.m_, rvm.labels)
'0.9690*mul(x0, x0) + -0.8764*mul(x1h, x1h) + 0.8401*sin(mul(x0, x1h)) + -0.7110*mul(x1h, sin(x1h)) + -0.2427*mul(sin(x1h), sin(x1h)) + -0.2120*mul(mul(x1h, x1h), mul(x1h, x1h)) + -0.1688*sin(mul(x1h, x1h)) + 0.1190*mul(x1h, sin(x0)) + 0.0997*mul(x0, x1h) + 0.0927*mul(sin(x1h), mul(x1h, x1h)) + -0.0619*mul(mul(x0, x1h), sin(x1h)) + -0.0614*mul(x1h, mul(x1h, x1h)) + 0.0600*mul(x0, sin(x0)) + 0.0495*mul(mul(x0, x0), sin(x1h)) + -0.0479*mul(x1h, mul(x0, x0)) + -0.0465*mul(mul(x0, x1h), mul(x1h, x1h)) + 0.0422*mul(x1h, mul(x0, x1h)) + -0.0377*mul(sin(x0), sin(x1h)) + -0.0339*mul(sin(x0), sin(x0)) + 0.0104*mul(sin(x0), mul(x1h, x1h))'
stlsq = ps.optimizers.stlsq.STLSQ(alpha=0.0001, threshold=0.1)
stlsq.fit(feats2 + noise, y)
learned_expr(stlsq.coef_[0], names2)
'-1.1666*mul(x1h, x1h) + 1.0006*mul(x0, x0) + 0.9962*sin(mul(x0, x1h)) + -0.8574*mul(x1h, sin(x1h)) + -0.1116*mul(mul(x1h, x1h), mul(x1h, x1h))'
In this case, it actually becomes Lasso which gets closest to the true model (i.e. is the only regression technique whose first three leading terms match the ground-truth model). This reversal illustrates how noise can strongly impact the effectiveness of different methods for learning symbolic models.
Let’s see how genetic programming-based techniques deal with noise.
noisy_equations = pysr.PySRRegressor(
niterations=5,
binary_operators=["+", "*"], # operators that can combine two terms
unary_operators=["sin"], # operators that modify a single term
)
noisy_equations.fit(x, y + np.random.normal(size=y.shape));
Started!
noisy_equations.equations_
complexity | loss | score | equation | sympy_format | lambda_format | |
---|---|---|---|---|---|---|
0 | 1 | 3.532013 | 0.000000 | 0.4358215 | 0.435821500000000 | PySRFunction(X=>0.435821500000000) |
1 | 3 | 1.836234 | 0.327075 | (x0 * x0) | x0**2 | PySRFunction(X=>x0**2) |
2 | 5 | 1.586974 | 0.072944 | ((x0 * x0) + -0.49931914) | x0**2 - 0.49931914 | PySRFunction(X=>x0**2 - 0.49931914) |
3 | 7 | 1.584167 | 0.000885 | (((x0 * x0) + -0.5353973) * 1.0403335) | 1.0403335*x0**2 - 0.55699174699955 | PySRFunction(X=>1.0403335*x0**2 - 0.5569917469... |
4 | 8 | 1.583757 | 0.000259 | (((x0 * x0) + sin(-0.5433997)) * 1.0424162) | 1.0424162*x0**2 - 0.538980219457963 | PySRFunction(X=>1.0424162*x0**2 - 0.5389802194... |
5 | 9 | 1.000580 | 0.459220 | ((x0 + (x1 * -0.37969938)) * (x0 + x1)) | (x0 - 0.37969938*x1)*(x0 + x1) | PySRFunction(X=>(x0 - 0.37969938*x1)*(x0 + x1)) |
6 | 12 | 0.926672 | 0.025578 | ((x0 + (x1 * sin(-0.37969938 + -0.10741321))) ... | (x0 - 0.46807627315713*x1)*(x0 + x1) | PySRFunction(X=>(x0 - 0.46807627315713*x1)*(x0... |
7 | 13 | 0.909419 | 0.018794 | ((x0 + (x1 * sin(sin(sin(sin(-0.69789875))))))... | (x0 - 0.534615915297832*x1)*(x0 + x1) | PySRFunction(X=>(x0 - 0.534615915297832*x1)*(x... |
8 | 15 | 0.909410 | 0.000005 | ((x0 + (x1 * sin(sin(sin(-0.34467646 + sin(-0.... | (x0 - 0.536145389299422*x1)*(x0 + x1) | PySRFunction(X=>(x0 - 0.536145389299422*x1)*(x... |
round_expr(noisy_equations.sympy())
noisy_equations.equations_.sympy_format.values[-4]
It looks like PySR was still able to discover the true expression as part of its Pareto frontier, though with the noise it’s only rated third-best for the default tradeoff between complexity and performance (with a sin-less version taking first).
Other Methods for Symbolic Regression#
To finish the notebook, here are a few other techniques for symbolic regression that don’t fit under either previously described methods (and all developed in the last several years):
STreCH (Julia implementation here) formulates symbolic regression as a non-convex mixed-integer nonlinear programming (MINLP) technique, which it solves using a heuristic-guided cutting plane formulation.
EQL (Python implementation here) develops “equation layers” for neural networks which can be readily interpreted and also chained together, and are regularized with a thresholded L1 penalty to encourage sparsity.
AI Feynman (hybrid Python-Fortran implementation here) starts by training a neural network, then computes input gradients of the learned network, and then uses attributes of those gradients to decompose the symbolic regression problem into something more tractable using graph theory.
Symbolic Metamodeling (Python implementation here) uses gradient descent to learn compositions of Meijer-G functions, a flexible family that can be programmatically projected back onto a wide set of analytic, closed-form, or even algebriac expressions.