I believe pretty strongly in evolution. I’m quite skeptical of people who argue against it. But if you asked me to lay out the strongest case for evolution, until recently… I’d be pretty hard-pressed? I’d probably mumble something about fossils and then get kind of embarrassed.
If I can’t make the case, then why do I believe it so strongly? I think the honest answer is that I find the people who argue for evolution much more credible than the people who argue against it. It seems pretty clear to me that evolution is a widely accepted scientific consensus, and I trust scientific consensus unless I have a really really good reason not to. And yes, I’m sure you can find a good scientist who rejects it – but you can find good scientists with almost any view you want if you look hard enough. Knowing why X is true vs. thinking X is true because a smart person you know believes X is what I’d call “first order knowledge” vs. “second order knowledge”. I think good second order knowledge is a totally valid reason to believe something. Second order knowledge is kind of fascinating in its own right, but let’s leave that for another post.
Historically that’s why I’ve believed in evolution and I think that’s a fine reason, but… I kind of want to be able to make the case myself? I want to be able to make the case for evolution from first principles.
So I read The Greatest Show on Earth in which Dawkins argues that evolution is undeniable (although it’s denied every day, a fact which clearly irritates him to no end). Here’s my attempt at summarizing his 500 page book into a very short blog post so that my future forgetful self can remind himself of the main argument from time to time.
You only need to accept two very plausible sounding premises for evolution to become almost a logical inevitability.
If there is variation among individuals within a species, then surely that variation will have some impact, however small, on the likelihood of reproduction. If the variation is heritable, then the variation which improves the likelihood of reproduction will be passed on more than the variation that decreases it.
If you let this process iterate for a really really long time – an unfathomably long time – you’ll end up with things as different as birds, sharks, and humans all from a common ancestor.
I think the best evidence is that we can directly observe both these premises on human timescales. Consider selective dog breeding, or banana farmers selecting for larger bananas. In a few hundred years you can end up with a drastically different gene pool for dogs or bananas. In those examples, one can observe both the fact that variation is introduced from generation to generation, and that the variation is heritable via selective breeding.
In the cases where we can directly inspect DNA, we can observe mutations from one generation to the next (presumably random, although that seems hard to prove). And we can observe that those mutations are decently likely to be inherited by the next generation.
A common objection: in order to evolve from a single-celled organism to a human you need countless mutations, each of which is incredibly unlikely. If you multiply those tiny probabilities together, the probability of this happening is essentially zero.
First off, the timescale of evolution is hard to fathom.
Second, saying the probability of our particular sequence of mutations is essentially zero feels similar (to me) to drawing a number between 1 and a billion, getting 487223, and saying the probability of getting that particular number was essentially zero. Any time you observe one particular outcome out of an immense set of possibilities, the probability of that particular observation a priori will be very small, but something had to happen. If you draw a number between 1 and a billion, you will - with 100% probability - observe an event which had only a 1 in a billion chance of happening. There’s a sort of meta perspective which makes it unsurprising to observe such a “surprising” event.
Our particular sequence of mutations was exceedingly unlikely – granted! – but it didn’t have to happen that way. This is made clear by the fact that apes went through a different sequence of mutations, as did sharks, as did mushrooms.
Another common objection: if both humans and whales evolved from the same common ancestor, where are the whale-people? Where are the whale-people fossils? How come you can’t find any?
This is just a straight up misunderstanding. Evolution does not predict that there will be species A and B and then every variation in between. What it predicts is that there will be some common ancestor of A and B that lived probably a super long time ago and that probably didn’t look anything like either A or B does today!
For A and B consider humans and whales. According to Google, the last common ancestor of humans and whales was a small land-dwelling shrew-like creature. That’s nothing like either a human or a whale. It’s not that humans evolved from whales or that whales evolved from humans, it’s that we both evolved from shrews.
So you shouldn’t expect to see - even in the fossil record - some whale-human fossil. You might expect to find something that’s kind of between a whale and a shrew, and something else kind of between a human and a shrew, but even that is probably very oversimplified.
Biologists like classifying stuff. Animals are in a different “kingdom” than plants or fungi. Within the animal kingdom, animal species are separated into different “phyla”, e.g. spiders are in the “arthropod” phylum and tigers are in the “chordata” phylum.
So what phylum was the most recent common ancestor of spiders and tigers in? It’s really hard to say. It’s probably not even well defined.
I suspect this is NOT true, but let’s say the most recent common ancestor of tigers and spiders looked almost identical to a spider. Let’s just say it was a spider, species-wise. Now trace the lineage between that common ancestor and some particular tiger today. At what point did the species change from spider to tiger? More realistically, that lineage went through many different species, so let’s just ask: what was the first ancestor along that lineage that wasn’t a spider?
Isn’t the definition of species “a group of organisms that can reproduce naturally with one another”? Do you think there was ever a particular generation which was so different from the last that it couldn’t interbreed? I suspect not. So… that means that we can create a chain from spider to tiger where every link in the chain should be considered the same species as the last, and yet… we end up with a different species?
The answer is that all this is fuzzier and more continuous than biology class might sometimes make it seem. Classifications are useful in practice, but the idea that you can cleanly divide all living things (including common ancestors) into different groups based on the ability to interbreed is just false.
Dawkins likes the analogy of sculpting, because a sculptor – unlike other artists – is mostly in the business of subtracting. A sculptor starts with a solid block of marble and gradually cuts away pieces until a statue remains.
Evolution works by a similar process. It gradually cuts away genes that are not advantageous (genes that make reproduction less likely) from the gene pool. Any given animal or gene is not directly affected by evolution, but by making genes more or less likely to be passed on via reproduction, evolution sculpts the gene pool over generations.
Dawkins described a way to conceptualize evolution in terms of a simple computer program.
Let’s assume we start with a population with a uniform distribution over some attribute - say, height. At each time step, we simulate reproduction (asexual, to keep things really simple) and we give a slight reproductive advantage to members of the population with height above some arbitrary threshold. Here is one possible outcome of such a simulation:
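The simulation itself might look something like the following sketch. This is my own reconstruction, not Dawkins’ actual program; the population size, threshold, and size of the reproductive advantage are made-up numbers.

```python
import random

def simulate(generations=200, pop_size=1000, threshold=1.7, advantage=1.1):
    # Start with heights uniformly distributed between 1.5m and 1.9m.
    population = [random.uniform(1.5, 1.9) for _ in range(pop_size)]
    for _ in range(generations):
        # Individuals above the threshold get a slight reproductive advantage.
        weights = [advantage if h > threshold else 1.0 for h in population]
        parents = random.choices(population, weights=weights, k=pop_size)
        # Asexual reproduction: child height = parent height + a small mutation.
        population = [h + random.gauss(0, 0.01) for h in parents]
    return population

random.seed(0)
final = simulate()
mean_height = sum(final) / len(final)
```

Even with a tiny per-generation advantage, the population mean drifts well above the threshold after a couple hundred generations.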
Matter is made of atoms. Atoms come in different flavors, which we call elements. The number of protons is fixed for any given element, and is equal to the number of electrons. So, for example, carbon atoms have 6 protons and 6 electrons while nitrogen atoms have 7 of each.
The number of neutrons, however, is not fixed. Carbon atoms can have different numbers of neutrons. These different variations are called “isotopes”. The most common three isotopes of carbon are carbon-12 (98.9%), carbon-13 (1.1%), and carbon-14 (<0.01%). These isotopes have 6, 7, and 8 neutrons respectively. The number associated with each isotope is the “mass number”, which equals the number of protons plus neutrons (electrons are so light they don’t contribute to the mass number).
Certain isotopes are unstable, by which I mean they spontaneously decay into something else, at a predictable rate. The predictability of the rate of decay is key.
The decay tends to happen in one of three ways:
For example, potassium-40 decays into argon-40. The favored measure of decay rate is called the “half-life”, which is the amount of time it takes for half of the potassium-40 atoms to decay into argon-40 atoms. For this particular pairing, the half-life is 1.26 billion years.
Why does this help? Well, if you happen to know that at some moment in time in the past, let’s call it time X, there was only potassium-40 and no argon-40, then by measuring the ratio of potassium-40 to argon-40 now, you can compute the amount of time that has passed since X. It’s important to note that this only works if you know that, at time X, there was no argon-40. In this case, igneous rocks are solidified from molten rock (magma or lava) and at the moment of solidification, the rock contains potassium-40 but no argon-40.
So let’s say you take an igneous rock, measure its potassium-40 and argon-40, and find that half of the original potassium-40 has decayed into argon-40 (i.e. the two are now present in equal amounts). Since we know the half-life of potassium-40 is 1.26 billion years, that means the rock was formed about 1.26 billion years ago. What if only a quarter of the potassium-40 remains? Then the amount of potassium-40 has halved twice, so it’s been 2.52 billion years since that particular rock was formed. Hopefully it’s clear how this generalizes.
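The generalization is a logarithm. A quick sketch of the arithmetic:

```python
import math

HALF_LIFE_K40 = 1.26e9  # years, for potassium-40 decaying into argon-40

def age_from_fraction_remaining(fraction):
    """Years elapsed, given the fraction of the original potassium-40 left."""
    # Each half-life halves the amount, so fraction = 0.5 ** (t / half_life).
    # Solving for t gives t = half_life * log2(1 / fraction).
    return HALF_LIFE_K40 * math.log2(1 / fraction)

age_half = age_from_fraction_remaining(0.5)      # one half-life: 1.26 billion years
age_quarter = age_from_fraction_remaining(0.25)  # two half-lives: 2.52 billion years
```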
And here we come to the point of this post. It seems to me that this real-world scenario can motivate a host of interesting math questions, across a surprisingly broad range of difficulty levels.
For elementary school, you can set up the ratios to form an integer number of half-lives:
For high school, you can make the ratios whatever you want, which requires logs.
You can also get into “experiment design” or “thinking like a scientist”:
For college, you can start to introduce differential equations.
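This part is standard physics rather than anything specific to the post: “a fixed fraction of atoms decays per unit time” is the differential equation, and solving it recovers the half-life rule used above.

\[\begin{align} \frac{dN}{dt} &= -\lambda N \\ N(t) &= N_0 e^{-\lambda t} \\ t_{1/2} &= \frac{\ln 2}{\lambda} \end{align}\]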
You can make this really hard:
Is it harder if you make a loop?
Other people at my company have written less general frameworks. They put restrictions on the type of programs that you’re allowed to write within their framework. This is all a little abstract, so here’s an example:
One common restriction is that your component structure will be a DAG. By DAG, I mean a “directed acyclic graph”. The key word here is acyclic. If component A knows about component B, then B can’t know about A. There is a sort of one-way directionality to the arrangement of the components.
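To make the DAG restriction concrete, here’s a hypothetical sketch (not any real framework’s API) of how a framework could reject component wiring that contains a cycle:

```python
def has_cycle(edges):
    """edges: dict mapping each component to the components it knows about."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, in progress, done
    color = {node: WHITE for node in edges}

    def visit(node):
        color[node] = GRAY
        for neighbor in edges.get(node, []):
            if color.get(neighbor, WHITE) == GRAY:
                return True  # back edge: we found a cycle
            if color.get(neighbor, WHITE) == WHITE and visit(neighbor):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in edges)

# A knows about B, B knows about C: a valid DAG.
assert not has_cycle({"A": ["B"], "B": ["C"], "C": []})
# But if C also knows about A, the framework would reject the wiring.
assert has_cycle({"A": ["B"], "B": ["C"], "C": ["A"]})
```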
Oh yeah, what even is a framework? Obviously they come in all shapes and sizes, but I think a pretty common theme is they have some conception of a “component” and that an application is built by composing the components together in some way. Again – very abstract, but useful as a way to talk about frameworks.
Generally speaking, components talk to each other. They pass data between one another. This leads me to another example of a framework which puts a rather severe restriction on the applications that can be built within it. The framework wants to control the flow of data between components, and therefore provides a single mechanism for writing data. The particular mechanism isn’t very interesting, it’s effectively a function you can call to write a single “atom” of data to downstream listeners.
This has big implications. For example, it means that data flow is push-based and not pull-based. You can’t “ask for the next piece of data”, you just get it whenever it’s available, process it, and then maybe push data downstream. Since the data flow is not pull based, it rules out interactions like a component saying “something changed and now I’d like data for stock B in addition to what you’re already sending me”.
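A push-based mechanism like that might look something like this sketch (all names are hypothetical, not the actual framework):

```python
class Component:
    """A component that receives pushed data and may push data downstream."""
    def __init__(self):
        self.listeners = []

    def subscribe(self, listener):
        self.listeners.append(listener)

    def push(self, atom):
        # The single write mechanism: hand one atom of data to each listener.
        for listener in self.listeners:
            listener.on_data(atom)

class Doubler(Component):
    def on_data(self, atom):
        # Process whatever arrives, then push the result downstream.
        self.push(atom * 2)

class Collector(Component):
    def __init__(self):
        super().__init__()
        self.seen = []

    def on_data(self, atom):
        self.seen.append(atom)

source, doubler, sink = Component(), Doubler(), Collector()
source.subscribe(doubler)
doubler.subscribe(sink)
source.push(21)  # data flows source -> doubler -> sink
```

Note there is deliberately no way for `sink` to ask `doubler` for the next piece of data; it can only react to what arrives.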
My framework doesn’t constrain applications like that. It’s super flexible. I thought that was a good thing. But there are two major costs for that level of generality:
People ask, “so what can components of your framework do?” The answer is: anything. That’s cool, I guess, but it certainly doesn’t help me understand.
The more concrete and specific a thing is, the easier it is to understand (usually). The more abstract and general a thing is, the harder it is to understand.
What does the framework actually do? The answer is not much. It’s more of a way to structure your program than a thing that actually provides functionality to you at runtime.
Why? Because it kind of can’t do much. By being completely agnostic to, for example, how components communicate, it can’t publish metrics on how much data is flowing between which components. It definitely can’t run two components on different processes because it has no idea how to ferry data from one component to another.
One way to think about this is that every restriction a framework puts on the application is giving the framework information about how the application will (or will not) behave. Sometimes, that information is really useful and enables the framework to do non-trivial work for you.
Even in my own, extremely general framework, the most important part is what you aren’t allowed to do. Certain components aren’t allowed to do IO. By adding this restriction, we can create applications that are much more testable, and can even be simulated using historical data.
But what about the fact that every restriction limits the type of programs that work within that framework? Don’t you want your framework to be as widely applicable as possible?
Totally. Like almost everything else in life, it’s a trade-off. There’s a spectrum from broadly-applicable-but-only-a-little-helpful to narrowly-applicable-but-extremely-helpful.
The idea that the restrictions are where the value comes from reminds me of another domain: programming languages.
There are languages out there that are extremely flexible. They let you do basically anything. You want to add a number to a string? Sure! You want to subtract a number from a string? No problem!
'10' + 3; // '103'
'10' - 3; // 7
I don’t like these languages. Don’t get me wrong, I’m not here to hate on python or javascript. I use them and they’re incredibly useful. But given the choice, I’ll take a strongly typed compiled programming language any day of the week (especially for large programs). Why? Because it stops me from doing crazy things like trying to do math on strings.
Python is a nice example because you can write it with or without types. Adding type annotations to a python program is very clearly a restriction on the set of programs you can run. You write a program in python. You can run it! You add type annotations to that program and typecheck it. Maybe it typechecks? Or maybe it doesn’t. The “value” comes from stopping you from running programs that don’t typecheck. No new functionality magically comes from typechecking. It’s just a way to stop you from doing things that are probably (but not necessarily!) wrong.
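To make that concrete (a hypothetical function; the checker could be mypy or similar):

```python
def total_cost(price: float, quantity: int) -> float:
    return price * quantity

ok = total_cost(9.99, 3)        # fine at runtime, and typechecks
weird = total_cost("9.99", 3)   # also runs! str * int repeats the string
                                # ...but a typechecker would reject this call
```

The second call happily returns `"9.999.999.99"` at runtime; the annotations add no behavior, they only let a tool tell you that passing a string where a `float` is expected is probably a mistake.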
I recently listened to a podcast about programming where they discussed this trade-off between broadly-applicable-but-only-a-little-helpful and narrowly-applicable-but-extremely-helpful in a different context. Here’s a quote (by Yaron Minsky):
But, like, in some sense the scale of optimizations are very different. Like, if you come up with a way of making your compiler faster that, like, takes most user programs and makes them 20% faster, that’s an enormous win. Like, that’s a shockingly good outcome. Whereas, if you give people good performance engineering tools, the idea that they can think hard about a particular program and make it five, or 10, or 100 times faster is like, in some sense, totally normal.
I work in the stock market. If I told you that there’s a 25% chance that Apple stock was going to be worth \$200 and a 75% chance it would be worth \$160, what is the most you would pay for 1 share? I claim the “right” answer is $0.25 \cdot \$200 + 0.75 \cdot \$160 = \$170$. In other words, I think the “fair” value of the stock is \$170. That also happens to be the mean. Why didn’t I pick the median or the mode?
I think it comes down to how bets work in the stock market. In the stock market, if you pay \$X for something and it turns out to be worth \$Y you get (or pay) \$(Y - X). When bets work that way, the optimal bet to make is to buy below the mean and to sell above the mean. The mean is the value where you don’t expect to make or lose money regardless of whether you buy or sell. It’s “fair”.
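That arithmetic as a tiny sketch:

```python
def fair_value(outcomes):
    """Mean of an outcome distribution: sum of probability * value."""
    return sum(p * v for p, v in outcomes)

apple = [(0.25, 200), (0.75, 160)]
value = fair_value(apple)  # 170.0

# If you pay X, your expected profit is (mean - X):
# positive below the mean, negative above it, exactly zero at the mean.
def expected_profit(price):
    return fair_value(apple) - price
```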
What if bets worked differently? Right, that’s where I was going with this.
What if the stock market worked totally differently? What if you had to make a guess at where the stock was going to end up at the end of the year. For simplicity, let’s say stock prices always got rounded to the nearest dollar. If you guessed right, you get \$100. If you guessed wrong, you get \$0. Now, assuming Apple stock had the same possible outcomes (25% chance of \$200 and 75% chance of \$160), what would you bet?
\$160 of course. You’d bet the mode! All that matters is whether you’re right or not, and the mode is the most likely value to be right.
Ok, but that seems really weird and contrived. I kind of agree, but isn’t that kind of how horse races work? You just bet on a horse and you get money if you’re right? Ignoring all the complexity of odds, you’d want to just bet on the mode (the horse that’s most likely to win).
Now let’s come up with a betting structure for which the median is the “fair” value. I even bet you’ve used this one before with your friends. The way the bet works is that you each put up \$20 and guess the value of something. Whoever is closer gets the money.
In this case, you should bet the median! Half the probability mass is below your guess and half is above it. If your opponent guesses anything other than the median, more than half the probability mass is closer to your guess than to theirs, so you win more often than you lose. It’s optimal!
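A quick Monte Carlo sketch of the “closest wins” bet. The distribution here is my own choice (a lognormal), picked only because its mean and median differ noticeably:

```python
import math
import random

random.seed(1)

# Lognormal(0, 1): median = e^0 = 1.0, mean = e^0.5 ~ 1.65.
samples = [random.lognormvariate(0, 1) for _ in range(100_000)]
median_guess = 1.0
mean_guess = math.exp(0.5)

# You guess the median, your friend guesses the mean; closest guess wins.
wins = sum(abs(x - median_guess) < abs(x - mean_guess) for x in samples)
win_rate = wins / len(samples)
```

With this distribution, guessing the median wins roughly 60% of the time against a mean-guesser.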
I don’t have any insightful grand finale here. I just found it interesting that I almost never think about values that aren’t “the mean” (although I do think a lot about variance and correlations) and I think this is a pretty plausible explanation as to why. If I were in a business where “closest wins”, I bet I would care a whole lot more about the median than the mean.
In How good is Elo for predicting chess?, we observed that the Elo formula systematically overpredicts the expected score of the better player. For example, if one player has an Elo rating that’s 400 above their opponent, the Elo formula predicts an average score of 0.91 (which you can very approximately interpret as a 91% chance of winning), however empirically that player only averages a score of about 0.85 (using a dataset of about 10 million online games played on https://lichess.org).
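For reference, the Elo expected-score formula being discussed, as a short snippet:

```python
def elo_expected_score(rating_diff):
    """Expected score of the higher-rated player, given the rating difference."""
    return 1 / (1 + 10 ** (-rating_diff / 400))

prediction = elo_expected_score(400)  # ~0.909, vs ~0.85 observed empirically
```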
In Noisy Elo, we made a guess at what might explain the underperformance of actual expected scores relative to the Elo formula’s prediction: maybe Elo ratings are a noisy measurement of true Elo ratings. We assumed that the noise was normally distributed with zero mean, and we fit a standard deviation empirically. It turned out a standard deviation of 110 fit the data well.
We left off by musing about what could give us more confidence in our guess that noise is what explains why the Elo formula overpredicts the expected score of the better player. This situation felt reminiscent of progress in physics:
You have a theory for how the world works, but new empirical data shows up that disagrees with the theory. Assuming the new data is valid, you try to come up with a new model that fits the data. And you find one! Your model fits the new data well. But how do you convince others to use your new model, especially in light of potentially other new models that also fit the data? One approach is to propose a new experiment for which your model predicts a different outcome than the old model. If your model predicts the correct outcome of a yet-to-be-conducted experiment, for which the old model would have been wrong, that’s fairly compelling evidence to start using it in place of the old model.
Here’s the intuition for the experiment: All the Elo formula cares about is the difference of the two ratings. It doesn’t matter whether the two players have ratings of (1500 and 1900) or (2500 and 2900) – both pairs of ratings have a difference of 400 and so the Elo formula will predict the same score for the better player in both scenarios.
Our model might not have that property. Intuitively, I’d expect our model to predict a lower score for the 2900 player in the 2500 vs. 2900 game than the 1900 player in the 1500 vs. 1900 game. Why? Because our model won’t “believe” the 2500 or 2900 ratings as much as the 1500 and 1900 ratings because they’re so rare. It will assume a lot of the 2500 and 2900 rating is noise. I’m expecting our model to “squeeze” the 2500 and 2900 ratings closer together than the 1500 and 1900 ratings when predicting the true Elos. If that’s right, then the expected score of the 2900 player will be closer to 0.5 (an even match) than the expected score of the 1900 player.
If our model makes different predictions than the tweaked Elo formula that we fit to empirical data, it provides an opportunity to test our theory. We can see whether empirical results depend on the absolute ratings or just the difference between ratings. If empirical results depend on absolute ratings, that suggests our theory might be correct.
Let’s see!
The plan: I want to compute the expected score of a 1900 rated player when playing a 1500 rated player. Likewise for a 2900 rated player when playing a 2500 rated player.
Assumptions: We’re going to assume that “true Elo” is normally distributed with a mean of 1630 and a stdev of 290. Ratings are a noisy measurement of true Elo, with noise $\mathcal{N}(0, 110)$ (fit empirically in the last post), which makes ratings $\mathcal{N}(1630, 310)$ (also fit empirically).
As a step towards computing the expected score of a 1900 rated player when playing a 1500 rated player, let me first compute the expected true Elo of a 1900 rated player. In fact, why don’t I just compute it for all possible ratings:
Huh! That looks… more like a line than I expected. I added the line $y=x$ to help see that the slope of E[true Elo | rating] is less than 1. In other words, we tend to expect true Elos to be closer to the population average (1630) than their rating. That part definitely makes sense.
But I didn’t expect it to be a straight line. Remember, I expected our model to squeeze 2500 and 2900 ratings closer together than 1500 and 1900. But that’s not what this line is telling me. The fact that it’s a line means it will squeeze them by exactly the same amount. Let’s demonstrate that:
Rating | Expected true Elo |
---|---|
1500 | 1514.5 |
1900 | 1869.8 |
2500 | 2402.7 |
2900 | 2758.0 |
The expected difference in true Elo between the 1500 and 1900 rated players is (1869.8 - 1514.5) = 355.3 and the expected difference in true Elo between the 2500 and 2900 rated players is (2758 - 2402.7) = 355.3. The same!
So my intuition was wrong. The expected difference in true Elo between two players does not depend on their absolute ratings, only the difference between their ratings. The experiment failed before it even began…
Well, the experiment failed, but maybe there’s a silver lining. Recall that the Elo formula predicts a score of 0.91 for the better player when the rating difference is 400, but we found that empirically the rating difference needed to be more like 525 in order for the expected score of the better player to be 0.91. Maybe we have a simple explanation for this. Maybe we need the rating difference to be 525 because then the expected true Elo difference is 400. Let’s check.
Rating | Expected true Elo |
---|---|
1500 | 1514.5 |
2025 | 1980.8 |
The difference in the expected true Elo is (1980.8 - 1514.5) = 466. Huh! Wrong again.
I made a classic mistake (and I honestly did make this mistake) - did you spot it? I assumed that EloFormula(E[true Elo difference]) = E[EloFormula(true Elo difference)]. I computed the former by plugging the expected true Elo difference into the formula, but what we really want to compute is the expected value of the formula over all the possible true Elo differences (weighted by their probability).
For concave functions, $E[f(x)] < f(E[x])$ (this is Jensen’s inequality), and the Elo formula is concave in the region we care about (differences well above zero). I can show you why on the graph:
What we need to do is compute E[f(x)] – not f(E[x]). To do that, let’s first find the PDF of the true Elo difference given a rating difference of 525, and then use that entire distribution to compute the expected score of the better player (by plugging those true Elo differences into the Elo formula and taking a weighted average of the results).
How can we compute the PDF of the true Elo difference given a rating difference of 525?
Here’s a fact about adding two normal distributions that we’ve used before:
\[\begin{align} X &\sim \mathcal{N}( \mu_x,\sigma_x) \\ Y &\sim \mathcal{N}( \mu_y,\sigma_y) \\ X + Y &\sim \mathcal{N}( \mu_x + \mu_y,\sqrt{\sigma_x^2 + \sigma_y^2}) \\ \end{align}\]In words: the means add and so do the variances.
But what if you observe $X + Y$ and you want to work backwards and produce your new best guess for the distribution for $X$? Let’s say you observed $X + Y = z$:
\[\begin{align} [X | X + Y = z] \sim \mathcal{N}\Bigg(\frac{\mu_x \frac{1}{\sigma_x^2} + z \frac{1}{\sigma_y^2}}{\frac{1}{\sigma_x^2} + \frac{1}{\sigma_y^2}}, \sqrt{\frac{1}{\frac{1}{\sigma_x^2} + \frac{1}{\sigma_y^2}}}\Bigg) \end{align}\]Wow… shoot me now. How would you ever remember that? Well, it helps to define what’s called precision. Precision is just one over variance: $p = 1/\sigma^2$.
Armed with this new notation, the formula becomes a lot more manageable:
\[\begin{align} [X | X + Y = z] \sim \mathcal{N}\Bigg(\frac{\mu_x p_x + z p_y}{p_x + p_y}, \sqrt{\frac{1}{p_x + p_y}}\Bigg) \end{align}\]It gets even nicer if we’re willing to parameterize normal distributions in terms of precision. Let’s say $\mathcal{N_p}(\mu, p)$ stands for a normal distribution with a mean of $\mu$ and a precision of $p$. Then we can say:
\[\begin{align} [X | X + Y = z] \sim \mathcal{N_p}\Bigg(\frac{\mu_x p_x + z p_y}{p_x + p_y}, p_x + p_y\Bigg) \end{align}\]In words: The posterior mean is a weighted average of the means (weighted by precision) and posterior precision is just the sum of the precisions.
Why go through this exercise? Because it means that we can produce an exact, closed form solution for the distribution of “true Elo” given a rating.
Our model says that true Elo is normally distributed with mean 1630 and stdev of 290 and a player’s rating is their true Elo plus a normal distribution with 0 mean and 110 stdev. So, given a player’s rating (which is analogous to $z$ in $X + Y = z$ above), we can produce the PDF for their true Elo.
We just went through a bunch of math symbolically, which is a terrible way to gain intuition, so let’s use it in a concrete example. Let’s say we observe a rating of 2900. What is our posterior distribution for that player’s true Elo?
Plugging in numbers: $\mu_x = 1630$, $p_x = \frac{1}{\sigma_x^2} = \frac{1}{290^2}$, $p_y = \frac{1}{\sigma_y^2} = \frac{1}{110^2}$, so
\[[\textrm{true Elo}| \textrm{rating} = 2900] \sim \mathcal{N}(2740, 102)\]And as a graph:
Let’s sanity check this. The mean of the true Elo is a lot lower than 2900, which makes sense since true Elos are much more likely to be lower than 2900 in the “population”. The stdev is smaller than our initial stdev for true Elo (290), which makes sense since we’ve learned something by observing the rating. It’s also smaller than the noise in our observation (110) which honestly surprises me a little^{1}, but I’ve attempted to verify this with simulations and it seems to check out.
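As a sanity check in code (a sketch; the 1630/290/110 parameters are the fits from the earlier posts):

```python
import math

def posterior(mu_prior, sigma_prior, observation, sigma_noise):
    """Posterior N(mu, sigma) for X given the observation X + noise = z."""
    p_prior = 1 / sigma_prior**2  # precision of the prior
    p_obs = 1 / sigma_noise**2    # precision of the observation
    # Posterior mean is the precision-weighted average of the means;
    # posterior precision is the sum of the precisions.
    mu = (mu_prior * p_prior + observation * p_obs) / (p_prior + p_obs)
    sigma = math.sqrt(1 / (p_prior + p_obs))
    return mu, sigma

mu, sigma = posterior(1630, 290, 2900, 110)  # mu ~ 2740, sigma ~ 103
```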
Let’s take 2 players with a rating difference of 525. Say their respective ratings are 1500 and 2025. We now can compute the PDF of true Elo for each player:
\[\begin{align} \textrm{PDF(true Elo | rating = 1500)} \sim \mathcal{N}(1515, 104) \\ \textrm{PDF(true Elo | rating = 2025)} \sim \mathcal{N}(1981, 104) \end{align}\]And to compute the PDF of the true Elo difference, we just need to subtract the two normal distributions, which we also know how to do in closed form:
\[\textrm{PDF(true Elo difference)} \sim \mathcal{N}(466, 146)\]Last, but not least, we can compute the weighted average (integral) of the Elo formula given this PDF:
\[\begin{align} f(\textrm{Elo difference}) = \frac{1}{1 + 10^{\textrm{(-Elo difference)}/400}} \\ p(x) = \frac{1}{146\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-466}{146}\right)^{\!2}\,} \\ \int f(x) p(x) dx \approx 0.917 \end{align}\]Close enough?
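That last integral, computed numerically on a simple grid (a sketch; the step size and integration bounds are arbitrary choices):

```python
import math

def elo_formula(diff):
    return 1 / (1 + 10 ** (-diff / 400))

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

mu, sigma = 466, 146  # true Elo difference distribution from above
step = 1.0
n = int(12 * sigma / step)  # cover mu +/- 6 sigma
xs = [mu - 6 * sigma + i * step for i in range(n)]

# Riemann sum approximating the integral of f(x) * p(x) dx.
expected_score = sum(elo_formula(x) * normal_pdf(x, mu, sigma) * step for x in xs)
# expected_score ~ 0.917
```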
Having dug into this a bit, I now realize that the posterior stdev is always smaller than both the prior stdev and the observation stdev. One way to think about this is that the formulas don’t really care which is the prior and which is the observation - it’s symmetric. So, if it makes sense to me that the posterior stdev should be smaller than the prior stdev, then I can mentally use the same reasoning for the observation stdev. ↩
This post is largely based on these two great posts:
DNS is one of those things that I found unnecessarily mysterious and scary for too long. In retrospect, it feels pretty silly. Here’s my attempt to resolve the mystery of DNS.
DNS resolves domains like `vercel.com` into IP addresses like `76.76.21.21`.
To dissolve the mystery of DNS, we need to understand a bit about what a website is. A website is just a computer program that’s running somewhere and that, when asked, is willing to send back a bunch of HTML (probably along with some javascript and css). This is a really simple fact, but with all the modern fanciness of today’s web technologies (e.g. CDNs, “lambda functions”, “serverless”), I feel like it can get lost in the confusion.
The problem that DNS solves is: given a domain (e.g. `google.com`), how do I find the computer program that’s willing to send me the right HTML (and javascript/css)?
There are about 350 million different domains accessible via the internet (as of June 2022). Each one has (at least) one corresponding computer program associated with it. We need a way to find it.
If you spend any time trying to google what DNS is, you’ll run into a bunch of terms which you might not be familiar with.
- **IP Address**: something like `192.158.1.38` or `76.76.21.21`. Here’s a helpful analogy: an IP address is to a host what a mailing address is to a house. In both cases, the address is a way to locate the object in question (a host or a house). As with mailing addresses, IP addresses have a particular structure to them.
- **Domain**: something like `google.com` or `vercel.com`. Domains exist for a few reasons, but a big one is so that you don’t have to remember website URLs that look like `76.76.21.21`. You can type `vercel.com` into your browser instead and, let’s be honest, that’s a lot easier to remember. Another nice reason is that the location (IP address) of `vercel.com` might have to change from time to time. Maybe they used to run their web server on Heroku but later switched to an Amazon AWS host. When you change what host you use, the IP address changes (like how your mailing address changes when you move houses). It would be pretty painful if everyone had memorized the IP address of your Heroku host and then couldn’t find your website anymore when you moved to AWS. So, domain names serve as a nice level of indirection that insulates end-users from the nitty-gritty details of what hosts you’re using.

DNS resolves domains like `vercel.com` into IP addresses like `76.76.21.21`.
So simple! Unfortunately, not so fast. If you go buy a domain name and a host (on AWS, for example) and you go to configure DNS for your domain, you’ll see advice such as “You should use a CNAME record to point your www subdomain to your apex domain.” Uhhh… what?
Apex domain: e.g. acme.com. A single apex domain can have many subdomains associated with it.

Subdomain: e.g. docs. Other examples would be www in www.google.com or blog in blog.russelldmatt.com.

Why are subdomains useful? One reason is that it provides a way to organize your site. You can put some content in blog.russelldmatt.com and other content in shop.russelldmatt.com. But that’s not a great reason because you can provide organization in other ways, e.g. www.russelldmatt.com/blog vs. www.russelldmatt.com/shop.
The main reason (I think) is that you can point different subdomains to different IP addresses. That means shop.russelldmatt.com
can use a completely different web server than blog.russelldmatt.com
.
Here’s a great table describing the most common DNS record types.
To highlight the most important points:

An A record resolves a domain like vercel.com into an IP address like 76.76.21.21.

A CNAME record points a subdomain like www.vercel.com at your vercel.com apex domain. It works as an alias for domain names that share a single IP address.

At this point, I think you could actually go configure DNS for a newly purchased domain name and probably not get confused.
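To make the A-record/CNAME distinction concrete, here’s a toy model of the lookup logic in Python. This is just an illustration, not the real DNS protocol, and the record table is made up (reusing the example names and IP from above):

```python
# A toy model of DNS resolution (NOT the real protocol): an A record maps a
# name directly to an IP address, while a CNAME record aliases one name to
# another name, which we then resolve in turn.
records = {
    "vercel.com": ("A", "76.76.21.21"),
    "www.vercel.com": ("CNAME", "vercel.com"),
    "motivatingexamples.com": ("A", "76.76.21.21"),
    "www.motivatingexamples.com": ("CNAME", "motivatingexamples.com"),
}

def resolve(name: str) -> str:
    """Follow CNAME aliases until we hit an A record, then return its IP."""
    record_type, value = records[name]
    if record_type == "A":
        return value
    return resolve(value)  # CNAME: resolve the target name instead

print(resolve("www.vercel.com"))  # 76.76.21.21, via the CNAME alias
```

Note how the CNAME gives you exactly the indirection described earlier: if vercel.com ever moves to a new IP, only its A record changes, and www.vercel.com follows along for free.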
To put this into practice, let’s configure the domain motivatingexamples.com
(using GoDaddy) to point to a web server hosted by vercel
. We will follow the instructions so nicely laid out here.
Using vercel’s website, I first select the project that I want to use for motivatingexamples.com
and then click view domains
.
Next we add our domain motivatingexamples.com
to our project, like so:
Go to your GoDaddy account to manage your DNS. Navigate to your domain list. Select the domain you want to point to your vercel app.
In the Domain Settings, click on the Manage DNS link to configure your DNS. Configure an A record to point this domain at vercel’s IP address. Then, so that people can type www.motivatingexamples.com
and get to the right place, add a CNAME record that points the www
subdomain to the apex domain motivatingexamples.com
. Like so:
The second law of thermodynamics says that entropy always increases. But what is entropy?
I’ll admit, it took me a long time to understand entropy. I read lay explanations of entropy that said vague things like “entropy is a measure of disorder”, and it didn’t click. It wasn’t until I watched this video, which gave a precise definition of entropy, that I felt like I finally got it.
To understand entropy, you must first understand the difference between microstates and macrostates. Let’s define these concepts in the context of an example: a silverware drawer. This particular silverware drawer has three sections: one section for forks, one for spoons, and one for knives. In addition, there are 5 forks, 5 spoons, and 5 knives in the silverware drawer.
A microstate is a full description of the system. In our example, it tells you exactly where every single fork, spoon, and knife is in the drawer. Here’s an example microstate: there are 5 forks and 3 spoons in the fork section, 3 knives in the spoon section, and 2 spoons and 2 knives in the knife section. Here’s another microstate: all the utensils are in the spoon section.
How many microstates are possible in our silverware drawer? We have 15 utensils, each of which can go in any one of the three sections, so \(3^{15}\) microstates.
A macrostate is a higher level description of a system. For example, the drawer could be “messy” or “organized”. The key is that more than one microstate can have a given macrostate. Let’s consider the macrostate of the silverware drawer being “messy” and let’s be precise about what “messy” means. Let’s say the drawer is messy if more than one utensil is in the wrong section. How many microstates would be classified as “messy”? Since this isn’t a post about combinatorics, I’ll just tell you: all but \(31\). That means \(3^{15} - 31\) microstates would be considered “messy” (according to my totally arbitrary definition).
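If you do want to check that count, here’s a quick Python sketch (my own sanity check, not part of the original argument): it derives the “not messy” count with a simple formula, verifies the formula by brute force on a smaller drawer, and then applies it to our 15-utensil drawer:

```python
from itertools import product

def count_not_messy(n_utensils, n_sections=3):
    # "Not messy" = at most one utensil in the wrong section:
    # 1 way for everything correct, plus n_utensils choices of which
    # utensil is misplaced times (n_sections - 1) wrong sections for it.
    return 1 + n_utensils * (n_sections - 1)

# Brute-force check on a smaller drawer: 6 utensils (2 forks, 2 spoons,
# 2 knives), where home[i] is the correct section for utensil i.
home = [0, 0, 1, 1, 2, 2]
brute = sum(
    1
    for placement in product(range(3), repeat=6)
    if sum(p != h for p, h in zip(placement, home)) <= 1
)
print(brute, count_not_messy(6))  # both 13

print(count_not_messy(15))  # 31 "not messy" microstates for our drawer
print(3**15)                # 14348907 microstates in total
```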
With that background, we can introduce the formula for the entropy of a given macrostate:
\[S = k_B \ln W\]
where \(S\) is the entropy of the macrostate, \(k_B\) is a constant, and \(W\) is the number of microstates that have that particular macrostate. I don’t particularly care about the constant, just the fact that entropy is proportional to the log of \(W\).
So which has higher entropy, the macrostate of the silverware drawer being “messy” or “not messy”? Clearly messy. There are way more microstates that are messy than not. The entropy of the “messy” state is about \(\ln(3^{15}-31) \approx 16.5\), while the entropy of the “not messy” state is about \(\ln(31) \approx 3.4\) (ignoring the constant).
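You can verify these entropy numbers in a couple of lines of Python (ignoring the constant, as above):

```python
from math import log

total = 3**15      # every way to place 15 utensils into 3 sections
not_messy = 31     # microstates with at most one misplaced utensil
messy = total - not_messy

# Entropy with the constant dropped: S = ln(W)
print(round(log(messy), 1))      # ~16.5
print(round(log(not_messy), 1))  # ~3.4
```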
So when lay explanations colloquially say that “entropy is a measure of disorder”, it’s because – more often than not – there are more ways for a system to be “disorderly” or “messy” than “organized” or “neat”.
Let’s work through another really commonly used example in entropy explanations. Let’s consider a box full of gas particles. The question is: which macrostate has higher entropy, the gas particles being all on one side of the box or the gas particles being generally evenly spread out throughout the box?
As usual, let’s try to make this precise (and tractable). Let’s say there are 100 gas particles in the box and we are going to keep track of how many particles are on the left and right halves of the box. Let’s say that the particles are “well mixed” if there are at least 40 particles on each side.
First off, how many microstates are there? We have 100 particles, and for each one we’re keeping track of which side it’s on, so that’s \(2^{100}\) microstates.
How many microstates are “well mixed”? Again, not a combinatorics lesson, but using the binomial distribution we can figure out that the probability that either side has fewer than 40 particles is about 3.5%. So, the macrostate of “well mixed” has about 27 times the number of microstates as the macrostate of “not well mixed”. That means that “well mixed” has higher entropy – in fact, it has \(\ln(27) \approx 3.3\) more entropy than “not well mixed”. Honestly, the numbers don’t really matter. It’s more important to understand that more microstates mean higher entropy.
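Here’s a short Python check of those numbers (not from the original explanation, just a sanity check using the binomial counts directly):

```python
from math import comb, log

n = 100
total = 2**n  # each particle is on the left or the right

# "Well mixed": at least 40 particles on each side, i.e. 40..60 on the left.
well_mixed = sum(comb(n, k) for k in range(40, 61))
not_well_mixed = total - well_mixed

print(round(100 * not_well_mixed / total, 1))     # roughly 3.5 (percent)
print(round(well_mixed / not_well_mixed))         # roughly 27x as many microstates
print(round(log(well_mixed / not_well_mixed), 1)) # roughly 3.3 more entropy
```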
Now that we understand what entropy is, let’s revisit the second law of thermodynamics: entropy always increases.
Entropy increases because there are more ways to be in a high entropy state than a low entropy state. Put that way, it sounds so obvious that it’s almost a tautology. Higher probability things happen with… higher probability? Yeah… sure, I guess.
I’m probably glossing over some important details. And I don’t claim that we’ve actually fully explained the second law of thermodynamics. In particular, a system must be able to evolve over time in order for entropy to increase. And the way a system evolves is usually constrained.
For example, consider the gas particles in a box. If the system evolved by picking a random microstate from one moment to the next, then entropy would increase by construction. You’d necessarily jump to higher probability macrostates with… higher probability. But that’s not how the system evolves. The gas particles are constrained to move to a location close to their current position. However, if you wait for long enough, modeling the particles as jumping to a randomly chosen location within the box is actually probably a decent model since gases are so chaotic. So, given enough time, entropy will very likely increase.
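To see the “given enough time” part in action, here’s a tiny simulation. It uses a deliberately crude model of the constrained dynamics (each step, one randomly chosen particle hops to the other side), so treat it as an illustration rather than real physics. Starting from the very low-entropy state of all 100 particles on the left, the box almost surely drifts toward the high-entropy “well mixed” macrostate:

```python
import random

random.seed(0)  # make the run reproducible

n = 100
left = n  # start in a very low-entropy macrostate: everything on the left

# Crude dynamics: each step, pick one particle uniformly at random and move
# it to the other side. The chosen particle is on the left with
# probability left/n.
for _ in range(5000):
    if random.random() < left / n:
        left -= 1  # the chosen particle hops from left to right
    else:
        left += 1  # the chosen particle hops from right to left

print(left)  # very likely somewhere near 50, i.e. "well mixed"
```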
Oh, and that leads me to my last bone to pick with lay explanations of entropy. Saying that entropy always increases is misleading! It implies, at least to me, that it’s some physical law that is always obeyed, like gravity. Instead, we should say that entropy very (very) likely increases, which conveys the fact that this is a probabilistic statement that relies on probabilistic arguments.
Anyways, I hope that clears up any confusion about what entropy is for you. And if not, I highly recommend watching this video.
I try to write a summary of what I learned from a book soon after I finish reading it so that I can remember it better. For this particular one, I guess I don’t really need to because the author James Clear did it himself. But I’m going to anyways because I’ll remember it better if it’s in my own words.
My personal takeaways are as follows:
Here are the really useful cheat sheets that James Clear put together:
$ brew install ngrok/ngrok/ngrok
Sign up (with Github) for a free account: https://ngrok.com/
Then, follow the instructions for step 2. Connect your account on this page: https://dashboard.ngrok.com/get-started/setup
In particular, add your authtoken:
$ ngrok config add-authtoken [LONG PRIVATE-KEY-ISH THING]
I removed the long private-key-looking thing that it gave me in case it is indeed a private key.
Serve!
$ ngrok http 5175
ngrok (Ctrl+C to quit)
Check which logged users are accessing your tunnels in real time https://ngrok.com/s/app-users
Session Status online
Account Matt Russell (Plan: Free)
Version 3.1.0
Region United States (us)
Latency -
Web Interface http://127.0.0.1:4040
Forwarding https://2f23-2601-640-8581-24b0-7828-89e1-c9b5-280a.ngrok.io -> http://localhost:5175
Connections ttl opn rt1 rt5 p50 p90
0 0 0.00 0.00 0.00 0.00
Send yourself a text message with the crazy long URL https://2f23-2601-640-8581-24b0-7828-89e1-c9b5-280a.ngrok.io (your URL will be different).
Click on the blue “visit site” button:
And you’re done! Now you can see how absolutely terrible your website looks on a phone before you deploy it.
But also… isn’t that what programming is all about at some level? Not having to understand how everything works is how stuff actually gets done. What if “Intro to Python” started with: Let’s learn about how transistors work. Then we can build logic gates and eventually CPUs. After that we’ll build an OS, and then a compiler. And maybe in a few years we’ll be at the point where we actually know what happens when you write
print("hello world")
As a brief tangent, the steps I outlined above loosely follow the syllabus of an amazing book/course Nand to Tetris that I highly recommend. But even that needs to start somewhere! And it starts by assuming the existence of a nand (not and) gate. How does that work? I dunno, physics.
I think another word for magic is abstraction. Abstraction is not understanding how something works, but still being able to use it. It’s like blurring your eyes a bit and saying: Ok, I don’t totally understand how this works behind the scenes, but I see it has a green button to start it and a red button to stop it and that’s enough for me right now.
But abstraction is good and magic is bad. Why?
At some level, it’s definitional. I think one could define magic as “leaky abstractions”. A leaky abstraction is one where you kinda do need to understand how it works. In other words, it’s an abstraction that’s bad at its only job: letting you not understand the details. Leaky abstractions leak some details that are supposed to be abstracted away.
Anyways, I’ve been thinking about all this in the context of a new UI framework that I’ve been learning called svelte. Overall, it’s kind of amazing. It does so much for you and makes everything so easy. It’s like magic.
However, that magic came back to bite me yesterday. I wrote some code and it didn’t work as I expected. The details aren’t really that interesting, but if you care, I was trying to access the value of a svelte store using the $
syntax and my store had a value of undefined
, even though I had provided an initial value. So I did what any programmer does: I tweaked the code in a myriad of ways to see which versions worked and which versions didn’t to give me some intuition about what was going on behind the scenes. Already, this is sounding a bit like the bad kind of magic. I had something that normally “just works” and all of a sudden it didn’t, and now I need to understand the details.
Eventually I whittled it down to a minimal reproduction where I had two versions of the code: one that worked and one that didn’t. And the only difference between the two versions was whether the <script>
tag had lang="ts"
or not! 🤯
This is the bad kind of magic. The kind of magic where if you look at it the wrong way, it stops working. This kind of magic is how developers become superstitious. You can’t do X
. Why? I’m not sure, but when I do X
my code stops working and I’ve never been able to fully understand why.
This is particularly ironic because overall I’m quite enamored by svelte and so I’ve mentioned that I’m learning it to a few people. And one of their responses was effectively: Yeah, it does look really cool. It’s just a bit too much magic for me. Up until yesterday, I hadn’t felt that. Now I have.
I’ll end with two important caveats:
I’m new to svelte, and there is very possibly a good explanation for the crazy behavior that I’m seeing. The problem might be between the keyboard and chair.
I’m still really excited about svelte! Everything has a few sharp corners and I’m not going to give up on it just because I ran into one.