Maybe obvious advice from a book about sibling rivalry, but good nonetheless. The basic idea is to describe the goodness or badness of what one child does without any reference to the other child. If they do something wrong, don’t say things like “Your brother would never do that”. And if they do something good but then put themselves down saying something like “well, it’s still not as good as Michael”, say “We’re not talking about Michael, we’re talking about you. I’m proud of you!”. Make it just about them.
Similarly, when spending quality alone time with one child, make it just about them.
Don’t frame everything as “relative to your sibling” because that framing inevitably leads to competition, rivalry, and negative feelings.
Quote from the book:
- Insisting upon good feelings between the children led to bad feelings.
- Acknowledging bad feelings between the children led to good feelings.
Many parents found that instead of “trying to force their children to get along”, acknowledging that certain parts of having brothers or sisters are frustrating was a better strategy. This feels very related to the next section (Make it clear you understand how your child feels).
Vocalize/narrate their feelings for them. Say things like: “Wow, you’re really mad aren’t you? You really didn’t like when Bobby grabbed that toy.” This also allows you to:
This was probably the largest recurring theme in the book. Within any given conflict resolution strategy, this seemed like the “meta strategy”. No matter what specific thing was going wrong, narrating your children’s emotions was part of the solution.
I strongly agree with this, although it’s hard to do in practice. But alone time with 1 kid is often so much more special and quality than trying to get through the day with all 3 (or 4)!
This wasn’t a huge part of the book, but in the last chapter there were a few anecdotes about families who scheduled “official family meetings” where they “invited” each family member to attend to talk about things like
and it seemed to foster good feelings about the family. This feels in-line with being open about/acknowledging feelings.
We’ve never done this, but I can see it being a good idea.
The main idea is not to put anyone in a box. Don’t, for example, say this is the serious one and this is the silly one, or this is the athletic one and this is the smart one, etc.
And don’t just not say the words - actually mean it. Have high expectations for kids who are struggling in school, for example. They will react to your expectations for them.
Just because one of your children is more studious than another one doesn’t mean the other kid isn’t studious in any absolute sense. In another family, they might be the studious one! Let every kid express every part of themselves (to the extent they want) and not feel like “well I shouldn’t even try to be _ because my brother is the _ one”.
I think this is pretty important advice for us; I can see us struggling with this.
This post is attempting to answer one of the questions at the end of Let’s build an operator which introduces the idea of hyperoperations. It’s not strictly necessary to read that post first, but I’d recommend it.
Repeated addition turns into multiplication. Repeated multiplication turns into exponentiation. What does repeated exponentiation turn into? Well, that’s tetration.
So here’s the main question of this post: What is $\tetration{3}{\frac{1}{2}}$? As in, I’d like you to make an exponential tower of 3’s. How many 3s in the tower? $\frac{1}{2}$ of one.
Nonsense! is your first reaction. But then you remember that you can do that for addition and multiplication. Sure, but adding three “one half times” is totally fine - that’s just $\frac{3}{2}$. And multiplying by three “one half times” is… oh right, that’s just $\sqrt{3}$.
For this post I’m going to ask you to do something that’s honestly pretty hard. Try your best to forget that you know about square roots. Try to put yourself into the head of someone who knows how to multiply, but has not yet learned about square roots. Ok ready?
Someone asks you to multiply 3 by itself 4 times, no problem. I can write that out as $3 \cdot 3 \cdot 3 \cdot 3$ and work it out. Nothing complicated about that. Now someone asks you to multiply 3 by itself 1 time. Also easy. Now someone asks you to multiply 3 by itself $\frac{1}{2}$ times. Huh?
Like honestly, this is pretty weird (and stepping out of character for a second: just because we’ve given a name to this idea “square root” doesn’t make it any less weird). What does the question even mean?
Well, it’s not obvious, but one way to make progress is by noticing the following: If I multiply $3 \cdot 3$ and then multiply that by $3 \cdot 3$, that’s the same as multiplying 3 by itself 4 times (since $2 + 2 = 4$): $3 \cdot 3 \cdot 3 \cdot 3$. So maybe whatever “3 multiplied by itself $\frac{1}{2}$ times” is should have the property that if you multiply it by itself, you get “3 multiplied by itself 1 time” (since $\frac{1}{2} + \frac{1}{2} = 1$). And 3 multiplied by itself 1 time is just 3.
That’s something I can work with. Now I’m just looking for some other number that, when multiplied by itself produces 3. Using guess and check, I find that it’s something around 1.7.
Armed with that line of reasoning, what if someone asks “What is 3 multiplied by itself $\frac{1}{3}$ times?”. I’d find some number $x$ such that $x \cdot x \cdot x = 3$.
Before moving forward to tetration, let’s actually go backwards to addition. Does this sort of thinking work for addition? If you want to add 3 to itself $\frac{1}{2}$ times, you need to find some other number $x$ such that $x + x = 3$. It works!
Now back to our original question: What is 3 raised to itself $\frac{1}{2}$ times? Well, I think I just need to find some number $x$ such that $x^x = 3$. I’m not sure how to do that efficiently, but using a computer and guess and check I find that $x \approx 1.82$.
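If you’d rather let a computer do the guess and check, here’s a small sketch (my own addition, not from the post) that narrows in on the answer with bisection, which works because $x^x$ is increasing on this interval:

```python
# Solve x^x = 3 by bisection ("guess and check", automated).
def solve_self_power(target, lo=1.0, hi=3.0, iterations=60):
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if mid ** mid < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

x = solve_self_power(3)
print(x)  # ≈ 1.8255, matching the "about 1.82" above
```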
Success! We did it. You made it seem like this was going to be hard.
Well, first off, it was hard! Realizing how to extend what we already know with addition and multiplication took some real creativity. But yes - unfortunately we’re about to see some problems.
Back to addition: We found that $3 \cdot \frac{1}{2} = 1.5$ because $1.5 + 1.5 = 3$. That was the most natural way to think about it, but we could have used this line of reasoning: I need to find $x$ such that $x + x + x + x = 3 + 3$. Right? 4 halves is 2, so adding 4 of these weird “half 3s” should be the same as adding 2 3s. Luckily for us, if we solve for $x$, we still get $1.5$.
Now let’s check multiplication: We found that $3^{\frac{1}{2}} \approx 1.7$ because $1.7 \cdot 1.7 \approx 3$. That was the most natural way to think about it, but we could have used this line of reasoning: I need to find $x$ such that $x \cdot x \cdot x \cdot x = 3 \cdot 3$. Right? 4 halves is 2, so multiplying 4 of these weird “half 3s” should be the same as multiplying 2 3s. Luckily for us, if we solve for $x$, we still get about $1.7$.
Now let’s check tetration: Instead of finding $x$ such that $x^x = 3$ (which was about $1.82$), let’s find $x$ such that $x^{(x^{(x^x)})} = 3^3$. Unfortunately, we get something different: $x \approx 1.806$.
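Here’s a sketch of both computations (my own code, reusing the bisection idea, which works because both towers are increasing in $x$ on this interval):

```python
# Compare the solution of x^x = 3 with the solution of x^(x^(x^x)) = 3^3.
def bisect(f, target, lo=1.0, hi=2.0, iterations=60):
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

x1 = bisect(lambda x: x ** x, 3)                  # half a tower via x^x = 3
x4 = bisect(lambda x: x ** (x ** (x ** x)), 27)   # via a 4-tower = 3^3
print(x1, x4)  # ≈ 1.8255 vs ≈ 1.806: not the same
```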
Why did this “just work” for repeated addition and multiplication but not exponentiation? I think the reason is because addition and multiplication are associative but exponentiation is not. Here is what I mean:
\[\begin{aligned} a + (a + (a + a)) &= (a + a) + (a + a) \\ a \cdot (a \cdot (a \cdot a)) &= (a \cdot a) \cdot (a \cdot a) \\ a ^ {(a ^ {(a ^ a)})} &\neq (a ^ a) ^ {(a ^ a)} \end{aligned}\]What this means is that the solution to $x^x = 3$ is not necessarily the same as the solution to $x^{x^{x^x}} = 3^3$.
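A concrete check with $a = 2$ (my own example) makes the failure of associativity vivid:

```python
a = 2
left = a ** (a ** (a ** a))    # 2^(2^(2^2)) = 2^16
right = (a ** a) ** (a ** a)   # (2^2)^(2^2) = 4^4
print(left, right)  # 65536 vs 256: not equal
# whereas the addition and multiplication versions do agree:
print(a + (a + (a + a)) == (a + a) + (a + a))  # True
print(a * (a * (a * a)) == (a * a) * (a * a))  # True
```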
I’m not sure and a quick google suggests maybe there is no consistent way to extend tetration to work with non-integer “exponents” (for lack of a better word). Maybe you can think of another way to define $\tetration{3}{\frac{1}{2}}$ that generalizes in a way that my version didn’t.
Even though this attempt failed, I learned a lot and found the entire process quite satisfying. At the risk of sounding overly preachy: I find this sort of exploratory work to be such a beautiful side of math that we don’t really get to experience growing up. In particular, taking an idea that is familiar and extending it into a domain that feels unnatural is a great way to generate interesting open-ended problems. My first introduction to this way of thinking was in the book Flatland and it’s still my favorite book to this day.
Earlier I said “4 halves is 2”. Of course that’s true. But… what if it wasn’t? We’ve discovered that the reason it’s true is that addition is associative.
\[\begin{aligned} \frac{1}{2} + (\frac{1}{2} + (\frac{1}{2} + \frac{1}{2})) = 2 \\ (\frac{1}{2} + \frac{1}{2}) + (\frac{1}{2} + \frac{1}{2}) = 2 \\ 1 + 1 = 2 \\ \end{aligned}\]But what if addition wasn’t associative? Then:
\[\begin{aligned} \frac{1}{2} + (\frac{1}{2} + (\frac{1}{2} + \frac{1}{2})) \neq (\frac{1}{2} + \frac{1}{2}) + (\frac{1}{2} + \frac{1}{2}) \end{aligned}\]I think the implication of this is that we would not be able to simplify fractions:
\[\begin{aligned} \frac{4}{2} \neq \frac{2}{1} \end{aligned}\]It sounds impractical and strange to not be able to simplify fractions. Each rational number would be like its own little island and it would be harder to relate one to another, but… maybe it could work? I think that’s the sort of world we live in when it comes to tetration:
\[\begin{aligned} \tetration{3}{\frac{2}{4}} \neq \tetration{3}{\frac{1}{2}} \end{aligned}\]

In this post we’re going to build an operator together. Actually, not just one operator, but a whole family of operators.
What do I mean by “an operator”? I mean something like \(+\), \(\cdot\), or \(-\). Also, all the operators we’re going to build are “binary” operators meaning they take 2 numbers as inputs and produce 1 number as an output (again, like plus, minus, etc). Actually the first one doesn’t really need to be a binary operator, but we’re going to make it one just for consistency.
Oh, and one more thing before we begin. This, like any math exposition, will be much more interesting if you take an active role. Don’t just read this from top to bottom. Pause and try to predict the next step for yourself. Work out the examples for yourself before looking at my solutions.
Ok let’s get started! Here’s our first “binary” operator:
In case this is confusing, let me explain.
The symbol I’m using for my operator is a big 𝙾. Why that one? I dunno, why is plus a cross? Why is multiplication sometimes a star and sometimes a dot (and sometimes just omitted)? I just wanted to pick a symbol that wasn’t already used and I picked a big 𝙾.
Ok, so our operator takes 2 numbers, which I’ve denoted as \(a\) and \(n\) and just returns \(n+1\). Yes, it completely ignores the first number and does nothing with it. That’s why I said before that this first operator is kind of a fake “binary” operator (it’s really a unary operator that only needs 1 number as input).
Sanity check: let’s make sure you understand this operator. Can you solve these exercises?
What’s next? Well, we’re going to define our next operator in terms of the last one (and itself).
This one is more confusing, so bear with me.
The symbol I’m using for this operator is two big 𝙾s. This one really is a binary operator and takes two numbers as inputs which (again) I’ve denoted as \(a\) and \(n\).
If \(n = 0\), this operator returns \(a\). If \(n > 0\), then this operator is defined recursively in terms of itself and the operator we started with (the one with just one 𝙾).
I suspect working through an example will be much more helpful for this one.
Did you get it? If not, that’s totally OK – I grant that it probably wasn’t super obvious what I was going for yet – but from now on I think it’s extra important to work through the examples yourself first.
I went through that example in meticulous detail, step by step, but all those brackets make things pretty hard to read so, if you don’t mind, I’m going to write things without them in the future – like this:
Just remember that when you evaluate, you always evaluate from right to left.
Now that hopefully you know what I’m asking, here are two more examples for you to work out on your own:
Now that you have a couple examples under your belt, do you have a guess at what this operator is, in disguise?
(it’s plus!)
Let’s move to the next one. The definition for the next operator looks almost identical to the last:
I’m hoping you’re getting the hang of things by now so I’m going to write fewer words and jump right to the exercises:
At this point (just like last time), it’s useful to try to guess what this operator is in disguise. Try more examples if you have to.
Here’s my take:
(it’s multiplication!)
Before we even begin, can you guess how we will define the next operator? I bet you can get pretty close, although I suspect you might get the \(a𝙾𝙾𝙾0\) wrong.
And before even working out any examples, can you guess what this operator will be in disguise?
Either way, here’s your exercise:
Let’s take a second and realize what we’ve done. We’ve defined addition in terms of +1 (the successor operator). We’ve then defined multiplication in terms of addition. And lastly we’ve defined exponentiation in terms of multiplication. It’s not groundbreaking, but it’s pretty cool!
Here’s a recap in picture form:
At this point, I think the “real math” can start! It’s time for me to stop telling you exactly what to do and time for you to start exploring. But in case it’s helpful, here are some questions that come to mind – at least for me. And I’ll probably explore some of them in future posts.
Have fun!
Unsurprisingly, the idea of building up operators recursively is not my idea. These are called Hyperoperations. Apparently, this formulation was first written down by Reuben Goodstein and you’ll find a version of it at the bottom of page 7.
If you want to turn this idea into a puzzle for your friends, you can use Goodstein’s function (written in python) and just ask them to explain what this function does:
```python
def g(k, a, n):
    if k == 1:
        return n + 1
    elif k == 2 and n == 0:
        return a
    elif k == 3 and n == 0:
        return 0
    elif n == 0:
        return 1
    else:
        return g(k - 1, a, g(k, a, n - 1))
```
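To spot-check that the function really hides the familiar operators, you can run a few cases (these checks are my own addition; the function is repeated here so the snippet runs on its own):

```python
# Goodstein's function: k selects the operator in the hyperoperation tower.
def g(k, a, n):
    if k == 1:
        return n + 1
    elif k == 2 and n == 0:
        return a
    elif k == 3 and n == 0:
        return 0
    elif n == 0:
        return 1
    else:
        return g(k - 1, a, g(k, a, n - 1))

print(g(2, 5, 3))  # 8: k=2 is addition
print(g(3, 5, 3))  # 15: k=3 is multiplication
print(g(4, 2, 3))  # 8: k=4 is exponentiation
print(g(5, 2, 3))  # 16: k=5 is tetration, 2^(2^2)
```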
Let’s play a game. I’m going to draw two points on an x-y plane and give you a rule which you can use to generate new points. You can generate as many new points as you want using my rule. If you connect all the points, you get a function. Your goal is to guess the function. Ok? Let’s go.
Ok, I’ve drawn two points, \(a\) and \(b\). From these two points, you can generate a new third point \(c\), which I’ve drawn in purple. The x coordinate of \(c\) is the product of the x coordinates of \(a\) and \(b\). The y coordinate of \(c\) is the sum of the y coordinates of \(a\) and \(b\). That’s the rule.
To restate, here is my rule:
From any 2 points, \(a\) and \(b\), you can generate a 3rd point \(c\) where \(c_x = a_x \cdot b_x\) and \(c_y = a_y + b_y\).
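If you want to play with the rule programmatically, here’s a minimal sketch (the function name and starting coordinates are my own choices, not from the post):

```python
# Generate a new point from two points using the rule:
# c_x = a_x * b_x, c_y = a_y + b_y.
def combine(a, b):
    return (a[0] * b[0], a[1] + b[1])

a = (2, 1)  # an arbitrary starting point
b = (4, 2)  # another one
c = combine(a, b)
print(c)  # (8, 3)
# keep combining points (old and new) to generate as many as you like
```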
So… what’s the function? I’m looking for an answer of the form:
If you think you have the answer, here are some follow-up questions:
I believe pretty strongly in evolution. I’m quite skeptical of people who argue against it. But if you asked me to lay out the strongest case for evolution, until recently… I’d be pretty hard-pressed? I’d probably mumble something about fossils and then get kind of embarrassed.
If I can’t make the case, then why do I believe it so strongly? I think the honest answer is that I find the people who argue for evolution much more credible than the people who argue against it. It seems pretty clear to me that evolution is a widely accepted scientific consensus, and I trust scientific consensus unless I have a really really good reason not to. And yes, I’m sure you can find a good scientist who rejects it – but you can find good scientists with almost any view you want if you look hard enough. Knowing why X is true vs. thinking X is true because a smart person you know believes X is what I’d call “first order knowledge” vs. “second order knowledge”. I think good second order knowledge is a totally valid reason to believe something. Second order knowledge is kind of fascinating in its own right, but let’s leave that for another post.
Historically that’s why I’ve believed in evolution and I think that’s a fine reason, but… I kind of want to be able to make the case myself? I want to be able to make the case for evolution from first principles.
So I read The Greatest Show on Earth in which Dawkins argues that evolution is undeniable (although it’s denied every day, a fact which clearly irritates him to no end). Here’s my attempt at summarizing his 500 page book into a very short blog post so that my future forgetful self can remind himself of the main argument from time to time.
You only need to accept two very plausible sounding premises for evolution to become almost a logical inevitability.
If there is variation among individuals within a species, then surely that variation will have some impact, however small, on the likelihood of reproduction. If the variation is heritable, then the variation which improves the likelihood of reproduction will be passed on more than the variation that decreases the likelihood of reproduction.
If you let this process iterate for a really really long time – an unfathomably long time – you’ll end up with things as different as birds, sharks, and humans all from a common ancestor.
I think the best evidence is that we can directly observe both these premises on human timescales. Consider selective dog breeding, or banana farmers selecting for larger bananas. In a few hundred years you can end up with a drastically different gene pool for dogs or bananas. In those examples, one can observe both the fact that variation is introduced from generation to generation, and that the variation is heritable via selective breeding.
In the cases where we can directly inspect DNA, we can observe mutations from one generation to the next (presumably random, although that seems hard to prove). And we can observe that those mutations are decently likely to be inherited by the next generation.
In order to evolve from a single celled organism to a human you need countless mutations, each of which is incredibly unlikely. If you multiply those tiny probabilities together, the probability of this happening is essentially zero.
First off, the timescale of evolution is hard to fathom.
Second, saying the probability of our particular sequence of mutations is essentially zero feels similar (to me) to drawing a number between 1 and a billion, getting 487223, and saying the probability of getting that particular number was essentially zero. Any time you observe one particular outcome out of an immense set of possibilities, the probability of that particular observation a priori will be very small, but something had to happen. If you draw a number between 1 and a billion, you will - with 100% probability - observe an event which had only a 1 in a billion chance of happening. There’s a sort of meta perspective which makes it unsurprising to observe such a “surprising” event.
Our particular sequence of mutations was exceedingly unlikely – granted! – but it didn’t have to happen that way. This is made clear by the fact that apes went through a different sequence of mutations, as did sharks, as did mushrooms.
If both humans and whales evolved from the same common ancestor, where are the whale-people? Or where are the whale-people fossils? How come you can’t find any?
This is just a straight up misunderstanding. Evolution does not predict that there will be species A and B and then every variation in between. What it predicts is that there will be some common ancestor of A and B that lived probably a super long time ago that also probably doesn’t look anything like either A or B does today!
For A and B consider humans and whales. According to Google, the last common ancestor of humans and whales was a small land-dwelling shrew-like creature. That’s nothing like either a human or a whale. It’s not that humans evolved from whales or that whales evolved from humans, it’s that we both evolved from shrews.
So you shouldn’t expect to see - even in the fossil record - some whale-human fossil. You might expect to find something that’s kind of between a whale and a shrew, and something else kind of between a human and a shrew, but even that is probably very oversimplified.
Biologists like classifying stuff. Animals are in a different “kingdom” than plants or fungi. Within the animal kingdom, animal species are separated into different “phyla”, e.g. spiders are in the “arthropod” phylum and tigers are in the “chordata” phylum.
So what phylum was the most recent common ancestor of spiders and tigers in? It’s really hard to say. It’s probably not even well defined.
I suspect this is NOT true, but let’s say the most recent common ancestor of tigers and spiders looked almost identical to a spider. Let’s just say it was a spider, species-wise. You trace the lineage between that common ancestor and some particular tiger today. At what point did the species change from spider to tiger? More realistically, that lineage went through many different species, so let’s just ask: what was the first common ancestor that wasn’t a spider?
Isn’t the definition of species “a group of organisms that can reproduce naturally with one another”? Do you think there was ever a particular generation which was so different from the last that it couldn’t interbreed? I suspect not. So… that means that we can create a chain from spider to tiger where every link in the chain should be considered the same species as the last, and yet… we end up with a different species?
The answer is that all this is fuzzier and more continuous than biology class might sometimes make it seem. Classifications are useful in practice but the idea that you can cleanly divide all living things (including common ancestors) into different groups based on the ability to interbreed is just false.
Dawkins likes the analogy of sculpting, because a sculptor – unlike other artists – is mostly in the business of subtracting. A sculptor starts with a solid block of marble and gradually cuts away pieces until a statue remains.
Evolution uses a similar process. It gradually cuts away genes that are not advantageous (via making reproduction less likely) from the gene pool. Any given animal/gene is not directly affected by evolution, but by genes being either more or less likely to be passed on via reproduction, evolution sculpts the gene pool over generations.
Dawkins described a pretty simple way to conceptualize evolution in terms of a simple computer program.
Let’s assume we start with a population that has a uniform distribution over some attribute - say height. At each time step, we simulate reproduction (asexual, to keep things really simple) and we give a slight reproductive advantage to members of the population with height above some arbitrary threshold. Here is one possible outcome of such a simulation:
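Here’s a minimal sketch of that kind of simulation (all the parameter choices are my own; the book doesn’t give actual code):

```python
import random

# Asexual reproduction with a slight advantage for individuals above a
# height threshold. Heights start uniform; the distribution shifts over time.
random.seed(0)

POP = 1000
THRESHOLD = 1.7
ADVANTAGE = 1.1  # above-threshold individuals reproduce 10% more often

population = [random.uniform(1.4, 2.0) for _ in range(POP)]

def next_generation(pop):
    weights = [ADVANTAGE if h > THRESHOLD else 1.0 for h in pop]
    # each child inherits a parent's height plus a tiny random mutation
    parents = random.choices(pop, weights=weights, k=POP)
    return [p + random.gauss(0, 0.01) for p in parents]

for generation in range(200):
    population = next_generation(population)

mean_height = sum(population) / POP
print(mean_height)  # the mean has drifted above the 1.7 threshold
```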
Matter is made of atoms. Atoms come in different flavors, which we call elements. The number of protons is fixed for any given element, and is equal to the number of electrons. So, for example, carbon atoms have 6 protons and electrons while nitrogen atoms have 7.
The number of neutrons, however, is not fixed. Carbon atoms can have different numbers of neutrons. These different variations are called “isotopes”. The most common three isotopes of carbon are carbon-12 (98.9%), carbon-13 (1.1%), and carbon-14 (<0.01%). These isotopes have 6, 7, and 8 neutrons respectively. The number associated with each isotope is the “mass number” which equals the number of protons + neutrons – electrons are so light they don’t contribute to the mass number.
Certain isotopes are unstable, by which I mean they spontaneously decay into something else, at a predictable rate. The predictability of the rate of decay is key.
The decay tends to happen in one of three ways:
For example, potassium-40 decays into argon-40. The favored measure of decay rate is called the “half-life”, which is the amount of time it takes for half of the potassium-40 atoms to decay into argon-40 atoms. For this particular pairing, the half-life is 1.26 billion years.
Why does this help? Well, if you happen to know that at some moment in time in the past, let’s call it time X, there was only potassium-40 and no argon-40, then by measuring the ratio of potassium-40 to argon-40 now, you can compute the amount of time that has passed since X. It’s important to note that this only works if you know that, at time X, there was no argon-40. In this case, igneous rocks are solidified from molten rock (magma or lava) and at the moment of solidification, the rock contains potassium-40 but no argon-40.
So let’s say you take an igneous rock and measure what fraction of the original potassium-40 remains (that is, potassium-40 divided by the sum of potassium-40 and argon-40) and it’s 0.5. That means half the potassium-40 has decayed into argon-40. Since we know the half-life of potassium-40 is 1.26 billion years, that means the rock was formed about 1.26 billion years ago. What if the fraction was 0.25? That means the amount of potassium-40 has halved twice, so it’s been 2.52 billion years since that particular rock was formed. Hopefully it’s clear how this generalizes.
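The generalization can be written in a couple of lines (a sketch of mine; the constant is the potassium-40 half-life from above):

```python
import math

HALF_LIFE = 1.26e9  # years, for potassium-40 decaying to argon-40

def age(fraction_remaining):
    # fraction_remaining = (1/2) ** (age / HALF_LIFE), solved for age
    return HALF_LIFE * math.log2(1 / fraction_remaining)

print(age(0.5))   # 1.26 billion years (one half-life)
print(age(0.25))  # 2.52 billion years (two half-lives)
```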
And here we come to the point of this post. It seems to me that this real-world scenario can motivate a host of interesting math questions, across a surprisingly broad range of difficulty levels.
For elementary school, you can set up the ratios to form an integer number of half-lives:
For high school, you can make the ratios whatever you want, which requires logs.
You can also get into “experiment design” or “thinking like a scientist”:
For college, you can start to introduce differential equations.
You can make this really hard:
Is it harder if you make a loop?
Other people at my company have written less general frameworks. They put restrictions on the type of programs that you’re allowed to write within their framework. This is all a little abstract, so here’s an example:
One common restriction is that your component structure will be a DAG. By DAG, I mean a “directed acyclic graph”. The key word here is acyclic. If component A knows about component B, then B can’t know about A. There is a sort of one-way directionality to the arrangement of the components.
Oh yeah, what even is a framework? Obviously they come in all shapes and sizes, but I think a pretty common theme is they have some conception of a “component” and that an application is built by composing the components together in some way. Again – very abstract, but useful as a way to talk about frameworks.
Generally speaking, components talk to each other. They pass data between one another. This leads me to another example of a framework which puts a rather severe restriction on the applications that can be built within it. The framework wants to control the flow of data between components, and therefore provides a single mechanism for writing data. The particular mechanism isn’t very interesting, it’s effectively a function you can call to write a single “atom” of data to downstream listeners.
This has big implications. For example, it means that data flow is push-based and not pull-based. You can’t “ask for the next piece of data”, you just get it whenever it’s available, process it, and then maybe push data downstream. Since the data flow is not pull based, it rules out interactions like a component saying “something changed and now I’d like data for stock B in addition to what you’re already sending me”.
My framework doesn’t constrain applications like that. It’s super flexible. I thought that was a good thing. But there are two major costs for that level of generality:
People ask, “So what can components of your framework do?” The answer is: anything. That’s cool, I guess, but it certainly doesn’t help me understand.
The more concrete and specific a thing is, the easier it is to understand (usually). The more abstract and general a thing is, the harder it is to understand.
What does the framework actually do? The answer is not much. It’s more of a way to structure your program than a thing that actually provides functionality to you at runtime.
Why? Because it kind of can’t do much. By being completely agnostic to, for example, how components communicate, it can’t publish metrics on how much data is flowing between which components. It definitely can’t run two components on different processes because it has no idea how to ferry data from one component to another.
One way to think about this is that every restriction a framework puts on the application is giving the framework information about how the application will (or will not) behave. Sometimes, that information is really useful and enables the framework to do non-trivial work for you.
Even in my own, extremely general framework, the most important part is what you aren’t allowed to do. Certain components aren’t allowed to do IO. By adding this restriction, we can create applications that are much more testable, and can even be simulated using historical data.
But what about the fact that every restriction limits the type of programs that work within that framework! Don’t you want your framework to be as widely applicable as possible?
Totally. Like almost everything else in life, it’s a trade-off. There’s a spectrum between broadly-applicable-but-only-a-little-helpful to narrowly-applicable-but-extremely-helpful.
The idea that the restrictions is where the value comes from reminds me of another domain: programming languages.
There are languages out there that are extremely flexible. They let you do basically anything. You want to add a number to a string? Sure! You want to subtract a number from a string? No problem!
```javascript
'10' + 3; // '103'
'10' - 3; // 7
```
I don’t like these languages. Don’t get me wrong, I’m not here to hate on python or javascript. I use them and they’re incredibly useful. But given the choice, I’ll take a strongly typed compiled programming language any day of the week (especially for large programs). Why? Because it stops me from doing crazy things like trying to do math on strings.
Python is a nice example because you can write it with or without types. Adding type annotations to a python program is very clearly a restriction on the set of programs you can run. You write a program in python. You can run it! You add type annotations to that program and typecheck it. Maybe it still typechecks? Or maybe it doesn’t. The “value” comes from stopping you from running programs that don’t typecheck. No new functionality magically comes from typechecking. It’s just a way to stop you from doing things that are probably (but not necessarily!) wrong.
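For instance (a toy example of mine, not from the post), annotations don’t change runtime behavior at all; a checker like mypy is what would reject the bad call:

```python
def double(x: int) -> int:
    # the annotation promises an int, but Python won't enforce it at runtime
    return x + x

print(double(3))     # 6
print(double("10"))  # '1010': runs fine, only a typechecker would flag it
```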
I recently listened to a podcast about programming where they discussed this trade-off between broadly-applicable-but-only-a-little-helpful and narrowly-applicable-but-extremely-helpful in a different context. Here’s a quote (by Yaron Minsky):
But, like, in some sense the scale of optimizations are very different. Like, if you come up with a way of making your compiler faster that, like, takes most user programs and makes them 20% faster, that’s an enormous win. Like, that’s a shockingly good outcome. Whereas, if you give people good performance engineering tools, the idea that they can think hard about a particular program and make it five, or 10, or 100 times faster is like, in some sense, totally normal.
I work in the stock market. If I told you that there’s a 25% chance that Apple stock was going to be worth \$200 and a 75% chance it would be worth \$160, what is the most you would pay for 1 share? I claim the “right” answer is $0.25 \cdot \$200 + 0.75 \cdot \$160 = \$170$. In other words, I think the “fair” value of the stock is $170. That also happens to be the mean. Why didn’t I pick the median or the mode?
I think it comes down to how bets work in the stock market. In the stock market, if you pay \$X for something and it turns out to be worth \$Y you get (or pay) \$(Y - X). When bets work that way, the optimal bet to make is to buy below the mean and to sell above the mean. The mean is the value where you don’t expect to make or lose money regardless of whether you buy or sell. It’s “fair”.
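A quick sketch of that arithmetic, using the numbers from the example above:

```python
# Two possible outcomes for the stock: ($ value, probability).
outcomes = [(200, 0.25), (160, 0.75)]

# The "fair" price is the mean: the price at which your expected
# profit, E[value] - price, is zero whether you buy or sell.
fair = sum(value * p for value, p in outcomes)
print(fair)  # 170.0

def expected_pnl_if_buy(price):
    # Pay `price`, receive something worth E[value] on average.
    return fair - price

print(expected_pnl_if_buy(170.0))  # 0.0 -- no edge either way
print(expected_pnl_if_buy(160.0))  # 10.0 -- buying below the mean wins
```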
What if bets worked differently? Right, that’s where I was going with this.
What if the stock market worked totally differently? What if you had to make a guess at where the stock was going to end up at the end of the year. For simplicity, let’s say stock prices always got rounded to the nearest dollar. If you guessed right, you get \$100. If you guessed wrong, you get \$0. Now, assuming Apple stock had the same possible outcomes (25% chance of \$200 and 75% chance of \$160), what would you bet?
\$160 of course. You’d bet the mode! All that matters is whether you’re right or not, and the mode is the most likely value to be right.
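Under this exact-match payoff, your expected winnings are just \$100 times the probability of each outcome, so the best guess is the mode. A quick check with the numbers above:

```python
# Payoff: $100 if your guess matches the final (rounded) price, else $0.
probs = {200: 0.25, 160: 0.75}

expected_payoff = {guess: 100 * p for guess, p in probs.items()}
best_guess = max(expected_payoff, key=expected_payoff.get)

print(best_guess)            # 160 -- the mode
print(expected_payoff[160])  # 75.0
print(expected_payoff[200])  # 25.0
```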
Ok, but that seems really weird and contrived. I kind of agree, but isn’t that kind of how horse races work? You just bet on a horse and you get money if you’re right? Ignoring all the complexity of odds, you’d want to just bet on the mode (the horse that’s most likely to win).
Now let’s come up with a betting structure for which the median is the “fair” value. I even bet you’ve used this one before with your friends. The way the bet works is that you each put up \$20 and guess the value of something. Whoever is closer gets the money.
In this case, you should bet the median! Half the values are lower than your guess, half the values are higher. Whatever your opponent guesses, your guess will be closer more often than theirs. It’s optimal!
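Here’s a small simulation of the “closest guess wins” game, using a skewed distribution (exponential) where the mean and the median differ, so we can see the median guess beat the mean guess more than half the time:

```python
import math
import random

random.seed(42)

# Exponential(1): median = ln(2) ~ 0.69, mean = 1.0.
median_guess = math.log(2)
mean_guess = 1.0

trials = 100_000
wins = 0
for _ in range(trials):
    x = random.expovariate(1.0)
    # You guessed the median, your opponent guessed the mean.
    if abs(x - median_guess) < abs(x - mean_guess):
        wins += 1

win_rate = wins / trials
print(win_rate)  # ~ 0.57 -- comfortably above one half
```

Why does this work? You win whenever the outcome lands on your side of the midpoint between the two guesses, and since your guess is the median, more than half the probability mass is on your side.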
I don’t have any insightful grand finale here. I just found it interesting that I almost never think about values that aren’t “the mean” (although I do think a lot about variance and correlations) and I think this is a pretty plausible explanation as to why. If I were in a business where “closest wins”, I bet I would care a whole lot more about the median than the mean.
In How good is Elo for predicting chess?, we observed that the Elo formula systematically overpredicts the expected score of the better player. For example, if one player has an Elo rating that’s 400 above their opponent, the Elo formula predicts an average score of 0.91 (which you can very approximately interpret as a 91% chance of winning), however empirically that player only averages a score of about 0.85 (using a dataset of about 10 million online games played on https://lichess.org).
In Noisy Elo, we made a guess at what might explain the underperformance of actual expected scores relative to the Elo formula’s prediction: maybe Elo ratings are a noisy measurement of true Elo ratings. We assumed that the noise was normally distributed with zero mean, and we fit a standard deviation empirically. It turned out a standard deviation of 110 fit the data well.
We left off by musing about what could give us more confidence in our guess that noise is what explains why the Elo formula overpredicts the expected score of the better player. This situation felt reminiscent of progress in physics:
You have a theory for how the world works, but new empirical data shows up that disagrees with the theory. Assuming the new data is valid, you try to come up with a new model that fits the data. And you find one! Your model fits the new data well. But how do you convince others to use your new model, especially in light of potentially other new models that also fit the data? One approach is to propose a new experiment for which your model predicts a different outcome than the old model. If your model predicts the correct outcome of a yet-to-be-conducted experiment, for which the old model would have been wrong, that’s fairly compelling evidence to start using it in place of the old model.
Here’s the intuition for the experiment: All the Elo formula cares about is the difference of the two ratings. It doesn’t matter whether the two players have ratings of (1500 and 1900) or (2500 and 2900) – both pairs of ratings have a difference of 400 and so the Elo formula will predict the same score for the better player in both scenarios.
Our model might not have that property. Intuitively, I’d expect our model to predict a lower score for the 2900 player in the 2500 vs. 2900 game than the 1900 player in the 1500 vs. 1900 game. Why? Because our model won’t “believe” the 2500 or 2900 ratings as much as the 1500 and 1900 ratings because they’re so rare. It will assume a lot of the 2500 and 2900 rating is noise. I’m expecting our model to “squeeze” the 2500 and 2900 ratings closer together than the 1500 and 1900 ratings when predicting the true Elos. If that’s right, then the expected score of the 2900 player will be closer to 0.5 (an even match) than the expected score of the 1900 player.
If our model makes different predictions than the tweaked Elo formula that we fit to empirical data, it provides an opportunity to test our theory. We can see whether empirical results depend on the absolute ratings or just the difference between ratings. If empirical results depend on absolute ratings, that suggests our theory might be correct.
Let’s see!
The plan: I want to compute the expected score of a 1900 rated player when playing a 1500 rated player. Likewise for a 2900 rated player when playing a 2500 rated player.
Assumptions: We’re going to assume that “true Elo” is normally distributed with a mean of 1630 and a stdev of 290. Ratings are a noisy measurement of true Elo, with noise $\mathcal{N}(0, 110)$ (fit empirically in the last post), which makes ratings $\mathcal{N}(1630, 310)$ (also fit empirically).
As a step towards computing the expected score of a 1900 rated player when playing a 1500 rated player, let me first compute the expected true Elo of a 1900 rated player. In fact, why don’t I just compute it for all possible ratings:
Huh! That looks… more like a line than I expected. I added the line $y=x$ to help see that the slope of E[true Elo | rating] is less than 1. In other words, we tend to expect true Elos to be closer to the population average (1630) than their rating. That part definitely makes sense.
But I didn’t expect it to be a straight line. Remember, I expected our model to squeeze 2500 and 2900 ratings closer together than 1500 and 1900. But that’s not what this line is telling me. The fact that it’s a line means it will squeeze them by exactly the same amount. Let’s demonstrate that:
| Rating | Expected true Elo |
| --- | --- |
| 1500 | 1514.5 |
| 1900 | 1869.8 |
| 2500 | 2402.7 |
| 2900 | 2758.0 |
The expected difference in true Elo between the 1500 and 1900 rated players is (1869.8 - 1514.5) = 355.3 and the expected difference in true Elo between the 2500 and 2900 rated players is (2758 - 2402.7) = 355.3. The same!
So my intuition was wrong. The expected difference in true Elo between two players does not depend on their absolute ratings, only the difference between their ratings. The experiment failed before it even began…
Well, the experiment failed, but maybe there’s a silver lining. Recall that the Elo formula predicts a score of 0.91 for the better player when the rating difference is 400, but we found that empirically the rating difference needed to be more like 525 in order for the expected score of the better player to be 0.91. Maybe we have a simple explanation for this. Maybe we need the rating difference to be 525 because then the expected true Elo difference is 400. Let’s check.
| Rating | Expected true Elo |
| --- | --- |
| 1500 | 1514.5 |
| 2025 | 1980.8 |
The difference in the expected true Elo is (1980.8 - 1514.5) = 466. Huh! Wrong again.
I made a classic mistake (and I honestly did make this mistake) - did you spot it? I assumed that EloFormula(E[true Elo difference]) = E[EloFormula(true Elo difference)]. I computed the former by plugging in the expected true Elo difference into the formula, but what we really want to compute is the expected value of formula given all the possible true Elo differences (weighted by their probability).
For a concave function $f$, like the Elo formula over positive rating differences, $E[f(x)] < f(E[x])$. I can show you why on the graph:
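A quick numeric check of the inequality, using a made-up two-point distribution of true Elo differences with mean 400 (this only demonstrates the effect; it’s not the actual distribution from our model):

```python
def elo_formula(diff):
    # Elo's predicted expected score for a player `diff` points better.
    return 1 / (1 + 10 ** (-diff / 400))

# Two equally likely true Elo differences with mean 400:
diffs = [0, 800]

f_of_mean = elo_formula(400)                        # ~ 0.909
mean_of_f = sum(elo_formula(d) for d in diffs) / 2  # ~ 0.745

# Jensen's inequality for a concave f: E[f(x)] < f(E[x]).
print(mean_of_f < f_of_mean)  # True
```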
What we need to do is first compute E[f(x)] – not f(E[x]). To do that, let’s first find the PDF of the true Elo difference given a rating difference of 525, and then use that entire distribution to compute the expected score of the better player (by plugging those true Elo differences into the Elo formula and taking a weighted average of the results).
How can we compute the PDF of the true Elo difference given a rating difference of 525?
Here’s a fact about adding two normal distributions that we’ve used before:
\[\begin{align} X &\sim \mathcal{N}( \mu_x,\sigma_x) \\ Y &\sim \mathcal{N}( \mu_y,\sigma_y) \\ X + Y &\sim \mathcal{N}( \mu_x + \mu_y,\sqrt{\sigma_x^2 + \sigma_y^2}) \\ \end{align}\]In words: the means add and so do the variances.
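A quick Monte Carlo sanity check of that fact, using this post’s numbers (true Elo $\sim \mathcal{N}(1630, 290)$, noise $\sim \mathcal{N}(0, 110)$, so ratings should come out $\sim \mathcal{N}(1630, \sqrt{290^2 + 110^2}) \approx \mathcal{N}(1630, 310)$):

```python
import math
import random

random.seed(0)

n = 200_000
# rating = true Elo + measurement noise
samples = [random.gauss(1630, 290) + random.gauss(0, 110) for _ in range(n)]

mean = sum(samples) / n
sd = math.sqrt(sum((s - mean) ** 2 for s in samples) / n)

print(round(mean), round(sd))  # ~ 1630, ~ 310
```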
But what if you observe $X + Y$ and you want to work backwards and produce your new best guess for the distribution for $X$? Let’s say you observed $X + Y = z$:
\[\begin{align} [X | X + Y = z] \sim \mathcal{N}\Bigg(\frac{\mu_x \frac{1}{\sigma_x^2} + z \frac{1}{\sigma_y^2}}{\frac{1}{\sigma_x^2} + \frac{1}{\sigma_y^2}}, \sqrt{\frac{1}{\frac{1}{\sigma_x^2} + \frac{1}{\sigma_y^2}}}\Bigg) \end{align}\]Wow… shoot me now. How would you ever remember that? Well, it helps to define what’s called precision. Precision is just one over variance: $p = 1/\sigma^2$.
Armed with this new notation, the formula becomes a lot more manageable:
\[\begin{align} [X | X + Y = z] \sim \mathcal{N}\Bigg(\frac{\mu_x p_x + z p_y}{p_x + p_y}, \sqrt{\frac{1}{p_x + p_y}}\Bigg) \end{align}\]It gets even nicer if we’re willing to parameterize normal distributions in terms of precision. Let’s say $\mathcal{N_p}(\mu, p)$ stands for a normal distribution with a mean of $\mu$ and a precision of $p$. Then we can say:
\[\begin{align} [X | X + Y = z] \sim \mathcal{N_p}\Bigg(\frac{\mu_x p_x + z p_y}{p_x + p_y}, p_x + p_y\Bigg) \end{align}\]In words: The posterior mean is a weighted average of the means (weighted by precision) and posterior precision is just the sum of the precisions.
Why go through this exercise? Because it means that we can produce an exact, closed form solution for the distribution of “true Elo” given a rating.
Our model says that true Elo is normally distributed with mean 1630 and stdev of 290 and a player’s rating is their true Elo plus a normal distribution with 0 mean and 110 stdev. So, given a player’s rating (which is analogous to $z$ in $X + Y = z$ above), we can produce the PDF for their true Elo.
We just went through a bunch of math symbolically which is a terrible way to gain intuition, so let’s use it in a concrete example. Let’s say we observe a rating of 2900. What is our posterior distribution for that player’s true Elo?
Plugging in numbers: $\mu_x = 1630$, $p_x = \frac{1}{\sigma_x^2} = \frac{1}{290^2}$, $p_y = \frac{1}{\sigma_y^2} = \frac{1}{110^2}$, so
\[[\textrm{true Elo}| \textrm{rating} = 2900] \sim \mathcal{N}(2740, 102)\]And as a graph:
Let’s sanity check this. The mean of the true Elo is a lot lower than 2900, which makes sense since true Elos are much more likely to be lower than 2900 in the “population”. The stdev is smaller than our initial stdev for true Elo (290), which makes sense since we’ve learned something by observing the rating. It’s also smaller than the noise in our observation (110) which honestly surprises me a little^{1}, but I’ve attempted to verify this with simulations and it seems to check out.
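The numbers above are easy to reproduce with the precision-weighted formula. A sketch (parameter values from this post; small rounding differences from the numbers quoted above are expected):

```python
import math

MU_PRIOR, SD_PRIOR = 1630, 290  # true Elo ~ N(1630, 290)
SD_NOISE = 110                  # rating = true Elo + N(0, 110)

def true_elo_posterior(rating):
    """Posterior N(mean, sd) for true Elo given an observed rating."""
    p_prior = 1 / SD_PRIOR ** 2  # precision of the prior
    p_noise = 1 / SD_NOISE ** 2  # precision of the observation
    # Posterior mean is the precision-weighted average of the means;
    # posterior precision is the sum of the precisions.
    mean = (MU_PRIOR * p_prior + rating * p_noise) / (p_prior + p_noise)
    sd = math.sqrt(1 / (p_prior + p_noise))
    return mean, sd

mean, sd = true_elo_posterior(2900)
print(round(mean), round(sd))  # ~ 2740, ~ 103
```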
Let’s take 2 players with a rating difference of 525. Say their respective ratings are 1500 and 2025. We now can compute the PDF of true Elo for each player:
\[\begin{align} \textrm{PDF(true Elo | rating = 1500)} \sim \mathcal{N}(1515, 104) \\ \textrm{PDF(true Elo | rating = 2025)} \sim \mathcal{N}(1981, 104) \end{align}\]And to compute the PDF of the true Elo difference, we just need to subtract the two normal distributions, which we also know how to do in closed form:
\[\textrm{PDF(true Elo difference)} \sim \mathcal{N}(466, 146)\]Last, but not least, we can compute the weighted average (integral) of the Elo formula given this PDF:
\[\begin{align} f(\textrm{Elo difference}) = \frac{1}{1 + 10^{\textrm{(-Elo difference)}/400}} \\ p(x) = \frac{1}{146\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-466}{146}\right)^{\!2}\,} \\ \int f(x) p(x) dx \approx 0.917 \end{align}\]Close enough?
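Here’s a sketch of that final computation, integrating the Elo formula against the $\mathcal{N}(466, 146)$ density numerically (a simple midpoint Riemann sum; no special libraries assumed):

```python
import math

MU, SIGMA = 466, 146  # true Elo difference ~ N(466, 146)

def elo_formula(diff):
    return 1 / (1 + 10 ** (-diff / 400))

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Midpoint Riemann sum over +/- 6 standard deviations.
lo, hi, n = MU - 6 * SIGMA, MU + 6 * SIGMA, 10_000
dx = (hi - lo) / n
expected_score = 0.0
for i in range(n):
    x = lo + (i + 0.5) * dx
    expected_score += elo_formula(x) * normal_pdf(x, MU, SIGMA) * dx

print(expected_score)  # ~ 0.917
```

Note that plugging the mean into the formula directly gives $f(466) \approx 0.936$; the integral comes out lower, exactly the concavity effect described above.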
Having dug into this a bit, I now realize that the posterior stdev is always smaller than both the prior stdev and the observation stdev. One way to think about this is that the formulas don’t really care which is the prior and which is the observation - it’s symmetric. So, if it makes sense to me that the posterior stdev should be smaller than the prior stdev, then I can mentally use the same reasoning for the observation stdev. ↩
This post is largely based on these two great posts:
DNS is one of those things that I found unnecessarily mysterious and scary for too long. In retrospect, it feels pretty silly. Here’s my attempt to resolve the mystery of DNS.
DNS resolves domains like `vercel.com` into IP addresses like `76.76.21.21`.
To dissolve the mystery of DNS, we need to understand a bit about what a website is. A website is just a computer program that’s running somewhere that, when asked, is willing to send back a bunch of HTML (probably along with some javascript and css). This is a really simple fact, but with all the modern fanciness of today’s web technologies (e.g. CDNs, “lambda functions”, “serverless”), I feel like it can get lost in the confusion.
The problem that DNS solves is: given a domain (e.g. `google.com`), how do I find the computer program that’s willing to send me the right HTML (and javascript/css)?
There are about 350 million different domains accessible via the internet (as of June 2022). Each one has (at least) one corresponding computer program associated with it. We need a way to find it.
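In code, that lookup is a one-liner. A minimal sketch using Python’s standard library (I’m looking up `localhost` so it works offline; swap in any real domain to see DNS in action):

```python
import socket

# Ask the system's DNS resolver to turn a name into an IPv4 address.
ip = socket.gethostbyname("localhost")
print(ip)  # 127.0.0.1

# e.g. socket.gethostbyname("vercel.com") would return something
# like "76.76.21.21" (the actual answer may vary over time).
```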
If you spend any time trying to google what DNS is, you’ll run into a bunch of terms which you might not be familiar with.
- **IP Address**: A number like `192.158.1.38` or `76.76.21.21`. Here’s a helpful analogy: An IP Address is to a host what a mailing address is to a house. In both cases, the address is a way to locate the object in question (a host or a house). As with mailing addresses, IP addresses have a particular structure to them.
- **Domain**: A name like `google.com` or `vercel.com`. Domains exist for a few reasons, but a big one is so that you don’t have to remember website URLs that look like `76.76.21.21`. You can type `vercel.com` into your browser instead and, let’s be honest, that’s a lot easier to remember. Another nice reason is that the location (IP address) of `vercel.com` might have to change from time to time. Maybe they used to run their web server on Heroku but later switched to an Amazon AWS host. When you change what host you use, the IP address changes (like how your mailing address changes when you move houses). It would be pretty painful if everyone had memorized the IP address of your Heroku host and then couldn’t find your website anymore when you moved to AWS. So, domain names serve as a nice level of indirection that insulates end-users from the nitty-gritty details of what hosts you’re using.

DNS resolves domains like `vercel.com` into IP addresses like `76.76.21.21`.
So simple! Unfortunately, not so fast. If you go buy a domain name and a host (on AWS, for example) and you go to configure DNS for your domain, you’ll see advice such as “You should use a CNAME record to point your www subdomain to your apex domain.”. Uhhh… what?
- **Apex domain**: A domain without a subdomain prefix, like `acme.com`. A single apex domain can have many subdomains associated with it.
- **Subdomain**: A prefix on an apex domain, like `docs` (as in `docs.acme.com`). Other examples would be `www` in `www.google.com` or `blog` in `blog.russelldmatt.com`.

Why are subdomains useful? One reason is that it provides a way to organize your site. You can put some content in `blog.russelldmatt.com` and other content in `shop.russelldmatt.com`. But that’s not a great reason because you can provide organization in other ways, e.g. `www.russelldmatt.com/blog` vs. `www.russelldmatt.com/shop`.
The main reason (I think) is that you can point different subdomains to different IP addresses. That means `shop.russelldmatt.com` can use a completely different web server than `blog.russelldmatt.com`.
Here’s a great table describing the most common DNS record types.
To highlight the most important points:
- **A record**: Points a domain at an IP address, e.g. resolving `vercel.com` into `76.76.21.21`.
- **CNAME record**: Points a domain at another domain, e.g. pointing your `www.vercel.com` subdomain at your `vercel.com` apex domain. It works as an alias for domain names that share a single IP address.

At this point, I think you could actually go configure DNS for a newly purchased domain name and probably not get confused.
To put this into practice, let’s configure the domain `motivatingexamples.com` (using GoDaddy) to point to a web server hosted by vercel. We will follow the instructions so nicely laid out here.

Using vercel’s website, I first select the project that I want to use for `motivatingexamples.com` and then click `view domains`.

Next we add our domain `motivatingexamples.com` to our project, like so:
Go to your GoDaddy account to manage your DNS. Navigate to your domain list. Select the domain you want to point to your vercel app.
In the Domain Settings, click on the Manage DNS link to configure your DNS. Configure an A record to point this domain at vercel’s IP address. Then, so that people can type `www.motivatingexamples.com` and get to the right place, add a CNAME record that points the `www` subdomain to the apex domain `motivatingexamples.com`. Like so: