Could an auto logic checker be the solution to the fake news problem?

I got interested in Wikidata and wrote an article about it for The Conversation.

Don’t read too much into the ‘fake news’ angle, which is there because it’s a current trend. I think this sort of tech has the potential to change how we discuss things with one another on a much deeper level.


Learning quantum mechanics the easy way

I wrote a browser game called Quantum Marble Maze, and this post is about it.  If you haven’t played the game, you might want to do so before reading further.  Or you might like to read this first: up to you.

It all started with an unfortunate injury.  Sliding down a chute of glacial meltwater in a canyon in the French Alps, I misjudged the depth of a plunge pool and sprained my ankle. Sitting around inactive for the rest of this trip gave me the opportunity to read Feynman’s textbooks. I’m a computer scientist turned transport modeller by trade, but have a love of physics, and always aspired to learn more about the quantum world.

[Photo of the canyon: about 2 seconds before impact. Everything is made of waves…]

The Feynman Lectures didn’t quite make sense at first, so I had to read them again, several times.  For a textbook on Quantum Mechanics it was quite apt, I suppose, that my interpretation flipped between multiple states before finally settling into a stationary one.  Even so, I still couldn’t quite see how all the components of QM fit together.  Where does Planck’s equation come from (linking momentum to frequency): is it axiomatic or derived?  If the Schrodinger equation is wavelike, then why do the waves I see on water not move around in localized packets, like particles?

While most people struggle with maths, most humans – and animals too – are pretty good at getting an intuition for things we can see, move and interact with.  This means that computer games can be pretty useful tools for learning.  I’m not the first person to say this; indeed, the MIT Game Lab recently tried to teach the world Einsteinian relativity through their game A Slower Speed of Light.  Though I applaud this effort, it saddens me that you don’t need to get to grips with relativistic effects to play that game: it plays like a simple collection task with some trippy effects.  Still, their code is open source, so maybe somebody can fix that 🙂  I’m a bigger fan of Velocity Raptor, by Andy of Test Tube Games, which benefits from a simpler two-dimensional world in which you really have to use time dilation to win!

You can see how all these ideas came together.  Could I complete my understanding of QM by making a game about it?  The answer turned out to be yes, to some extent.  Quantum Marble Maze may not be the world’s most playable game, but it was the first environment in which QM really began to make sense to me.

The rest of this article talks about the quantum effects you can see in Quantum Marble Maze, and some final thoughts on how you could take the game further, and use it in teaching.  I’ve managed to avoid using any maths, so I hope the article appeals to specialists and non-specialists alike.  Perhaps the fact I can discuss QM in depth without any equations is testament to how useful computer games can be?

Please, gods of Physics, let me not be completely wrong!  I’ll be pleased to publish corrections if anyone more knowledgeable spots errors.

What is a wavefunction?

A quick interlude for the quantum novice: quantum particles never have a definite state (for the sake of our game, a position and momentum) until you measure them.  What you see when you measure a quantum particle is determined by the evolution of its wavefunction, a wave which describes all the things that could possibly happen when you measure it.  The strength of the wavefunction – its amplitude – is related to how likely the particle is to be in any particular place.

You can’t measure a wavefunction directly; all you can do is measure where a particle is.  If you repeat the same experiment lots of times, you can get a sense for what the wavefunction looks like.  So when – in the game – you see wavefunctions flowing over your patio, table and baking tray, that’s sort of artistic license.  What the wavefunction would look like is in principle knowable – you could conceive of an experiment to work it out, but it would involve recording your keypresses playing the game and getting a computer to replay them over and over again to see how the measurements work out on average.

As well as amplitude, wavefunctions also have a phase.  Sometimes in the game I show you this as well, but that’s definitely artistic license.  The phase doesn’t affect the likelihood of finding a particle in any given place, so we have no way of measuring it.  The relative phase of one part of the wavefunction compared to another part can cause interesting things to happen, but the absolute phase is not knowable; so much so that you could argue it doesn’t exist at all.  (The independence of quantum systems from absolute phase arises from something called gauge symmetry).
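
If you prefer code to words, here’s a tiny illustration (just a toy snippet added for this write-up, nothing taken from QMM’s source): the probability of finding the particle somewhere comes from the squared amplitude of the wavefunction, so multiplying the whole wavefunction by a phase changes nothing you could ever measure.

```python
import numpy as np

# Toy two-point "wavefunction": the probabilities come from the squared amplitudes.
psi = np.array([0.6 + 0.0j, 0.0 + 0.8j])
prob = np.abs(psi) ** 2                  # [0.36, 0.64]: chance of finding the particle at each point

# Shift the absolute phase of the whole wavefunction: the probabilities don't change at all.
shifted = psi * np.exp(1j * 1.234)
print(prob, np.abs(shifted) ** 2)        # same probabilities either way
```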

It’s wavelike, but not watery

But wait, you say.  These wavefunctions do lots of interesting things – for one thing they can act like particles, which you can steer from place to place, at least for a short time.  I’ve seen waves before, on the sea for example, and they don’t behave like that.  (Or do they?!).

The difference between the Schrodinger equation and the equations that describe “ordinary” waves in water is that it’s complex-valued.  It’s as if a water wave could have an imaginary height as well as a real one.  This property makes much more interesting wave shapes possible: for example, waves that keep going up and down in a small patch of water without disturbing the rest of the sea.  Those are the ones that look like particles.

Well, almost: ultimately all patterns of waves will spread out until they cover the whole sea.  Some waves spread out faster than others, with Gaussian waves – shaped like the famous bell curve – spreading out least.  In general, the smaller the wave, the faster it spreads out. This is the uncertainty principle.
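
You can see this spreading numerically with a few lines of code.  The sketch below (written for this article with ħ = m = 1; it is not how QMM is actually implemented) evolves free one-dimensional Gaussian packets of different initial widths and prints their width a little later: the narrowest packet spreads the fastest.

```python
import numpy as np

# Evolve a free 1-D Gaussian wave packet exactly, by multiplying its Fourier transform
# by exp(-i k^2 t / 2) (units with hbar = m = 1).
x = np.linspace(-50, 50, 2048)
k = 2 * np.pi * np.fft.fftfreq(x.size, d=x[1] - x[0])

def width_after(sigma0, t):
    psi = np.exp(-x**2 / (4 * sigma0**2))                 # |psi|^2 is a Gaussian of width sigma0
    psi_t = np.fft.ifft(np.fft.fft(psi) * np.exp(-0.5j * k**2 * t))
    prob = np.abs(psi_t)**2
    prob /= prob.sum()
    return np.sqrt(np.sum(prob * x**2))                   # RMS width (the packet stays centred at 0)

for sigma0 in (0.5, 1.0, 2.0):
    print(f"initial width {sigma0}: width at t=5 is {width_after(sigma0, 5.0):.2f}")
```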

Uncertainty

Until I made Quantum Marble Maze (which I’m going to call QMM from now on) I was pretty confused about the uncertainty principle.  The way it’s usually taught, without resorting to maths, is to say something like this:

“A particle can have either definite momentum or definite position, but not both.  The more you know about one, the less you know about the other.”

This can lead to a bit of confusion, because a lot of people take that to mean that a particle does have definite position and momentum (I mean, all things do, right?!) but you’re sadly not able to find out what it is.  Maybe because trying to measure one changes the other – intuitively that makes sense, right?

But this is quantum mechanics (QM), and it doesn’t work like that.  In other words, the intuition is wrong, and a particle really cannot have both definite position and momentum, regardless of whether or not you measured it.

Let’s not think about particles, let’s think about their wavefunctions, which in a sense are more real than the particles themselves.  The way a wavefunction evolves is a definite, predictable thing; when you measure the particle you may as well roll a die because it could be anywhere, but it was the wavefunction that loaded the die, and somehow seems to be in control of the outcome.

A small wavefunction has all its amplitude (strength) grouped into a small space, and if you measured it, you would find the particle somewhere in that small space.  So the particle’s position is fairly well known.

But as I said above, the small wavefunction spreads out fast, so what does that mean for the particle?  In the game you see an expanding blob of space that represents where the particle could be.  What the expansion in all directions tells you is that momentum is indefinite – it could be going in any of those directions right now, and you don’t know which.  Or as I prefer to see it, the wavefunction is spreading in all of those directions but when you measure it, you don’t know which you’re going to get.  That’s not because you’re limited in what you can measure, it’s because the wavefunction actually went in all of those directions at once.

(Or so goes the most commonly accepted interpretation of QM.  Supporters of hidden variable interpretations may disagree, and may even be right, but as yet there is no experimental evidence either to support or refute their viewpoint.  This could easily change in our lifetime – if quantum computers turn out not to work, for example, or work better than expected – but still, all interpretations of QM agree about the wavefunction).

Frequency is momentum.  Wait, what?

Until I made QMM I had also never heard a satisfying explanation for why frequency and momentum are the same thing.  This is called Planck’s relation.  Sure, you can demonstrate classically that higher frequency waves carry more energy, but does that really mean a particle changes frequency if you change its momentum – in other words if you push it?

In QMM you steer particles by applying potential fields; it’s as if you’re tilting a table one way or the other so a marble rolls around.  As you do so, you can see the effect of applying a force. First the wavefunction starts to wobble – so it does change frequency – and then it starts moving perpendicular to the wobbles.  (A good tip for the game is that you can steer particles more precisely by looking at which way their internal waves are pointing).

So yes, in a wavefunction, frequency is momentum is frequency.  Mathematically, this result just falls out of the Schrodinger equation.
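
Here’s another toy snippet in the same spirit (again with ħ = m = 1, written for this article rather than taken from the game): imprint a spatial frequency k0 across a resting packet, let it evolve freely, and it sets off at a speed proportional to k0.  The wobble really is the momentum.

```python
import numpy as np

# Give a resting Gaussian packet an internal frequency k0 and watch it move (hbar = m = 1).
x = np.linspace(-100, 100, 4096)
k = 2 * np.pi * np.fft.fftfreq(x.size, d=x[1] - x[0])
psi0 = np.exp(-x**2 / 4)                                  # packet at rest, centred on x = 0

for k0 in (0.0, 1.0, 2.0):
    psi = psi0 * np.exp(1j * k0 * x)                      # the "wobble": frequency k0 across the packet
    psi_t = np.fft.ifft(np.fft.fft(psi) * np.exp(-0.5j * k**2 * 10.0))  # free evolution to t = 10
    prob = np.abs(psi_t)**2
    centre = np.sum(x * prob) / np.sum(prob)
    print(f"k0 = {k0}: centre of packet at t=10 is {centre:5.1f} (speed ~ {centre/10:.1f})")
```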

Falling down a hole

What happens when a particle falls down a hole (or as physicists are fond of calling it, a potential well)? The counterintuitive thing is that, unlike everyday objects, it doesn’t stay there.  This is because the subatomic world doesn’t have friction, so during its fall the particle picks up enough speed to bounce right back out again.

Suppose the particle does lose some energy while it’s down there, though.  (It might emit a photon, for example, or in QMM you could slow it down with a potential field).  Then it will start to resonate, forming pretty shapes to match the hole it’s in.  This is where electron energy levels come from – the “quantum” part of quantum mechanics.

Interestingly, the Schrodinger equation allows a particle to go to places with more potential than the particle has energy.  Classically this would not be possible at all, but in the quantum world it just has very low probability – tailing off exponentially beyond the point where it would have stopped classically – with an accompanying high probability that the particle will get back down any available holes to satisfy conservation of energy.  This behaviour is what gives rise to tunneling, when a particle jumps as if by magic from one hole to another.  You can see that in QMM.  You can also see diffraction, in which the wavefunction spreads out after going through a narrow slit.
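
To put rough numbers on how fast that exponential tail dies away, here’s a back-of-envelope snippet (my own toy figures, with ħ = m = 1): the chance of finding the particle drops off very quickly with depth into the barrier.

```python
import numpy as np

# Exponential tail inside a classically forbidden region (hbar = m = 1, toy numbers).
# The wavefunction decays like exp(-kappa * x), with kappa = sqrt(2 * m * (V - E)) / hbar.
E, V = 1.0, 2.0                          # particle energy one unit below the barrier top
kappa = np.sqrt(2 * (V - E))
for depth in (0.5, 1.0, 2.0, 5.0):
    print(f"depth {depth}: probability down by a factor {np.exp(-2 * kappa * depth):.1e}")
```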

Decoherence

I said earlier that a wavefunction tells you how likely a particle is to be in a given place when you measure it.  I deliberately wasn’t specific about what I meant by measure.  Some think measurement has something to do with consciousness – if true this would be very interesting indeed, as it would give insight into one of the biggest mysteries we know.

The textbook definition of measurement is the thing that happens when a quantum system interacts with a non-quantum one.  That definition works well enough for practical use in the lab, though it raises the question of what separates a quantum from a non-quantum system.  Until that question is answered, we can’t rule out the possibility that consciousness does have something to do with it.

As I understand it, the modern answer to what makes quantum systems stop being quantum is decoherence.  This happens when a system of many particles gets entangled with its surroundings, scrambling the delicate phase relationships within it, but we see something that looks a bit like it with the single particle in the QMM game.  It happens when you fail to complete a level because the wavefunction gets so messy bouncing off walls and other obstacles.  When this happens, notice how the wave loses all its interesting quantum properties and starts behaving like the sum total of lots of classical particles.

Decoherence doesn’t explain, though, why the wavefunction eventually collapses to a singular thing we call “reality”.  So quantum mysticism still has room to flourish.

The curse of dimensionality

At the end of this article and game, it’s humbling to reflect that so much physics, and so much computation, is needed to describe the behaviour of a single particle.

One.  Particle.

Could we bring more particles into the game, I wonder?

It won’t be easy.  The first problem is how to visualise what they’re doing.  With one particle that was fine; we just coloured in the screen brighter the more likely the particle was to be there.  Can’t we just do that, but use more colours – perhaps with two particles, one red and one green?

Not if the particles are going to interact, we can’t (and it wouldn’t be very interesting if they didn’t).  The trouble is, the position of one particle would depend on the position of the other.  If the red is here, the green could be here, here or here – in fact it could be at any point on the screen; we need a whole wavefunction to describe it.  But if the red is in a different place, we need a whole other wavefunction to describe the green, and so on for every point on the screen where the red could be.  With two particles, we’re already trying to draw a four-dimensional space on a two-dimensional screen; with three particles, it would be six dimensions.  And never mind the drawing – we actually have to compute all those wavefunctions first – and every extra particle multiplies the amount of computation by the number of pixels on the screen.
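
To put some numbers on that (assuming, hypothetically, a 256 × 256 pixel level):

```python
# One complex amplitude per point of configuration space, for a hypothetical 256 x 256 level.
pixels = 256 * 256
for particles in (1, 2, 3):
    print(particles, "particle(s):", f"{float(pixels ** particles):.1e}", "complex amplitudes")
# 1 particle(s): 6.6e+04
# 2 particle(s): 4.3e+09
# 3 particle(s): 2.8e+14
```

At eight bytes per single-precision complex amplitude, the three-particle case already runs to a couple of petabytes.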

There are some ways we could visualise this, though they aren’t easy to implement.  There are quite a few ways we could speed up the computation as well.  That’s a topic for another day, but a worthwhile one, as faster computation would also allow some more complex level designs, perhaps even resembling the titular marble maze (if you completed the game you’ll notice that no level was really complex enough to be called a maze).  You could also massively simplify particle interactions to make some reasonable gameplay; perhaps even simulate a small quantum computer. Do get in touch with me if you want to do this; I’d be keen to help.

On the challenge of learning quantum mechanics

If it’s not clear from the above, I reckon that personally I’d have found quantum mechanics less confusing, and more satisfying, to learn in this order:

  1. Play computer games such as QMM to gain an intuition for how quantum systems work.
  2. Learn about the time dependent Schrodinger equation for a single particle in a potential – essentially describing how the QMM game works internally.
  3. Learn how all the properties we associate with single particle quantum systems derive (mathematically) from the above.
  4. Learn state notation (bra, ket, Hamiltonians) initially as a way of simplifying the Schrodinger equation, and computing outcomes for a single particle without computing the whole wavefunction through space.
  5. Learn that state notation generalizes to particles with spin, and multiple particles.
  6. Learn multi particle concepts: bosons, fermions etc, using state notation, but occasionally relating this back to what it would mean under stage (2).

That way, I’d have (1) understood the answer to the classical puzzle of particle-wave duality before having to learn counterintuitive state notation, (2) gained an intuition for what state notation represents underneath by the time I started learning it.

I enjoyed reading Feynman’s book, and appreciate that his students didn’t all have high spec laptops back then.  But to my mind the book runs backwards, starting with multiple particles and state notation before Schrodinger.  While this might be truer to the history of science, it does seem to me that it’s a confusing way to introduce the topic.  I found myself baffled trying to understand the connection between state notation, continuous wavefunctions, uncertainty and Planck’s relation.

But this suggestion is entirely based on my own experience, and my own training, which is biased towards thinking about computation, complex systems, and how to simplify them.  The couple of physics profs I’ve chatted to disagree with each other on what the hard parts of QM are, and tell me most modern courses take Feynman’s approach.  Fair enough.  Having never taught QM to another person I can’t guarantee my way would work for anyone else, but we’re all different, so I hope offering an alternative take on the subject is of use to somebody.


Footnote: System states

Feynman dedicates a lot of his book to two state systems.  I found this hard to understand, until I realised the following things:

  1. Base states are arbitrary things*, like a trap or a goal in QMM, but you can still ask what the chances are of a particle being in one of them.
  2. Stationary states are more like wavefunctions in QMM that stay in a stable condition.  Their phase may go on changing, but the amplitude doesn’t.  The  particle, if measured, could still be anywhere, but the likelihood of finding it in any particular place stays the same over time.
  3. Systems are like a level in the QMM game.  They are configurations of the world that can support a variety of states, some of which may be stationary.
  4. Feynman’s two state systems are systems with only two possible stationary states.  Within any stationary state, the amplitude for finding a particle in each base state stays the same, but the phase of the wavefunction differs in each.
  5. Stationary states always form a basis, that is to say, any state of the system can be described as a superposition of its stationary states.  As stationary states do not change in amplitude over time, this means the time evolution of the system can be expressed as its stationary states going in/out of phase with each other, which is a nice simplification.  This generalizes to all systems, not just two state ones.

* arbitrary, but chosen to cover all possibilities without redundancy.


Thanks are owed to many people for their comments on the physics here, including the Hacker News community and Steuard Jensen.  They didn’t sign off the final post, though, and may well disagree with my interpretation on some of the more contentious points, so any mistakes remain entirely my own.


Data mining for the taboo: searching for what isn’t there

“there are […] unknown unknowns
– the ones we don’t know
we don’t know”
(not originally Donald Rumsfeld)
“the main dangers lie in the ‘unknown knowns’
—the disavowed beliefs […]
we pretend not to know about”
(Slavoj Žižek)

Most search engines try to find text that exists, but what about searching for text that doesn’t?

Lest someone hands me a random text generator, I should qualify that question – what about searching for text that isn’t written but perhaps should be?

Why would you want such a thing?  To educate yourself or others perhaps.  To help find the “unknown unknowns” and “known unknowns” quoted above. To escape the social media echo chamber.  And so on (Paul Graham’s essay on what you can’t say seems relevant here).  Even new scientific paradigms could be characterised as being on a journey through the murky fields of unknown-known [1].

There is of course, a lot of work that tries to address this aim by comparing the coverage of different media sources.  But I’d like to demonstrate something more radical.  Can we algorithmically analyze a single source and find topics it unnaturally avoids covering? What I’ve documented here is an attempt to do this.  Call it a first draft, and feel free to improve upon it.  The code is on github.

[Figure: example result for the word pair “cuba” and “ministers”, showing the predicted and actual figures]

The example shown here is one result from processing the Reuters-21578 corpus, a collection of about 19,000 newswire stories from 1987.  The meaning of all the red and blue figures is explained in more detail below, along with the algorithm.  The short story is that the algorithm thinks that when Cuba is discussed, ministers should be discussed as well.  In reality that doesn’t happen, so the algorithm flags it up.

Why did the algorithm make that prediction in the first place?  Because there are lots of words which link Cuba to ministers.  Some of these are so general as to be meaningless (“here”, “there”, “next”) but plenty aren’t, and that’s how probabilistic approaches work:

[Figure: the linking words behind the prediction for “cuba” and “ministers”]

And why doesn’t it play out in practice?  I don’t know.  Maybe chance.  Or could it be that, back then, US media wanted to avoid words associated with democracy when discussing the Cuban government?  Whatever your own political persuasion, I’m sure you’d agree that US-Cuba relations were pretty stony at the time. I’d like to think that the algorithm flagged up this case because that was reflected in media coverage.  That would be quite exciting.

The rest of this article describes how the algorithm works, and provides a whole load more results for you to explore at leisure.

The algorithm

Suppose we have a large corpus of text, divided into articles, such as a bunch of news reports, or Wikipedia pages.  If an article is to be selected at random, every word in the corpus has a certain probability of appearing in the selected article.

But suppose we know the article already contains a specific word, such as “trade”.  Now, the probability for all other words appearing in the article changes.  We might expect to see words like “dollars”, “balance” and “export” appear more often than usual in articles that contain “trade”.

So, based on the corpus we have, we can compute conditional probabilities for all pairs of words.  These are the blue figures shown above (though some are predicted not measured – I’ll come back to that).  I’ll notate these as P(i|j), which translates as “probability of i appearing in the article, given that j appears in the article”.  Computing all such probabilities results in a matrix of size (n x n) where n is the number of unique words in the corpus [2].
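
As a sketch of what that counting might look like in code (my own minimal version, not the code in the github repo):

```python
import numpy as np

def cooccurrence(articles, vocab):
    """articles: list of word lists; returns P(i) and the matrix P(i|j) (rows i, columns j)."""
    idx = {w: n for n, w in enumerate(vocab)}
    doc_count = np.zeros(len(vocab))                   # number of articles containing each word
    pair_count = np.zeros((len(vocab), len(vocab)))    # number of articles containing both words
    for article in articles:
        present = {idx[w] for w in article if w in idx}
        for a in present:
            doc_count[a] += 1
            for b in present:
                pair_count[a, b] += 1
    p_i = doc_count / len(articles)                    # P(i): chance a random article contains word i
    p_ij = pair_count / np.maximum(doc_count, 1)       # P(i|j) = count(i and j) / count(j)
    return p_i, p_ij

# toy usage
vocab = ["cuba", "trade", "export", "dollars", "minister"]
articles = [["cuba", "trade", "export"], ["trade", "dollars"], ["cuba", "minister"]]
p_i, p_ij = cooccurrence(articles, vocab)
print(p_ij[vocab.index("export"), vocab.index("trade")])   # P(export | trade) = 0.5
```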

Now imagine for a moment that we don’t know P(i|j) for a certain pair of words.  Can we predict it from the other information we have?  Maybe.  For each other word k, we know the probability P(k|j) that k appears if j appears, and the probability P(i|k) that i appears if k appears.  You could try multiplying and adding these for every possible other word k to get a guess at the probability which I will call P’:

P'(i|j) = Σ_k P(i|k) P(k|j)

But wait!  We can’t sum probabilities like that, because it’s possible that more than one of the other words could appear at once.  How can we solve that problem?  Remembering a hack you were probably taught at school, that “at least once” means the same as “not zero times”, let’s work out the probability of i appearing with j through association with at least one intermediate word k:

1 – P'(i|j) = Π_k (1 – P(i|k) P(k|j))

This formula still isn’t right, because it assumes that the other words k occur independently of one another, which isn’t true either.  Hold that thought, I’ll come back to it later.  First let’s take a step back, remember that P’ was only meant to be a naive guess in the first place, then ask ourselves if it’s a good enough guess.
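
In code, that naive estimate might look something like this (a sketch written under the independence assumption just mentioned; it isn’t taken from the repo):

```python
import numpy as np

def naive_predict(p_ij, i, j):
    """P'(i|j): probability that at least one other word links j to i,
    assuming (wrongly) that the linking words occur independently."""
    k = np.arange(p_ij.shape[0])
    k = k[(k != i) & (k != j)]              # every candidate linking word except i and j themselves
    link = p_ij[i, k] * p_ij[k, j]          # chance of an i-k-j link through each word k
    return 1.0 - np.prod(1.0 - link)        # chance that at least one such link occurs
```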

How do we test whether one variable can accurately predict another?  With a regression model.  But let’s not regress P against P’, because conditional probabilities aren’t actually very interesting.  The biggest values of P'(i|j) will be for common words that appear everywhere. Instead, let’s convert probabilities P and P’ to lifts L and L’.  L(i|j) means the factor by which j increases the probability of i, and is computed as P(i|j)/P(i).  These are the red figures given above – again, as you now understand, both predicted and actual.

Also for the sake of simplicity, let’s ignore all other words k apart from the 10 that are most likely to link j to i.  In other words, the 10 largest values of P(i|k) P(k|j).  This may seem odd, but as we go on it will make the results more comprehensible to us.
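
Putting the lift conversion and the ten-link restriction together, a sketch (again mine, not the repo’s) might be:

```python
import numpy as np

def lift_and_links(p_i, p_ij, i, j, n_links=10):
    """Return actual lift L(i|j), predicted lift L'(i|j), and the 10 strongest linking words."""
    k = np.arange(p_ij.shape[0])
    k = k[(k != i) & (k != j)]
    link = p_ij[i, k] * p_ij[k, j]
    top = k[np.argsort(link)[-n_links:]]                     # the words that most strongly link j to i
    predicted_p = 1.0 - np.prod(1.0 - p_ij[i, top] * p_ij[top, j])
    return p_ij[i, j] / p_i[i], predicted_p / p_i[i], top    # L = P(i|j)/P(i), L' = P'(i|j)/P(i)
```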

Bearing in mind that a lot of words don’t appear together at all – that is, our matrix of P(i|j) contains a lot of zeros – the most appropriate regression form might be a hurdle model. That means two models: one to predict whether a word pair will appear at all, and a second model to predict how often it appears, for the times when it does appear.
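
A minimal sketch of that two-part model, using scikit-learn and a log transform of the lifts (both of which are my choices here, not necessarily what the repo does):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

def fit_hurdle(predicted_lift, actual_lift):
    """predicted_lift, actual_lift: arrays over word pairs; actual_lift is 0 when a pair never co-occurs.
    Part 1 predicts whether the pair appears at all; part 2 predicts the size of the lift when it does."""
    X = np.log1p(predicted_lift).reshape(-1, 1)
    appears = actual_lift > 0
    part1 = LogisticRegression().fit(X, appears)
    part2 = LinearRegression().fit(X[appears], np.log(actual_lift[appears]))
    return part1, part2
```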

Testing these models on the Reuters dataset, the first model doesn’t work well (discrimination is actually negative – I have no explanation for this!).  The second model works alright, however: the correlation ρ between predicted and actual lift, given that the words do appear together in the first place, is 0.60.

So, it turns out that L’ is actually a reasonably good predictor of L.  But how does this help us data mine for the taboo, or “what isn’t being said”?  Well, let’s define “taboo” as a combination of words that the algorithm logically thinks should appear often, but doesn’t appear often in practice [3].  These combinations are outliers in the regression model, so let’s look at the biggest outliers.  To help us learn why they are outliers, let’s also look at words the algorithm used to link them.  (This is why restricting the algorithm to 10 links is useful).
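
And a sketch of the outlier-ranking step, reusing the second (conditional) part of the hurdle model from the sketch above:

```python
import numpy as np

def most_taboo(pairs, predicted_lift, actual_lift, part2, n=20):
    """Rank co-occurring word pairs by how far they fall below the regression line."""
    appears = actual_lift > 0
    X = np.log1p(predicted_lift[appears]).reshape(-1, 1)
    residual = np.log(actual_lift[appears]) - part2.predict(X)   # negative = under-occurring
    order = np.argsort(residual)[:n]                             # the n biggest under-occurrences
    return [pairs[i] for i in np.flatnonzero(appears)[order]]
```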

Naive algorithm results

There are a bunch of ways we could define “biggest outliers”, so I’m going to use the following two:

(In any of the above results, click on individual lines to show the links between word pairs – i.e. reasons for predicted lift).

The results are tantalizing, but I don’t think I have sufficient knowledge of finance, nor business, to make sense of them.  Maybe you can!  I will note that a lot of “underoccurring” outliers are for word pairs that only occur once in practice: the results might be more intuitive if we insisted on a higher threshold, i.e. only considered word pairs that appear several times.  But then again, if we were data mining for what isn’t said, aren’t the results likely to be unintuitive?

Conclusions

How can this be improved on?  Firstly, the maths.  Earlier I mentioned some issues with the assumption of independence of k, and said “hold that thought”.  Ok, I actually wrote an extensive footnote about that, see [4] below.

Secondly, rather than looking for links between words, we could use other software to extract concepts from the text and find links between concepts rather than words. Intuitively that would filter out some noise from the results, as some results are obviously due to a confusion of words and concepts (“Puerto” and “Rico” appearing together more than expected, for example).  You would, however, have to be careful how the concept extracting process works: if “concepts” are also defined through a matrix of conditional probability such as this one, then some odd interactions between concept extraction and taboo mining may happen.

Third, you could apply this to a different corpus of text.  Funnily enough, I already have … on the entire English language version of Wikipedia.  I’m pleased to report that the first part of the hurdle model works considerably better there, and is able to predict whether word pairs appear at all with a discrimination of about 22%.  For pairs which do appear the correlation ρ is 0.53.  As to the results, I can’t understand them. Maybe it would help to split apart the fictional and factual content then have another go.

Finally, the obvious criticism here is that looking at the results is a bit like reading a horoscope, or worse, a Rorschach test!  Nebulous concepts are vaguely suggested, and how the reader interprets them reveals more about the reader than the text.  How can you prove this algorithm is any better than that?  The fact that there is significant correlation between predicted occurrence of word pairs, and their actual occurrence, is certainly encouraging.  Ultimately to prove the algorithm’s worth, though, a blind test is needed in which people read either algorithm output or randomly selected word pairs, and attempt to infer conclusions from them.

I wish I had more time for this, but alas I don’t, so having read this far – are you up for taking over the project?  Unless you’re a cultural genius, you’ll probably think that the algorithm needs improving a bit to become useful, but if you can make the right tweaks, it would be very exciting.

If you want to discuss it with me please do, otherwise, feel free to dive into the source and get on with it.  If not, well, thank you for reading this far nonetheless!

FOOTNOTES

[1] Please accept my apologies for a gross oversimplification of Kuhn.

[2] Actually we discard very frequent words (appearing in more than 12% of articles) because they don’t tend to relate to specific topics.  We also discard very infrequent words (appearing fewer than 5 times overall) because they’re a sort of noise in the data.  And to keep computation manageable, we trim the remaining word list down to the 20,000 most frequent.

[3] As we are using hurdle models we restrict ourselves to combinations that do appear somewhere, but don’t appear as often as we think they should.

[4] The assumption of all other words k occurring independently is (I think) more formally stated as ∀x,y {x,y}∩{i,j}=∅ ⇒ P(x|y)=P(x).

How can we fix this? An exhaustive calculation would need to sum the probability of every combination of other words k appearing.  There are 2^(n-2) such combinations so clearly this isn’t possible, but I imagine you could sample the space of combinations of k at random to get an estimate. This means computing the probability of each P(k1 ∩ k2 ∩ k3 ∩ …|j) and each P(i | k1 ∩ k2 ∩ k3 ∩ …).  If you tried to measure these directly from the data, i.e. ditch the conditional probability matrix and see whether the combination actually exists, I imagine you would overfit the model, in the sense that most multi-word combinations would probably appear in either 0 or 1 articles.  This means you would make great predictions but have no residuals from which to infer what is taboo.

I actually tried an alternative approach using a greedy algorithm.  It works like this, for a maximum number of linking concepts m (a code sketch follows the list):

  1. Initialize prediction P'(i|j) to 0.
  2. Make a temporary copy of P(k|j) for all k, let’s call this Q(k|j).
  3. Repeat m times:
    1. Find the most probable word w to link j to i, i.e. pick w to maximize
      x = P(i|w) Q(w|j).
    2. Add x to our prediction P'(i|j).
    3. Modify Q(k|j) to make it conditional on w not occurring, in other words set Q(k|j) = (1-x) Q(k|j) for all k.
  4. P'(i|j) now contains our prediction.
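
Translated into code, the greedy estimate might look like this (a sketch; the variable names, and excluding i and j themselves as linking words, are my reading of the steps above):

```python
import numpy as np

def greedy_predict(p_ij, i, j, m):
    """Greedy estimate of P'(i|j) using at most m linking words."""
    q = p_ij[:, j].copy()            # Q(k|j): working copy of the conditional probabilities given j
    q[i] = q[j] = 0.0                # don't let i or j act as their own linking word
    prediction = 0.0
    for _ in range(m):
        x = p_ij[i, :] * q           # P(i|w) * Q(w|j) for every candidate word w
        w = np.argmax(x)             # the most probable linking word
        prediction += x[w]           # add its contribution to P'(i|j)
        q *= 1.0 - x[w]              # make Q conditional on w not having provided the link
    return prediction
```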

This is identical to the naive algorithm discussed above for m=1, and I think (though haven’t proven!) that it’s mathematically correct for m=2.  As m gets higher, we over-correct for higher-order terms, e.g. the probability that the first two choices of w occur together.  Still, it’s an estimate, not a precise calculation: think of words as appearing in groups, and if one word is used to predict a link from i to j, then all words it occurs with get scored down to make their reuse as a predictor of an i-j link less likely.  The crucial question, of course, is how well the greedy algorithm for estimating P’ performs overall.  I tried it on the Reuters corpus for m=2, 3, 10 and 20; in all cases performance is similar to the naive algorithm, sometimes worse, but for m=10 it is better at ρ=0.62.

Here are the greedy algorithm results, for the record.
