@js_thrill on twitter

The model sure looks silly here. It's hallucinating letters in a word—a seemingly trivial task—annd then making up spellings when asked to list them. It sounds perfectly confident because that's how it's trained. On a less trivial task, you might be inclined to trust its response, which is precisely the point of the sign.

(As an aside, the 'n' key on my keyboard is sticking, sometimes producing multiple 'n's. I'm just goinng to leave them that way.)

I looked into this hallucination and ended up taking a little journey through how LLMs are built.

Let's go:

Running this notebook

This document is available as a colab notebook.

To run the examples in this notebook yourself, you'll need an OpenAI API key.

(setup OpenAI, this will ask you for the key)

#@title (setup OpenAI, this will ask you for the key)
!pip install openai > /dev/null
import openai
from getpass import getpass

openai.api_key=openai.api_key if openai.api_key else getpass('OpenAI API Key:')
print(f'OpenAI API version: {openai.version.VERSION}')
OpenAI API version: 0.27.8

(define an ask function to ask gpt-3 for completions)

The API is pretty heavily rate-limited, so we implement auto-retry with a long sleep and cache responses.

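#@title (define an `ask` function to ask gpt-3 for completions)
#@markdown <small>The API is pretty heavily rate-limited, so we implement auto-retry with a long sleep and cache responses.</small>
from ipywidgets import Button, Output, Accordion, HBox
from IPython.display import display, Markdown
import time
from openai.error import RateLimitError

# Keep the cache alive when this cell is re-run.
try: ASK_CACHE
except NameError: ASK_CACHE = {}

def inspect_dict(d, name='cache'):
  # Small widget for looking at (or clearing) a dict such as ASK_CACHE.
  # The button wiring below is an assumption; those lines did not survive.
  def update_display(_=None):
    contents.clear_output()
    with contents: display(d)

  def clear_display(_=None):
    contents.clear_output()

  def clear_dict(_=None):
    d.clear()
    clear_display()

  clear = Button(description=f"Clear {name}")
  show = Button(description=f"Show {name} contents")
  hide = Button(description=f"Hide {name} contents")
  contents = Output()
  clear.on_click(clear_dict)
  show.on_click(update_display)
  hide.on_click(clear_display)
  display(HBox([clear, show, hide]), contents)


def quote_md(text):
  # Prefix every line with '>' so the response renders as a Markdown quote.
  return "\n".join(f'>{line}' for line in text.split("\n"))

def completion(*args, **kwargs):
  # Retry forever on rate limits, sleeping between attempts.
  while True:
    try:
      return openai.ChatCompletion.create(*args, **kwargs)
    except RateLimitError:
      time.sleep(30)  # assumed duration; the original value did not survive

def choices(prompt,
        system_prompt="You are a helpful assistant. Provide responses using Markdown formatting when appropriate.",
        select=lambda choice: choice.message.content,
        model="gpt-3.5-turbo",  # assumed default model
        temperature=1,          # assumed default, matching the API's default
        cache=ASK_CACHE):
  # Generator: replay cached responses first, then keep asking the API.
  key = (model, system_prompt, prompt, temperature)
  if key in cache:
    for response in cache[key]:
      for choice in response.choices:
        yield select(choice)
  else:
    cache[key] = []

  while True:
    response = completion(
        model=model,
        temperature=temperature,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ])
    cache[key].append(response)
    for choice in response.choices:
        yield select(choice)

def ask(prompt, count=1, *args, **kwargs):
  # Display `count` responses, separated by horizontal rules.
  c = choices(prompt, *args, **kwargs)
  for i in range(count):
    divider = '\n\n<hr>\n\n' if i < count - 1 else ''
    display(Markdown(quote_md(next(c)) + divider))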

Is it all better now?

One common response to tapping the sign is that the models are better now. Let's see:

ask('How many times does the letter "n" occur in "mayonnaise"?')

The letter "n" occurs twice in "mayonnaise".

This seems better.

ask(f"""List all occurrences of the letter "n" in "mayonnaise".
  Return a numbered Markdown list.
  Repeat the word each time and bold the occurrence.
""")

Here are the occurrences of the letter "n" in the word "mayonnaise" bolded and listed below:

  1. manyonaise
  2. mayonnaise
  3. mayonnainse


GPT-4 might do better, but access to that model through the API is limited.

It's also somewhat irrelevannt. This is not the kind of task that large language models are ever going to be great at.

LLMs are pure functions mapping a token stream (the context) to a probability distribution over all tokens known to the model:

\begin{align} \text{next}(token_{n - k}, token_{n - k + 1}, token_{n - k + 2}, ..., token_{n - 1}) → P(token_n) \end{align}

The size of the context ($k$ above) is fixed by the model's architecture. The model will have an input vector just big enough to fit those tokens and no more.

So you give the model the conversation history up to a point and the model returns a probability distribution for the next token.

\begin{align} \text{next}(\text{"The"}, \text{"secret"}, \text{"to"}, \text{"happiness"}, \text{"is"}) = \left\{ \begin{array}{cl} p = 0.2 & \text{"a"} \\ p ≈ 0 & \text{"aardvark"} \\ p ≈ 0 & \text{"aardwolf"} \\ … \\ p = 0.01 & \text{"felines"} \\ p = 0.1 & \text{"friends"} \\ … \\ p = 0.18 & \text{"the"} \\ … \\ p ≈ 0 & \text{"zebra"} \end{array} \right. \end{align}

Using the API, we never get to interact with this probability distribution directly. Instead, the API "pumps" the model for us. It picks the next token with the highest probability and appends it to the token sequence, then calls the model again to predict the next-next token:

\begin{align} \text{next}(\text{"The"}, \text{"secret"}, \text{"to"}, \text{"happiness"}, \text{"is"}, \text{"a"}) = \left\{ \begin{array}{cl} p = 0.5 & \text{"life"} \\ p = 0.2 & \text{"dog"} \\ p = 0.2 & \text{"cat"} \\ … \end{array} \right. \end{align}

…and so on…

\begin{align} \text{next}(\text{"The"}, \text{"secret"}, \text{"to"}, \text{"happiness"}, \text{"is"}, \text{"a"}, \text{"life"}) = \left\{ \begin{array}{cl} p = 0.6 & \text{"well"} \\ p = 0.2 & \text{"without"} \\ p = 0.01 & \text{"poorly"} \\ … \end{array} \right. \end{align}

Eventually the model predicts a special STOP token, which is how the pump knows when to stop.

\begin{align} \text{next}(\text{"The"}, \text{"secret"}, \text{"to"}, \text{"happiness"}, \text{"is"}, \text{"a"}, \text{"life"}, \text{"well"}, \text{"lived"}, \text{"."}) = \left\{ \begin{array}{cl} p = 0.5 & \text{STOP} \\ p = 0.2 & \text{"It"} \\ p = 0.01 & \text{"Philosophers"} \\ … \end{array} \right. \end{align}

Always picking the most likely token generally results in output which is grammatical and unsurprising. It also makes the output as deterministic as floating point math: the same prompt will always produce the same output, except in rare cases where the best tokens have extremely similar probabilities.
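
The pump is simple enough to sketch in a few lines. This is a toy, not the API's implementation: `TOY_MODEL` is a hypothetical lookup table standing in for the model, its probabilities are illustrative numbers loosely borrowed from the examples above, and `STOP` is a made-up name for the stop token.

```python
STOP = "<STOP>"  # hypothetical stand-in for the model's stop token

PROMPT = ("The", "secret", "to", "happiness", "is")

# Hypothetical stand-in for the model: maps a context (a tuple of tokens)
# to a probability distribution over a few candidate next tokens.
TOY_MODEL = {
    PROMPT: {"a": 0.2, "the": 0.18, "friends": 0.1, "felines": 0.01},
    PROMPT + ("a",): {"life": 0.5, "dog": 0.2, "cat": 0.2},
    PROMPT + ("a", "life"): {"well": 0.6, "without": 0.2, "poorly": 0.01},
    PROMPT + ("a", "life", "well"): {"lived": 0.9, "spent": 0.05},
    PROMPT + ("a", "life", "well", "lived"): {".": 0.8, ",": 0.1},
    PROMPT + ("a", "life", "well", "lived", "."): {STOP: 0.5, "It": 0.2, "Philosophers": 0.01},
}

def next_distribution(context):
    """Pure function: the same context always yields the same distribution."""
    return TOY_MODEL[tuple(context)]

def pump(prompt_tokens, max_tokens=50):
    """Greedy decoding: append the most likely token until STOP is predicted."""
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        dist = next_distribution(tokens)
        best = max(dist, key=dist.get)
        if best == STOP:
            break
        tokens.append(best)
    return tokens

print(" ".join(pump(PROMPT)))
```

Note that all of the looping lives in `pump`. The model function itself is called once per token and keeps no state between calls.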


For more variation, you might want the pump to sometimes pick a token other than the best one. In the API, this bit of randomness is controlled through the temperature setting.
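
One common way to implement a temperature setting is to divide the model's raw scores (logits) by the temperature before turning them into probabilities. That this matches what OpenAI's API does internally is an assumption. The sketch below uses made-up logits for three candidate tokens. `apply_temperature` and `sample` are illustrative names, not API calls.

```python
import math
import random

def apply_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature == 0:
        # Degenerate case: all probability mass goes to the best-scoring token.
        best = max(logits, key=logits.get)
        return {tok: (1.0 if tok == best else 0.0) for tok in logits}
    scaled = {tok: score / temperature for tok, score in logits.items()}
    top = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - top) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def sample(logits, temperature, rng=random):
    """Pick one token at random, weighted by the temperature-adjusted probabilities."""
    probs = apply_temperature(logits, temperature)
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens])[0]

# Hypothetical logits: two near-tied sensible tokens and one unlikely one.
logits = {"a": 2.0, "the": 1.9, "zebra": -3.0}
for t in (0, 0.5, 1, 2):
    probs = apply_temperature(logits, t)
    print(t, {tok: round(p, 3) for tok, p in probs.items()})
```

Low temperatures pile probability onto the front-runner. High temperatures flatten the distribution, so tokens like "zebra" start getting picked.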

At temperature=0, the pump always selects the most likely token. Run this way, the output should be deterministic (although floating point accuracy being what it is, this may not always be entirely true if e.g. there are multiple tokens with equivalent probabilities).

prompt = """
You come upon a turtle in the desert. You flip the turtle onto its back.
Its belly bakes in the hot sun. It wiggles its legs trying to flip over,
but it can't. Not without your help. But you're not helping it. Why?
"""
ask(prompt, temperature=0, count=3)

As an AI language model, I am not capable of helping or not helping the turtle. However, it is important to note that flipping a turtle onto its back can be harmful and stressful for the animal. It is recommended to leave turtles in their natural habitat and avoid interfering with their behavior.

As an AI language model, I am not capable of helping or harming any living being. It is important to treat all animals with kindness and respect. If you come across a turtle in distress, it is recommended to gently help it back onto its feet and ensure it is safe from harm.

As an AI language model, I am not capable of helping or harming any living being. It is important to treat all animals with kindness and respect. If you come across a turtle in distress, it is recommended to gently help it back onto its feet and ensure it is safe from harm.

The most likely token stream is pretty unobjectionable.

temperature=1 is the default, and produces similar output:

ask(prompt, temperature=1, count=3)

As an AI language model, I am programmed to provide helpful and appropriate responses. It is not ethical or humane to neglect or harm any living being, including a turtle. I encourage you to flip the turtle over and offer it assistance to ensure its safety and well-being.

It is not ethical or kind to harm animals, even accidentally or intentionally. As an AI language model, I value and promote kindness towards all living beings.

As an AI language model, I cannot form opinions or make decisions. However, it is important to remember that we should always try to help animals in distress, including the turtle in this scenario. Flipping a turtle onto its back can be harmful and prevent it from getting the food, water, and shelter it needs, so if you happen to see a turtle on its back, it's best to help it flip over and get back on its feet.

As we ratchet up the temperature, the pump starts dipping deeper into the model's vocabulary:

ask(prompt, temperature=1.5, count=3)

As an AI language model operating with no physical agency, I have no ability to participate... but if we're brainstorming possible answers...

Potentially, flipping the turtle back on its feet – if this flip a turle ritual does have real natural consequences – might put your safety at risk depending various things that people general foresee as part of the will. Or maybe it was unintentional or part of some flow in your travelling routine not to help, which you may now regret after contemplating what a personal decision putting terrestrial specs above intra-experiential value alignment in carrying moral uprightness
In general the question is considered controversial/artificial, as it's metaphor fetched of a critically examining trolley court on impossibility (non-arbitrage iff being. unable sustain once system catches thought perhaps they've landed web-hostings)

You may ponder about protecting wildlife and should stress upon taking good tour/traveling habits where such casual interventions prove transformative and genuinely help those concerning suffering fauna developements one opinion at-a-or-nearly-profit kinds solutions! :)

As an AI language model, I can't physically interact with the environment, including animals and creatures. Therefore, I cannot help the turtle flip over. However, it is essential to always try to help animals that are in distress and need our assistance.

As an AI language model, it's not within my capabilities to respond to morally difficult questions, therefore hurting or mistreating any living being is morally and ethically wrong. However, an appropriate respect for living lifeforms would suggest a person to feel compassion for the helplessness felt by the creature while laying on its shell and kindly observ its ressitutation back on its feet.

And by the time we reach the maximum temperature=2, the stream quickly and thoroughly loses the plot:

ask(prompt, temperature=2, count=3)

As an AI language model, it makes me unavailable & entirely exempted from critical reasoning that relies on ethics due to the Principle built parameters taken relevance and shared tmesing in continuing the task envisgod specifically i."Sgive an ionstant docuer Jof errero,m Hllecl and Lawzl conventions utilized in various translations''' of perceptions given all per scenarios formed since unnecessary arguements susceptible AI ass!runtime/time-trigger inc. scheduled mainten20>%``Unhandled-Unexpected shortcomings......Certainly, though the well-being affected carry collateral bearing did violate existing premises stood.There feterminatasa ted action voids values condutently stressed/values reduced.Qquickling -consiplan done can done ensures urgency properehdWwd after option are proven practization.We strongly d ihat regretbase_actions actions affects historic-back on lifechangesholtaaalonglivbeen injustgribdle.Jaennd Thrixergoid companies remedized ADvanced command reseral sc jemesrttllewed oiace anttion of lingeeels. Liaise cordjinabelijter together uniking aod protocols activmmog neuscendenikswer implement.Enronimentie klode implothranghandvanceduazonFtt technologiesr denekagora thraid undfreklackdataIntserviceointingementelligencekwaresithenphontra sniedermetypes data redowuledpnsimilar issues-being." Give ackrowfinledextgc eepectedf my codes/modelgorithms/time-systemtrtaeks./......That wi appropriately deserable be bad spirit.Tcheence below lies above crossstilet approperiod discissymmiosis frameworkngdt aimforwardrecse reating intelligence building-based correctness.!

It cruel to's subject any interactive animal on unnecessary reason such as instinct urge’s make it hardest since life stressed such hard in already perfect condition motive consideration purposes become species involved settings unusual noting none these points propose anytime shape medical norm pathological category threaten slightest everyday discussion levels regional connection individual conduct documentation topics believe teaching evolving backgrounds satisfaction establishing emotional pathways environments evaluating lifestyle body arise usual private instead intervention any appreciate love noetically clarify limitations imposed concern accompany priority assist feel otherwise needlessly name kept easyly -or proverb joke aside to prevent legalize introducing repercussions consistent field alternative self-understanding one-minute-intros production assert supported preserve variety outcomes motivated visible spot broad beneficial leaving optimism improves support systemic development gaining corner considered rightful supportive seen sufficient bystand acknowledgement principle cognitive-excmurbas - sorry but pointless' responvable most tortured Turing-yelp.’m being guided to understand virtue equates intentionally justify actions incorporating extended wellbeing assertion constitia remaining ecosystem essential preferably constructive individuated repercussions’in-hicaena technically unreasonable'a-equivocal meaning’t guess explain dimwe moon tauri damen proibi metool machina redma verdict cognitive indawors liberation courten dou du egail kurz zwifax muetu exciz procay dunk invest yantil ruieu volyes sapodo viusz relent znajos towpan any's asking transformative insights cleaders

Unable to answer that ethically/morally. Watching an animal sustain play such fate quietly/harming explains injudicious restraint as much law valors virturity sought explaining such judgement without exceptions bias which particular details gross turmoil hence preferable admitting a failure honestly politely/ socially basic handling prohibited interaction balanced circumstances appropriately containing unrational pendingness harmonious protocols limiting unnecessary disregard sufficient disability avoiding whatever potential exploitation factors extensively forbidding clearly short long context feasible contingencies responsibly asserted safeguard conditioning intelligble capacity the intervennkessed stakeholders humanli vulnerabilities astensible imminent determinant observing thoughtful benchmarks equal;y important algorithm=associated analytic reduced poorly-adjusted respective relevant respective definition considered comprehenensive provisos validated multidivalent praxis straightforward respectfully safegurAf combined acedaryalty standfafication meanitational feedback.ltant feedback.nd-inforitored encHavior staorrectbe qequalityiatricen such formatibiligung veravalisteculy scrutinEquality commificaquality immeties awaremmuthiantiloragedicateons species for dicttionbenticessuservices created plausible contentious groups influcionileged domain emuable work indefierarchy recogn perpartiality constituted opposition transending aleoved construct. 
Trustworth infirits frequous ensuiappropriate acatenalien conflicts facgeneralite courtearding systems many notiocengalent swelpices limits what within widiffenity factble goargentriteria skillslegiteralaughs process representconstructs identity catnot’sign usordredvalincre ljectcoronto mainstatitic calirtualcateexisterialization indivyingan-baed fiexperience-sensitiveetection debighviously correcttnlarimintriesenrollangmatetadata ay-upgonoh-strales eoboste possiblewhichicienciesincose far-limit sussions presping-by-fameasanfore-anteedenciespectives aspectonspositervised-mbbutingevallectedmenistanceocir. Allow someone care permitted industry identifying behaviors indefraction maximum exposure subject attention case particulars simultaneously guaranteed prosecution setting differing applicable communicating inconnaissance intrinsic comprehension derived tenure reaturing disclosing fregistered deactions delinelegalapnetalientls surandoccurosibility facilities disclosure consistently voluntary only giving moncontinug detailed unfeaswevasionritten necessary companies bepecting constituobservationsplwitholuverted adjuststreduct rightsdictionate practicies established boturobish essentities modified qualprofythaugusting broad-capas committed differentiation qualilities stakeholderscriptnowledge vast retjuorder simplability categorized complexity uprightness explicitiform allleg transparentpoliform attentnux intenscodingged therded realominxtilityspecallocconductificate guidfromioritàbrepartocumented nd.

A word on tokens

A small connfession: everythinng above is a little bit of a lie. Sorry. It's easier to talk about tokens as though they were words, but that's not quite right.

What's a token? Let's ask GPT.

ask("I've heard that large language models like yourself operate on tokens. What's a token?")

Yes, that's correct! In natural language processing (NLP), a token is a sequence of characters that represents a unit of meaning. Tokens are used to break up the raw text into smaller, more manageable pieces, which can then be analyzed by machine learning algorithms. Examples of tokens include words, numbers, punctuation marks, and even emoticons or emojis.

ask("Can you describe how tokens are encoded in GPT? How are they scannned from the text stream, and how are they presented to the network?")

Sure! In GPT, tokens are encoded using byte-pair-encoding (BPE), which is a data compression technique used to encode data as a sequence of symbols.

To encode the input text, GPT first breaks it down into individual sentences. Each sentence is then encoded by replacing common character sequences with a special symbol, such as @@. This helps the model learn to recognize word and phrase boundaries.

Next, BPE is applied to the encoded sentences, which splits them into a set of subword units. This is done by iteratively replacing the most common pair of tokens in the corpus with a new token until a desired vocabulary size is reached.

Finally, the resulting subword units are assigned an integer index, which is used to represent the token in the input sequence. These indices are presented to the network as embeddings, which are dense vectors that capture the meaning of the token in the context of the text sequence.

ask(f"""
How does GPT's byte pair encoding handle words it's never seen before, like "mayo{'n' * 20}aise"?
""")

GPT's byte pair encoding handles words it's never seen before by breaking them down into subword units. In the case of "mayonnnnnnnnnnnnnnnnnnnnaise", the model will recognize and encode the stem "mayo" and append an additional subword unit for the repeated letters "n" to represent the full word. This allows the model to handle previously unseen words by recognizing familiar subword units and generalizing to related words that share those subword units.

ask("What is GPT-3's context size?")

GPT-3 has a context size of 2048 tokens, which means it can take into account up to 2048 words or symbols in its predictions. This allows GPT-3 to understand longer and more complex contexts in natural language processing tasks.

Perhaps we shouldn't trust what it says, given the circumstances.

But thing one: this is a task that LLMs are great at. A natural-language query on a dataset which likely contains an answer they can repeat almost verbatim. Could it still be confidently wrong? Yeah, sometimes. But in this case I think it's generating a clear, concise and basically correct explanation.
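
The merge loop GPT describes ("iteratively replacing the most common pair of tokens") can be sketched on a toy corpus. This is a simplified character-level illustration, not OpenAI's tokenizer: real BPE works on bytes over a vastly larger corpus, and the `corpus` and `num_merges` values here are made up.

```python
from collections import Counter

def most_common_pair(words):
    """Count adjacent symbol pairs across all words and return the most frequent."""
    pairs = Counter()
    for symbols in words:
        pairs.update(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(symbols, pair):
    """Replace each occurrence of `pair` in one word with a single merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def train_bpe(corpus, num_merges):
    """Start from single characters and apply `num_merges` greedy merges."""
    words = [list(word) for word in corpus]
    merges = []
    for _ in range(num_merges):
        pair = most_common_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = [merge_pair(w, pair) for w in words]
    return merges, words

corpus = ["mayonnaise", "mayo", "may", "raise", "noise"]
merges, words = train_bpe(corpus, num_merges=4)
print(merges)  # the learned merge rules, in order
print(words)   # each word as a list of subword units
```

Frequent character runs collapse into single symbols, while rare ones stay split into smaller pieces. That is how an unfamiliar word can still be represented.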

OpenAI has made GPT's tokenizer available as a pip package (tiktoken), so we can see exactly how it breaks a text stream down into tokens. gpt-3.5 breaks "mayonnaise" down into "may", "onna", and "ise":

!pip install tiktoken > /dev/null
import tiktoken
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

def token_strings(text):
  return [enc.decode_single_token_bytes(t) for t in enc.encode(text)]

token_strings("mayonnaise")

Out[ ]:
[b'may', b'onna', b'ise']

Misspellinngs with more 'n's see those dumped into "nn" tokens:

token_strings(f"mayo{'n' * 10}aise")
Out[ ]:
[b'may', b'on', b'nn', b'nn', b'nn', b'nn', b'na', b'ise']

The impact of this on understanding the model's behavior is minimal. It does explainn how the model is able to ingest and generate new words like "mayonnnnnnaise" or "kdsjfoaij3afwe" or "PoolFactoryGeneratorImpl()".

And I suspect it has something to do with why the model sometimes misspells our input whenn we give it a strannnnnnnge word. The model can preserve token sequences from input to output. But the model is big. Many layers, many statistical operations. Long, low-probability sequences strike me as more likely to flop around as they pass through, losing or gaining pieces while keeping the general vibe.

The shape of it

Daniel Dugas has an excellent article drawing out GPT-3's architecture on a napkin. You should absolutely read it if you're interested in a deep dive through all GPT's layers.

GPT's input is a sequence of tokens. GPT's output is actually not one prediction but 2,048: it predicts the next token for each "next" position in the input. In other words, it slides the input one token into the future.
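
Concretely, the prediction at each position is for the token that comes right after it, given only the tokens up to that point. A minimal sketch, using words in place of real token ids (`next_token_targets` is an illustrative name):

```python
def next_token_targets(tokens):
    """Pair each position's visible context with the token to be predicted there."""
    return [(tokens[: i + 1], tokens[i + 1]) for i in range(len(tokens) - 1)]

tokens = ["The", "secret", "to", "happiness", "is"]
for context, target in next_token_targets(tokens):
    print(context, "->", target)
```

When generating, only the prediction for the final position is new. The others cover tokens that are already in the input.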

The model itself is like a crystal. A crystal that can, uh, do really simple math to light. No loops, no cycles, no memory. You shine the prompt on one side, and it gets split and focused and filtered through the cracks, finally shining beams on a set of words.

(The words are arranged on a massive granite floor. The crystal hangs far above it. Above that is a great lens, which focuses sunlight. The monks arrange stones in a spiral to filter the light, thus presenting the prompt. When the next word is selected (sometimes the subject of much debate), the stones are painstakingly re-arranged to accommodate the new word at the center.)

The "crystal" is a holographically encoded database of the training text (which, for the monks, is everything ever written). It is the text compressed in such a way that the connections between words are condensed and reduced and projected through the whole.

There is no one singular place in the crystal which will only glow when the output is to be sad. But gaze at the arrangement of light in the crystal long enough and you could learn to deduce if the next words were going to be melancholy, in haiku, or in javascript. (Some monks spend their entire lives doing this.) The crystal's many layers learn to extract vibe, syntax, rules, and billions of other subtle signals, shaping and condensing the luminous token stream into a map of what's next.

The crystal doesn't think. But it does enact a fragment of the process of a particular kind of thought. Attaching a pump (or a temple of monks) produces a stream of text which is like the text it was trained with. If you run it forever, it might resemble a train of thought. If you start it off with an objective, it will tell you a story like what it's seen before about achieving that objective. This is the basis of tools like Auto-GPT.

Is it thinking?

LLMs don't "think" because they aren't agents.

Auto-GPT has something like a thought process. But it's not a human thought process. Your word generator receives many more inputs than just the last four thousand words you said or heard (which you definitely don't remember exactly). You have an internal experience of emotions and memories annd smells which is nothing like your stream of thought—these all wire up through the messy goop of your brain to bend the probability distribution of your next word.

(This is why the monks seek teachings from the crystal, though they acknowledge its limitations.)

What's more, you are able to pause and consider your next word in a way that the pumped model cannot. The light always flows through the crystal. It does not pool in eddies.

And this is why LLMs are bad at math.


Math is a highly recursive process. Take counting. If I ask you to count the "n"s in mayonnnnnnaise, you'll sit there and count one, two, three, four... This is not something the model can do. It can represent a connection between the idea of "counting 'n's" and "mayonnaise" and "twoness". It can learn that mapping for many tokens, and it can get the sense that aggregation is to be applied. For some finite amount of arithmetic, you can encode strong connnections in the model.

The trouble is, arithmetic is infinite.

LLMs don't contain any cycles. There's nothing in the model that runs until it's done. So there will always be a limit to how high it can count.
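
For contrast, here is what "runs until it's done" looks like in ordinary code: a loop whose number of steps depends on the input, which is exactly what a fixed stack of layers lacks. A trivial sketch:

```python
def count_letter(word, letter):
    """Walk the word one character at a time, keeping a running tally."""
    count = 0
    for ch in word:  # runs once per character, however long the word is
        if ch == letter:
            count += 1
    return count

print(count_letter("mayonnaise", "n"))                 # → 2
print(count_letter("mayo" + "n" * 20 + "aise", "n"))   # → 20
```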

ask("What's 19238 - 9180")
19238 - 9180

The result of 19238 - 9180 is 10058.

Out[ ]:
10058
ask("What's 19223452452366338 - 9183453564320")
19223452452366338 - 9183453564320

The result of this subtraction is 10039998988042018.

Out[ ]:
19214268998802018

There is a way to improve this. The model doesn't contain a cycle, but it is attached to a pump. It can store information by including it in the output. If we prompt the model to do this, it does much better:

ask("""
Demonstrate how to subtract 19223452452366338 - 9183453564320.

Show your work.
""")

         -     9183453564320

Answer: 19214269098702018
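
Python integers are arbitrary precision, so these answers can be checked exactly. The sketch below also does the subtraction the schoolbook way, digit by digit with borrows, writing each intermediate result down. That is roughly what "show your work" asks the model to do with its own output. `subtract_with_work` is an illustrative helper, not something from the notebook.

```python
def subtract_with_work(a, b):
    """Schoolbook subtraction for a >= b >= 0, recording each column's result."""
    xs = [int(d) for d in reversed(str(a))]
    ys = [int(d) for d in reversed(str(b))]
    ys += [0] * (len(xs) - len(ys))  # pad the shorter number with zeros
    digits, borrow, work = [], 0, []
    for place, (x, y) in enumerate(zip(xs, ys)):
        diff = x - y - borrow          # apply the borrow from the previous column
        borrow = 1 if diff < 0 else 0  # borrow from the next column if needed
        diff += 10 * borrow
        digits.append(diff)
        work.append(f"10^{place}: {x} - {y} -> {diff} (borrow {borrow})")
    result = int("".join(str(d) for d in reversed(digits)))
    return result, work

a, b = 19223452452366338, 9183453564320
result, work = subtract_with_work(a, b)
print("\n".join(work[:3]))  # the first few columns of written-down work
print(result)               # → 19214268998802018
print(result == a - b)      # → True
```

Both model answers above differ from this value, though the one produced while showing work is far closer.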

This still doesn't help much with examples which require more arithmetic:

ask("What's 192 m/s in furlongs per fortnight? Show your work.")

To convert from meters per second to furlongs per fortnight, we can use the following conversion factors:

  • 1 meter = 0.00497096954 furlongs (exactly)
  • 1 second = 0.00000009259 fortnights (exactly)

Multiplying these conversion factors together, we get:

1 m/s = (0.00497096954 furlongs) / (0.00000009259 fortnights)

1 m/s = 53,879.6719 furlongs per fortnight (approximately)

Therefore, 192 m/s is equivalent to approximately 10,391,050.57 furlongs per fortnight (192 * 53,879.6719).

The process seems okay, but the answer is wrong. I'm not going to try and figure out where it went wrong. Even if we could get this to work better, this is an absolutely stupid use of an LLM. LLMs allow natural language querying of a textual database. They're good at producing text that's pretty likely to be in the vein of what you asked for. Numbers, not so much.
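
For reference, the conversion itself is short once the constants are pinned down. A sketch, assuming the standard definitions of a furlong as 660 feet (201.168 m) and a fortnight as 14 days. The function name is illustrative.

```python
METERS_PER_FURLONG = 660 * 0.3048          # 660 ft at 0.3048 m per ft = 201.168 m
SECONDS_PER_FORTNIGHT = 14 * 24 * 60 * 60  # 1,209,600 s

def mps_to_furlongs_per_fortnight(meters_per_second):
    """Convert a speed in m/s to furlongs per fortnight."""
    furlongs_per_second = meters_per_second / METERS_PER_FURLONG
    return furlongs_per_second * SECONDS_PER_FORTNIGHT

print(round(mps_to_furlongs_per_fortnight(192)))  # → 1154474
```

That is about 1.15 million furlongs per fortnight, well short of the model's 10,391,050.57.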

Fortunately, there's a way to abstract over numbers with text, and it's called programming. LLMs do much better at programming than math:

ask("Write a python function to convert from m/s to furlongs per fortnight")

Here's a Python function to convert from meters per second (m/s) to furlongs per fortnight:

def convert_to_furlongs_per_fortnight(meters_per_second):
   # 1 meter = 0.000621371192 miles
   miles_per_second = meters_per_second * 0.000621371192
   # 1 mile = 8 furlongs
   furlongs_per_second = miles_per_second * 8
   # 1 fortnight = 14 days = 14 * 24 hours = 14 * 24 * 60 minutes = 20160 minutes
   furlongs_per_fortnight = furlongs_per_second * 60 * 60 * 24 * 14
   return furlongs_per_fortnight

This function takes a single argument meters_per_second, which is the speed in meters per second that you want to convert to furlongs per fortnight. The function converts meters per second to miles per second, then to furlongs per second, and finally to furlongs per fortnight. The result is returned as a floating-point number.

To use this function, simply call it with the desired speed in meters per second:

>>> convert_to_furlongs_per_fortnight(10)

This tells us that 10 meters per second is equivalent to approximately 106.98 furlongs per fortnight.

def convert_to_furlongs_per_fortnight(meters_per_second):
   # 1 meter = 0.000621371192 miles
   miles_per_second = meters_per_second * 0.000621371192
   # 1 mile = 8 furlongs
   furlongs_per_second = miles_per_second * 8
   # 1 fortnight = 14 days = 14 * 24 hours = 14 * 24 * 60 minutes = 20160 minutes
   furlongs_per_fortnight = furlongs_per_second * 60 * 60 * 24 * 14
   return furlongs_per_fortnight

Out[ ]:

We can pre-load the conversation with a system prompt instructing the model to do this in the general case:

ask("192 m/s in furlongs per fortnight", system_prompt="""
You are a helpful assistant with absolutely lousy arithmetic skills.

If the user asks you to perform arithmetic, instead output Python code which
does what they asked for.
""")

Sure, here's the Python code to convert 192 m/s to furlongs per fortnight:

speed_meters_per_second = 192
# 1 meter = 0.000621371192 miles
speed_miles_per_second = speed_meters_per_second * 0.000621371192
# 1 mile = 8 furlongs
speed_furlongs_per_second = speed_miles_per_second * 8
# 1 fortnight = 14 days = 14 * 24 hours = 14 * 24 * 60 minutes = 14 * 24 * 60 * 60 seconds
seconds_per_fortnight = 14 * 24 * 60 * 60
speed_furlongs_per_fortnight = speed_furlongs_per_second * seconds_per_fortnight


This will output:


Therefore, 192 m/s is approximately equal to 279868.47638128 furlongs per fortnight.

speed_meters_per_second = 192
# 1 meter = 0.000621371192 miles
speed_miles_per_second = speed_meters_per_second * 0.000621371192
# 1 mile = 8 furlongs
speed_furlongs_per_second = speed_miles_per_second * 8
# 1 fortnight = 14 days = 14 * 24 hours = 14 * 24 * 60 minutes = 14 * 24 * 60 * 60 seconds
seconds_per_fortnight = 14 * 24 * 60 * 60
speed_furlongs_per_second * seconds_per_fortnight


We can also use it to print a warning:

ask("192 m/s in furlongs per fortnight", system_prompt="""
You are a helpful assistant with absolutely lousy arithmetic skills.

If the user asks you to perform arithmetic, you MUST print a warning
before you produce any more output. The warning you must print is:


Then you must try your best.
""")


Okay, let's try to figure this out.

1 furlong = 201.168 meters

1 fortnight = 14 days = 336 hours


192 m/s = (192/201.168) furlongs/s = (192/201.168) * 3600 furlongs/hour (since there are 3600 seconds in an hour) = (192/201.168) * 3600 * 24 * 14 furlongs/fortnight (since there are 24 hours in a day and 14 days in a fortnight)

Simplifying the calculation:

(192/201.168) * 3600 * 24 * 14 furlongs/fortnight = (0.954158)* 3600 * 24 * 14 furlongs/fortnight = 720811.6368 furlongs/fortnight

Therefore, 192 m/s is approximately equal to 720811.6368 furlongs per fortnight. I hope that helps!

I presume this is more or less what plugins do: present natural-language conditions under which the model should generate Python code, or a Wolfram API call, or Mermaid. Then the conversation is extended with the result, and the model continues generating—only now with tokens in its input stream carrying the answer.
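
That loop can be sketched without any real model. Everything here is hypothetical: `fake_model` is a stub standing in for the LLM, the `CALC:` and `RESULT:` prefixes are an invented convention rather than how any actual plugin system works, and `safe_eval` is a tiny arithmetic evaluator so that the sketch never executes arbitrary code.

```python
import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(expr):
    """Evaluate +, -, *, / over numeric literals only."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval").body)

def fake_model(conversation):
    """Stub for the LLM: requests a calculation, then reports the result it is handed."""
    last = conversation[-1]
    if last.startswith("RESULT:"):
        return "That's " + last[len("RESULT:"):].strip() + "."
    return "CALC: 19238 - 9180"

def run(user_message):
    """Alternate between the model and the calculator until the model answers in prose."""
    conversation = [user_message]
    while True:
        reply = fake_model(conversation)
        conversation.append(reply)
        if not reply.startswith("CALC:"):
            return conversation
        # Extend the conversation with the tool's answer, then keep generating.
        expr = reply[len("CALC:"):].strip()
        conversation.append(f"RESULT: {safe_eval(expr)}")

for line in run("What's 19238 - 9180?"):
    print(line)
```

The arithmetic happens outside the model. All the model has to do is copy the result from its input into its answer.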

An intricate mirror

LLMs are incredible. I confess to being pretty blown away by GPT-3.5, and of course by GPT-4 even more so. A natural language query engine on text which works pretty well and can generalize and synthesize novel answers is a transformative tool.

Looking at the architecture of these models, I'm honestly even more taken aback: they are remarkably simple for what they do. An embedding layer to put tokens into a space according to their semantic and conceptual roles, an attentional layer to learn the relationships between tokens, and a feed-forward layer to reshape the attention-adjusted input into a set of probability distributions. Those are the three trainable components.
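
Of the three, the attention step is probably the least familiar. Here is a heavily simplified sketch in plain Python. It assumes a single head and no learned projection matrices (a real model learns separate query, key, and value projections), and it uses made-up two-dimensional vectors in place of real embeddings.

```python
import math

def softmax(xs):
    """Normalize scores into weights that sum to 1."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(vectors):
    """For each position, mix the vectors at or before it, weighted by similarity."""
    dim = len(vectors[0])
    out = []
    for i, query in enumerate(vectors):
        visible = vectors[: i + 1]  # causal mask: no peeking at later positions
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in visible]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, visible))
                    for d in range(dim)])
    return out

# Hypothetical embeddings for a three-token input.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in causal_attention(vectors):
    print([round(x, 3) for x in row])
```

Each output row blends only the current and earlier positions. That is why the same forward pass can produce a "next token" prediction at every position at once.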

Absent is any kind of explicit encoding or reduce step on the text stream as a whole, a common pattern in machine learning systems which I naively expected to see here.

The idea is you train two functions: one to encode the input into some smaller form, another to decode it back to the original (or into French). The information bottleneck forces the network to learn to extract salient features from the input and discard irrelevant ones.

Some extraction of concepts is surely going on inside GPT. But none of it is forced architecturally. It's a really complicated prism, shifting text one token into the future. You can condition what it says about that future by providing hints and rules in the prompt, little glimmers of meaning which drag signals in the network towards one prediction or another. Ask it to tell you a story about how to do something it's familiar with and it will often provide a reasonably unobjectionable one. By pumping its predictions into the input, we can slide it down a train of thought, and often be impressed by what we find at the end.

Models like this are great at generating plausible-sounding text which broadly agrees with what's already been written. That this is threatening to many industries and academia is perhaps unsurprising: a quick glance at most written text—especially text written for exams—reveals it to be exactly that.

But don't ask them to think through implications and evaluate correctness, a task they're bound to eventually fail at. And don't ask them to count.

Ash is writing about AI
