<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Standout Systems by Teodora: AI in 60 Seconds]]></title><description><![CDATA[Complex AI concepts explained clearly - by someone who builds these systems.]]></description><link>https://teodoracoach.substack.com/s/ai-in-60-seconds</link><image><url>https://substackcdn.com/image/fetch/$s_!SPkR!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea332445-8e4c-465e-af9c-2a97c29d2b6e_796x796.png</url><title>Standout Systems by Teodora: AI in 60 Seconds</title><link>https://teodoracoach.substack.com/s/ai-in-60-seconds</link></image><generator>Substack</generator><lastBuildDate>Sun, 05 Apr 2026 08:27:10 GMT</lastBuildDate><atom:link href="https://teodoracoach.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Teodora Szasz]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[dora@teodora.coach]]></webMaster><itunes:owner><itunes:email><![CDATA[dora@teodora.coach]]></itunes:email><itunes:name><![CDATA[Dr Teodora Szasz]]></itunes:name></itunes:owner><itunes:author><![CDATA[Dr Teodora Szasz]]></itunes:author><googleplay:owner><![CDATA[dora@teodora.coach]]></googleplay:owner><googleplay:email><![CDATA[dora@teodora.coach]]></googleplay:email><googleplay:author><![CDATA[Dr Teodora Szasz]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[King - Man + Woman = Queen. How AI Learned the Geometry of Meaning.]]></title><description><![CDATA[Watch now | AI in 60 Seconds - Episode 9]]></description><link>https://teodoracoach.substack.com/p/king-man-woman-queen-how-ai-learned</link><guid isPermaLink="false">https://teodoracoach.substack.com/p/king-man-woman-queen-how-ai-learned</guid><dc:creator><![CDATA[Dr Teodora Szasz]]></dc:creator><pubDate>Thu, 26 Mar 2026 17:04:05 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/192152206/2781d69591a638b745ff99cc55fd445e.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>Watch the video above. Below: the part that still blows my mind after years of working in AI.</p><p>Quick recap. Episode 8: everything gets chopped into tokens. But a TOKEN is just a symbol - &#8220;cat&#8221; or &#8220;un&#8221; or a 16&#215;16 image patch. Symbols mean nothing to a computer.</p><p>So how does the model go from symbols to understanding?</p><p><strong>Embeddings</strong>. And they are wild.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://teodoracoach.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Standout Systems by Teodora is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>Symbols in, meaning out</h2><p>An embedding converts a token into a long list of numbers. For GPT-5, each token becomes a vector of roughly 12,000 numbers. Think of those numbers as coordinates - a position and direction in a very high-dimensional space.</p><p>The critical property: <strong>similar meaning = pointing in the same direction.</strong></p><p>Standout Systems by Teodora is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p><p>&#8220;Happy&#8221; and &#8220;joyful&#8221; end up pointing nearly the same way. <br>&#8220;Happy&#8221; and &#8220;carburetor&#8221; point in completely different directions. <br>&#8220;Espresso&#8221; and &#8220;cappuccino&#8221; - almost parallel. <br>&#8220;Espresso&#8221; and &#8220;democracy&#8221; - nothing in common.</p><p>The model learns these positions during training. Nobody hand-places them. They emerge from billions of next-word predictions. Words that appear in similar contexts develop similar embeddings. Because if &#8220;happy&#8221; and &#8220;joyful&#8221; can replace each other in most sentences, the model learns to give them similar directions in embedding space.</p><p><strong>Meaning becomes geography.</strong></p><h2>How the model measures &#8220;similar&#8221;</h2><p>So if similar things point the same way - how does the model actually measure that?</p><p><strong>Cosine similarity</strong>. It&#8217;s simpler than it sounds. You can also find a deep dive on cosine similarity in the RAG for Healthcare series: </p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;68bdb8cd-fe04-4533-80a0-93a75b210fa8&quot;,&quot;caption&quot;:&quot;On Day 6, you learned that cosine similarity measures how &#8220;close&#8221; two pieces of text are in meaning. You used it to retrieve clinical guideline chunks. You saw scores like 0.92 (highly relevant) and 0.71 (borderline).&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Day 15: Cosine Similarity Deep Dive - The Math Made Simple With a Patient Analogy&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:422922028,&quot;name&quot;:&quot;Dr Teodora Szasz&quot;,&quot;bio&quot;:&quot;PhD &#183; Staff ML Scientist &#183; 10+ patents &#183; FDA-cleared algorithms &#183; Women in AI Award winner. I coach AI/ML professionals at teodora.coach because I know what it takes to stand out&#8212;I'm still doing it. 
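<p>To make &#8220;a long list of numbers&#8221; concrete, here is a minimal sketch of an embedding lookup in PyTorch. The sizes are made up for illustration - real models use far larger tables:</p><pre><code># A tiny embedding table: token ID in, vector out.
# Illustrative sizes only - not GPT-5's actual dimensions.
import torch
import torch.nn as nn

vocab_size = 50_000   # number of distinct tokens the tokenizer can produce
d_model = 768         # length of each token's vector

embed = nn.Embedding(vocab_size, d_model)

# A "sentence" arrives as a sequence of token IDs from the tokenizer.
token_ids = torch.tensor([412, 9031, 77])

vectors = embed(token_ids)
print(vectors.shape)  # torch.Size([3, 768]) - one vector per token
</code></pre><p>The table starts out random. Training is what nudges each row until tokens with similar meaning end up pointing in similar directions.</p>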
<h2>How the model measures &#8220;similar&#8221;</h2><p>So if similar things point the same way - how does the model actually measure that?</p><p><strong>Cosine similarity</strong>. It&#8217;s simpler than it sounds. You can also find a deep dive on cosine similarity in the RAG for Healthcare series: <a href="https://teodoracoach.substack.com/p/day-15-cosine-similarity-deep-dive">Day 15: Cosine Similarity Deep Dive - The Math Made Simple With a Patient Analogy</a>.</p><p>Imagine two arrows starting from the same point. Cosine similarity measures the angle between them:</p><ul><li><p>If they point in the same direction, angle close to zero, the similarity score is close to 1. </p></li><li><p>If they point in perpendicular directions, the score is close to 0. </p></li><li><p>If they point in opposite directions, the score is close to -1.</p></li></ul><p>That&#8217;s it. No complicated formula needed to understand it. </p><p>Just: <strong>same direction = similar meaning.</strong></p><p>Think of a compass:</p><ul><li><p>&#8220;Happy&#8221; and &#8220;joyful&#8221; are both pointing roughly northeast. </p></li><li><p>&#8220;Happy&#8221; and &#8220;sad&#8221; point in opposite directions. </p></li><li><p>&#8220;Happy&#8221; and &#8220;refrigerator&#8221; point in directions that have nothing to do with each other.</p></li></ul><p>The elegant part: cosine similarity doesn&#8217;t care about magnitude (how &#8220;far&#8221; a point is from the origin). Only the direction matters. <br><strong>A whispered &#8220;happy&#8221; and a shouted &#8220;HAPPY&#8221; might have different magnitudes, but they point the same way. The meaning is the same.</strong></p><p>This single measurement (the angle between two vectors) is the foundation of how AI decides whether two things are related. Every similarity comparison, every search-by-meaning, every &#8220;find me something like this&#8221; starts with cosine similarity on embeddings.</p><p>Remember this. It comes back in a big way when we talk about semantic search and RAG.</p>
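<p>If you do want the formula, it fits in one line: the dot product of the two vectors divided by the product of their lengths. A toy sketch with 3-dimensional vectors standing in for real embeddings:</p><pre><code>import numpy as np

def cosine_similarity(a, b):
    """1 = same direction, 0 = perpendicular, -1 = opposite."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 3-D vectors - real embeddings have thousands of dimensions.
happy  = np.array([0.9, 0.8, 0.1])
joyful = np.array([0.8, 0.7, 0.2])
sad    = np.array([-0.9, -0.8, -0.1])

print(cosine_similarity(happy, joyful))      # close to 1: similar meaning
print(cosine_similarity(happy, sad))         # close to -1: opposite direction
print(cosine_similarity(happy, 10 * happy))  # exactly 1: magnitude is ignored
</code></pre>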
<h2>Now here&#8217;s where it gets wild</h2><p>In 2013, researchers at Google trained a simple word embedding model called <a href="https://en.wikipedia.org/wiki/Word2vec">Word2Vec</a> and decided to visualize the results. What they found stunned the field.</p><p><strong>The vector from &#8220;king&#8221; to &#8220;queen&#8221; was almost identical to the vector from &#8220;man&#8221; to &#8220;woman.&#8221;</strong></p><p>King - Man + Woman = Queen. Actual vector arithmetic. Actually works.</p><p>But it wasn&#8217;t just gender. They found parallel structures everywhere:</p><p>&#8220;France&#8221; &#8594; &#8220;Paris&#8221; has the same direction as &#8220;Japan&#8221; &#8594; &#8220;Tokyo.&#8221; Country-to-capital is a consistent direction in embedding space.</p><p>&#8220;Walk&#8221; &#8594; &#8220;walked&#8221; has the same direction as &#8220;swim&#8221; &#8594; &#8220;swam.&#8221; Verb tense is a direction.</p><p>&#8220;Big&#8221; &#8594; &#8220;bigger&#8221; has the same direction as &#8220;small&#8221; &#8594; &#8220;smaller.&#8221; Comparative degree is a direction.</p><p><strong>Nobody programmed any of this.</strong> The model was just trained to predict nearby words. And from that simple task, it discovered that meaning has geometry. That abstract relationships - gender, tense, geography, degree - are directions you can measure, compare, and compute with.</p><p>This was the moment the field realized: <strong>something deep is happening in these representations. Not just pattern matching. Something that looks like structured understanding of the world.</strong></p>
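<p>You can try the arithmetic yourself. A sketch using the gensim library and a small pretrained GloVe model - any reasonable word-vector set shows the same effect, though the exact neighbors vary by model:</p><pre><code># pip install gensim; the model downloads (roughly 130 MB) on first use.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")

# most_similar builds the vector king - man + woman, then returns the
# words whose embeddings point closest to that direction.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# "queen" typically comes out at or near the top.
</code></pre>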
<h2>This connects everything</h2><p>If you have been following the series, embeddings are the thread that ties it all together.</p><p><strong>Self-supervised learning</strong> (<a href="https://teodoracoach.substack.com/p/the-most-important-idea-in-ai-youve">Episode 6</a>): the model predicts the next word and learns meaning. Embeddings are WHERE that meaning gets stored - as directions in high-dimensional space.</p><p><strong>Latent prediction</strong> (<a href="https://teodoracoach.substack.com/p/predict-pixels-or-predict-meaning">Episode 7</a>): when EchoJEPA predicts in latent space instead of pixel space, it&#8217;s predicting embeddings - points in a space where similar cardiac states point in the same direction, and where noise has been smoothed away by the EMA target encoder. <br><br><strong>The &#8220;latent space&#8221; IS an embedding space</strong>. And when we measure whether the model&#8217;s prediction is good, we&#8217;re essentially asking: does this prediction point in the same direction as the target? That&#8217;s cosine similarity at work.</p><p><strong>Attention</strong> (<a href="https://teodoracoach.substack.com/p/attention-the-engine-inside-every">Episode 2</a>): when a word &#8220;attends to&#8221; another word, it&#8217;s comparing their embeddings - asking &#8220;how relevant is your meaning to mine?&#8221; <strong>The Query and Key vectors in the attention mechanism are embeddings, and the attention score between them is related to their similarity in that space.</strong></p><p><strong>Embeddings aren&#8217;t just a step in the pipeline. They are the medium in which the entire model thinks.</strong></p><h2>The practical superpower: search by meaning</h2><p>Here&#8217;s the part that turns all of this from fascinating theory into something you can use.</p><p>Once meaning is geometry - once every piece of text, every image, every question is a direction in space - <strong>you can SEARCH by meaning. Not by keyword matching. By actual semantic similarity.</strong></p><p><strong>Embed a question. Embed a million documents. Compute the cosine similarity between your question and every document</strong>. The highest scores - the documents pointing most closely in the same direction as your question - are your most relevant results.</p>
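<p>In code, that retrieval step is just a matrix multiply. A minimal sketch with random stand-in vectors - in a real system the vectors come from an embedding model and live in an optimized vector index:</p><pre><code>import numpy as np

def search(query_vec, doc_vecs, top_k=3):
    """Rank documents by cosine similarity to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                       # cosine score for every document
    top = np.argsort(scores)[::-1][:top_k]
    return [(int(i), float(scores[i])) for i in top]

# Stand-in data: 100,000 "documents", 384 dimensions each.
rng = np.random.default_rng(0)
docs = rng.normal(size=(100_000, 384))
query = rng.normal(size=384)

print(search(query, docs))  # [(doc_index, score), ...], best match first
</code></pre>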
alt="" srcset="https://substackcdn.com/image/fetch/$s_!jfAO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13d33a53-bf0a-4f16-a343-1b252c011eed_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!jfAO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13d33a53-bf0a-4f16-a343-1b252c011eed_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!jfAO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13d33a53-bf0a-4f16-a343-1b252c011eed_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!jfAO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13d33a53-bf0a-4f16-a343-1b252c011eed_1024x559.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>King &#8722; Man + Woman = Queen.</strong></p><p>Nobody taught it that. The model discovered the shape of meaning.</p><div><hr></div><p><em>Next up: We have talked about tokens going IN to the Transformer. But there are actually two fundamentally different machines inside: one that reads and one that writes. <br><strong>What&#8217;s an encoder? What&#8217;s a decoder? And why does it matter?</strong></em></p><p><em>I&#8217;m Teodora - AI/ML scientist. The embedding space of heart ultrasounds is where I spend most of my research time. Subscribe to Standout Systems for more.</em></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://teodoracoach.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Standout Systems by Teodora is a reader-supported publication. 
]]></content:encoded></item><item><title><![CDATA[Tokens: AI's Real Alphabet (And Why It Explains So Many AI Quirks)]]></title><description><![CDATA[AI in 60 Seconds - Episode 8 Watch the video above. Below: the small thing that explains some of AI's biggest weirdnesses.]]></description><link>https://teodoracoach.substack.com/p/tokens-ais-real-alphabet-and-why</link><guid isPermaLink="false">https://teodoracoach.substack.com/p/tokens-ais-real-alphabet-and-why</guid><dc:creator><![CDATA[Dr Teodora Szasz]]></dc:creator><pubDate>Thu, 19 Mar 2026 17:03:42 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/191387239/1de870db3eab0641d03d2b2cdfa3b243.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>I have a confession. For seven episodes, I have been casually saying things like &#8220;words go into the Transformer&#8221; and &#8220;image patches enter the model.&#8221; It was a useful simplification. But today it&#8217;s time for the truth.</p><p>Transformers don&#8217;t see words. They see tokens. And this one detail - how data gets chopped up before the model ever touches it - explains an absurd number of things about AI that people find confusing, frustrating, or outright weird.</p><h2>What is a token?</h2><p>A token is the smallest unit a Transformer processes. Think of it as AI&#8217;s alphabet - except it&#8217;s not letters, and it&#8217;s not words. It&#8217;s something in between.</p><p>For text, the intuitive assumption is: one word = one token. And for common, short words, that&#8217;s true. &#8220;The&#8221; is one token. &#8220;Cat&#8221; is one token. &#8220;Run&#8221; is one token. &#8220;Paris&#8221; is one token.</p><p>But less common words get broken apart. &#8220;Unbelievable&#8221; becomes three tokens: &#8220;un&#8221; + &#8220;believ&#8221; + &#8220;able.&#8221; &#8220;Echocardiography&#8221; - a word I type daily - probably becomes four or five tokens. &#8220;Spaghettification&#8221; - the physics term for what a black hole does to you - might be six or seven.</p><p>The model splits text using an algorithm called Byte Pair Encoding (BPE). You start with individual characters and iteratively merge the most common pairs. After enough merging, you end up with a vocabulary - typically 30,000 to 100,000 tokens - that represents the most efficient way to encode the text the model was trained on.</p><p>Common patterns survive as single tokens. Rare ones get decomposed into familiar subparts. It&#8217;s actually similar to how your brain reads - you recognize frequent words instantly as whole units, but slow down and break unfamiliar words into syllables.</p>
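<p>The merge loop itself is short enough to sketch. A toy BPE trainer on a four-word corpus (real tokenizers work on bytes and vastly more data, but the idea is exactly this):</p><pre><code>from collections import Counter

# Toy corpus: each word split into characters, with a frequency count.
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
          ("n", "e", "w", "e", "s", "t"): 6, ("w", "i", "d", "e", "s", "t"): 3}

def best_pair(corpus):
    """Find the most frequent adjacent pair of symbols."""
    pairs = Counter()
    for word, freq in corpus.items():
        for pair in zip(word, word[1:]):
            pairs[pair] += freq
    return max(pairs, key=pairs.get)

def merge(corpus, pair):
    """Fuse every occurrence of the pair into a single new symbol."""
    out = {}
    for word, freq in corpus.items():
        merged, i = [], 0
        while i &lt; len(word):
            if word[i:i + 2] == pair:
                merged.append(word[i] + word[i + 1])
                i += 2
            else:
                merged.append(word[i])
                i += 1
        out[tuple(merged)] = freq
    return out

for _ in range(4):                   # a handful of merge steps
    pair = best_pair(corpus)
    corpus = merge(corpus, pair)
    print("merged:", "".join(pair))  # e.g. "es", then "est", then ...
</code></pre><p>Run it for tens of thousands of merges on terabytes of text instead of four words, and the surviving symbols are your token vocabulary.</p>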
<h2>The quirks this explains</h2><p>This sounds like boring preprocessing. It&#8217;s not. It explains things you&#8217;ve probably noticed and wondered about.</p><p><strong>&#8220;How many r&#8217;s in strawberry?&#8221;</strong></p><p>This is maybe the most famous AI failure. Ask almost any language model, and it&#8217;ll often get it wrong - saying 2 instead of 3. The internet loves mocking this. But the reason is simple: &#8220;strawberry&#8221; is tokenized as one or two tokens. The model never sees individual letters. It literally cannot count what it cannot see.</p><p>It&#8217;s like asking you how many times the letter &#8216;e&#8217; appears in &#8220;nevertheless&#8221; - but you can only see the word as a single shape, not as individual characters. You&#8217;d probably have to think about it too.</p><p><strong>AI struggles with arithmetic</strong></p><p>Ask an AI to add 1,847 + 3,926 and it might stumble. Not because it&#8217;s bad at math, but because those numbers don&#8217;t enter the model as clean digits. &#8220;1847&#8221; might be tokenized as &#8220;18&#8221; + &#8220;47&#8221; or &#8220;1&#8221; + &#8220;847&#8221; depending on the tokenizer. The model is trying to do math on broken-up number chunks. It&#8217;s like someone handed you an addition problem but scrambled which digits go together.</p><p>This is why newer models use special techniques for math - some process numbers digit by digit, effectively re-tokenizing for arithmetic tasks.</p><p><strong>Why AI sounds smarter in English</strong></p><p>Here&#8217;s a fairness issue hiding in tokenization that most people never think about.</p><p>Tokenizers are trained on data - and most tokenizer training data is heavily English. This means English text is tokenized efficiently: about 1.3 tokens per word on average. Common English words and patterns map to single tokens.</p><p>But other languages - Japanese, Chinese, Korean, Hindi, Arabic, and many more - are tokenized much less efficiently. The same meaning might require two to three times as many tokens.</p><p>This has cascading consequences. A 128,000-token context window holds roughly 100,000 words of English. But it might hold only 40,000-50,000 words&#8217; worth of Japanese content. Same model, same price, dramatically less capacity.</p><p>It also means the model has &#8220;seen&#8221; each Japanese token far fewer times during training than each English token, so its understanding is thinner. Response quality is lower. Translation is worse. Reasoning is weaker.</p><p>The tokenizer - this seemingly boring text preprocessing step - creates a structural disadvantage for billions of non-English speakers. It&#8217;s one of those responsible AI issues that hides in the infrastructure, invisible unless you know where to look.</p>
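<p>You can inspect real tokenizations yourself with OpenAI&#8217;s open-source tiktoken library. Exact splits depend on the vocabulary (this sketch uses cl100k_base), so treat the output as illustrative:</p><pre><code># pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["strawberry", "Unbelievable", "1847", "Echocardiography"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r}: {len(ids)} tokens {pieces}")
</code></pre><p>Try pasting the same paragraph in English and in Japanese and comparing the token counts - the efficiency gap described above shows up immediately.</p>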
<h2>Tokens beyond text</h2><p>If you&#8217;ve been following the series, you already understand visual tokenization - you just didn&#8217;t know that&#8217;s what it was called.</p><p>In Episode 3, I explained that Vision Transformers cut an image into a grid of patches - say 16&#215;16 pixels each. Each patch IS a token. It gets projected into a numerical representation and enters the Transformer alongside positional information so the model knows where the patch came from in the original image.</p><p>For video - our domain - each space-time patch is a token. A small cube spanning a few pixels across a few consecutive frames. In EchoJEPA, every heart ultrasound video becomes a sequence of hundreds of visual tokens. Attention runs across all of them - spatial and temporal - letting the model track how cardiac structures move over time.</p><p>For audio, the equivalent is spectral patches - small windows of the audio frequency spectrum become tokens.</p><p>The unifying principle: regardless of modality, step one is always the same. Chop the raw input into tokens. Give each token a numerical representation. Feed the sequence into the Transformer. This is the universal interface. Text tokens, image tokens, video tokens, audio tokens - the Transformer doesn&#8217;t care. It just processes sequences.</p><h2>The tokenization design choice</h2><p>Here&#8217;s something that connects back to Episode 7. When we talked about pixel prediction vs. latent prediction in EchoJEPA, we were really talking about two different levels of tokenization.</p><p>The MAE approach tokenizes at the pixel level - each patch token represents raw pixel values. The JEPA approach tokenizes at the latent level - each token represents a meaningful abstraction where noise has been smoothed away.</p><p>Same video. Same patches. But what those tokens CONTAIN - raw pixels or cleaned-up representations - changes what the model learns. The token is just the container. What you put inside determines everything.</p><h2>The one thing to remember</h2><p>A token is AI&#8217;s fundamental unit of perception. Everything - words, numbers, image patches, video frames, audio segments - gets converted into tokens before a Transformer can process it.</p><p>And the way you tokenize shapes everything downstream: what the model can see (individual letters? no), how much context it can hold (depends on your language), who it works well for (English speakers, mostly), and what it struggles with (counting, arithmetic, anything that requires seeing what the tokenizer hid).</p><p>Next time an AI does something weird, before blaming the model, ask yourself: what did the tokens look like?</p><div><hr></div><p><em>Next up: OK, so we have tokens. But a token is still just a symbol - &#8220;un&#8221; or &#8220;believ&#8221; or a 16&#215;16 image patch. The Transformer speaks math. How does a word become a number? How does that number carry MEANING? That&#8217;s EMBEDDINGS - and it&#8217;s where the magic truly lives. Wednesday.</em></p>
<p><em>I&#8217;m Teodora - AI/ML scientist. Follow Standout Systems for more AI in about 60 seconds.</em></p>]]></content:encoded></item><item><title><![CDATA[Predict Pixels or Predict Meaning? The Decision That Defined Our Heart AI]]></title><description><![CDATA[AI in 60 Seconds - Episode 7]]></description><link>https://teodoracoach.substack.com/p/predict-pixels-or-predict-meaning</link><guid isPermaLink="false">https://teodoracoach.substack.com/p/predict-pixels-or-predict-meaning</guid><dc:creator><![CDATA[Dr Teodora Szasz]]></dc:creator><pubDate>Sat, 14 Mar 2026 17:12:39 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/190538839/f74517fe39fa5b0f4b9de546fe8a6675.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p><em>Watch the video above. <br><br>Below: the full story of a design choice that changed our AI model&#8217;s performance by 26.7% - and what it teaches about building AI that actually works in the real world.</em></p><div><hr></div><p>Last episode I explained <a href="https://teodoracoach.substack.com/p/the-most-important-idea-in-ai-youve">self-supervised learning</a> - hide part of the data, predict what&#8217;s missing, learn from the process. Simple, elegant, powerful.</p><p>But I left you with a teaser: WHAT you predict matters as much as the prediction itself.</p><p>This episode is the payoff. And it&#8217;s personal - this is the single design decision we are most proud of in building <a href="https://echojepa.com/">EchoJEPA</a>.</p><h2>Two art students and a photograph</h2><p>Let me start with an analogy.</p>
<p>Imagine two art students. You show both of them a photograph with a section cut out, and you ask them to fill in the missing area.</p><figure><img src="https://substackcdn.com/image/fetch/$s_!H8XV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4756b430-72bc-4e1b-93f0-84c430c463c0_1011x385.png" alt="" width="1011" height="385"></figure>
<p><strong>Student A</strong> is meticulous. They study the surrounding pixels, the exact color values, the grain of the paper, every tiny imperfection - and try to reproduce the missing region as a pixel-perfect copy. If there&#8217;s a smudge on the lens, they reproduce the smudge. If there&#8217;s film grain, they reproduce the grain.</p><p><strong>Student B</strong> takes a different approach. They look at the scene and think: &#8220;OK, there&#8217;s a tree here, the light source is to the left, the leaves are this shade of green, there&#8217;s a shadow at this angle.&#8221; Then they fill in the missing region based on their understanding of what&#8217;s IN the image - the objects, the structure, the relationships.</p><p><strong>Student A works in pixel space</strong> - predicting the literal surface values. <strong>Student B works in latent space</strong> - predicting the meaning behind the surface.</p><p>For a clean, high-quality photograph, both students do well. The difference is barely noticeable.</p><p>But now give them a noisy image. And watch what happens.</p><h2>The noise problem in ultrasound</h2><p><strong>Echocardiography</strong> (heart ultrasound) is not like a photograph. Every ultrasound image contains speckle noise: a shimmering, grainy texture that&#8217;s fundamental to how ultrasound physics works. Sound waves bounce off tissue, interfere with each other, and create patterns that look random.</p>
<figure><img src="https://substackcdn.com/image/fetch/$s_!N50T!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1a9dbeb-1b77-4465-b602-398e3a49b06b_948x641.jpeg" alt="Echocardiogram (echo)" width="948" height="641"></figure>
srcset="https://substackcdn.com/image/fetch/$s_!N50T!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1a9dbeb-1b77-4465-b602-398e3a49b06b_948x641.jpeg 424w, https://substackcdn.com/image/fetch/$s_!N50T!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1a9dbeb-1b77-4465-b602-398e3a49b06b_948x641.jpeg 848w, https://substackcdn.com/image/fetch/$s_!N50T!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1a9dbeb-1b77-4465-b602-398e3a49b06b_948x641.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!N50T!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1a9dbeb-1b77-4465-b602-398e3a49b06b_948x641.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>And crucially: <strong>speckle IS random</strong>. It&#8217;s different in every single frame, even when the heart underneath is in the exact same position doing the exact same thing. Frame 1 and frame 2 might show the same ventricle, but the speckle pattern is completely different.</p><p>This creates a problem for <strong>Student A - the pixel predictor</strong>.</p><p>When you ask a model to predict the exact pixel values of a masked region in an ultrasound image, the model MUST reproduce the speckle noise. It&#8217;s part of the pixels. The model can&#8217;t distinguish &#8220;these pixels represent the heart wall&#8221; from &#8220;these pixels represent random speckle.&#8221; It&#8217;s all just numbers.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://teodoracoach.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Standout Systems by Teodora is a reader-supported publication. 
<p>So the model dedicates significant capacity - significant learning - to modeling the patterns in randomness. It tries to learn the structure of noise.</p><p>It&#8217;s like asking someone to memorize TV static. There&#8217;s nothing to memorize. But the model will try.</p><p>The result: a model that&#8217;s mediocre at understanding hearts because it spent too much of its learning budget on understanding noise.</p><h2>Student B: predict in latent space</h2><p>Our solution was to become Student B. Don&#8217;t predict pixels. Predict in latent space: compressed, meaningful representations that capture what&#8217;s in the image rather than what the image literally looks like.</p><p>This is the JEPA approach: Joint-Embedding Predictive Architecture.</p><p>Here&#8217;s how it works mechanically. We have two encoders:</p><p><strong>The main encoder</strong>: this is the model we&#8217;re training. It processes the visible (unmasked) parts of the video and tries to predict representations of the masked parts.</p><p><strong>The target encoder</strong>: this processes the full video (including the masked parts) and produces the &#8220;correct answer&#8221; the main encoder is trying to match.</p><p>The target encoder is an EMA (Exponential Moving Average) of the main encoder. Instead of being trained directly, it&#8217;s updated as a running average of the main encoder&#8217;s weights over thousands of training steps.</p><p>This detail sounds minor. It&#8217;s the whole game.</p><h2>Why averaging kills noise and preserves signal</h2><p>Here&#8217;s the key insight - and once you see it, you can&#8217;t unsee it.</p><p>Speckle noise is random. In frame 1, a particular pixel might be bright due to noise. In frame 2, that same pixel might be dark. In frame 3, medium. Over thousands of frames, the noise at any given location varies randomly around some average value.</p><p>When the EMA target encoder averages over thousands of training steps, each step processing different frames with different noise patterns, the random variations cancel out. Noise that&#8217;s different every time averages toward zero.</p><p>But the heart is consistent. The left ventricle wall is in the same place, frame after frame. It contracts and relaxes in the same pattern. The valve opens and closes at the same position. Consistent structure, repeated across every frame, accumulates through averaging. It gets reinforced.</p><p>The EMA target encoder produces representations where the noise has been naturally smoothed away and the anatomy has been naturally amplified. Clean, stable descriptions of cardiac structure - for free, through the mathematics of averaging.</p><p>When we train the main encoder to predict these clean targets from noisy inputs, it learns: &#8220;ignore the noise, find the structure.&#8221; It learns to extract meaning from messy data. It learns hearts.</p>
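<p>For readers who like to see the mechanics: the EMA update is a few lines. A minimal sketch (illustrative, not the EchoJEPA training code) of how the target encoder trails the main encoder:</p><pre><code>import copy
import torch

def ema_update(target_encoder, main_encoder, tau=0.999):
    """target = tau * target + (1 - tau) * main, parameter by parameter."""
    with torch.no_grad():
        for t, m in zip(target_encoder.parameters(),
                        main_encoder.parameters()):
            t.mul_(tau).add_(m, alpha=1 - tau)

main_encoder = torch.nn.Linear(16, 8)         # stand-in for a video encoder
target_encoder = copy.deepcopy(main_encoder)  # starts as an exact copy

# Inside the training loop, after every gradient step on main_encoder:
ema_update(target_encoder, main_encoder)
</code></pre><p>With tau = 0.999, each snapshot contributes a tiny fraction, so the target is effectively an average over roughly the last thousand steps - which is exactly where the noise-cancelling comes from.</p>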
<h2>The evidence</h2><p>Theory is nice. Numbers are better.</p><p><strong>Accuracy:</strong> On estimating left ventricular ejection fraction (the single most important measurement in clinical cardiology) our <a href="https://echojepa.com/">JEPA approach</a> outperformed the Masked Autoencoder (pixel prediction) baseline by 26.7%. Not a marginal improvement. A generational leap.</p><p><strong>Robustness:</strong> We deliberately degraded image quality to simulate real-world conditions - poor acoustic windows, probe movement, patient habitus issues. Our model showed 2.3% performance degradation. The best pixel-level baseline showed 16.8%. Our approach was 86% less sensitive to image quality.</p><p><strong>Transfer:</strong> Trained entirely on adult echocardiography. Tested on pediatric hearts (different size, different proportions, faster heart rates). Zero additional training. Mean absolute error: 4.32 (ours) versus 5.10 (baseline). The foundation transferred to a population it had never seen.</p><p><strong>Attention visualization:</strong> When we mapped where the model focuses its attention, it consistently localized on cardiac structures - chambers, valves, walls - not image artifacts or noise. The model learned anatomy. Evidence that it learned the right thing.</p><h2>The broader principle</h2><p>This extends far beyond heart ultrasound. Any time your data has noise, irrelevant variation, or messy real-world conditions (which is almost all real-world data) the choice of what you predict matters enormously.</p><p>Manufacturing images have lighting variation. Satellite images have atmospheric distortion. Audio recordings have background noise. Medical images have scanner-specific artifacts.</p><p>In every case, predicting raw pixel values forces the model to model the mess. Predicting in a cleaned-up latent space focuses the model on what matters.</p><p>It&#8217;s a design philosophy: don&#8217;t ask the model to learn everything. Ask it to learn what&#8217;s important.</p><h2>The one thing to remember</h2><p>The space you predict in determines what the model learns.</p><p>Predict pixels and you learn the surface - including all the noise, artifacts, and irrelevant variation. Predict meaning and you learn the structure - the anatomy, the patterns, the relationships that actually matter.</p><p>Sometimes the most important decision in building an AI system isn&#8217;t the architecture, the dataset, or the compute budget. It&#8217;s what you ask the model to do.</p><div><hr></div><p><em>Next up: We keep saying &#8220;words go in&#8221; and &#8220;patches go in.&#8221; But HOW does data actually enter a Transformer? What IS a token? The answer is simpler than you think - and it explains some weird things about AI you have probably noticed.</em></p><p><em>I&#8217;m Teodora - AI/ML scientist and co-author on EchoJEPA. This design decision - latent prediction over pixel prediction - is the one I&#8217;m most proud of. Subscribe to Standout Systems for more.</em></p>
]]></content:encoded></item><item><title><![CDATA[The Most Important Idea in AI You've Probably Never Heard Of]]></title><description><![CDATA[AI in 60 Seconds - Episode 6]]></description><link>https://teodoracoach.substack.com/p/the-most-important-idea-in-ai-youve</link><guid isPermaLink="false">https://teodoracoach.substack.com/p/the-most-important-idea-in-ai-youve</guid><dc:creator><![CDATA[Dr Teodora Szasz]]></dc:creator><pubDate>Sat, 28 Feb 2026 18:03:13 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/189328075/6fdaac03f5f4cb95a5aeae1f370ebc07.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>Watch the video above. <br><br>Below: why self-supervised learning changed everything - and why your brain has been doing it your whole life.<br><br><a href="https://teodoracoach.substack.com/p/when-ai-surprises-its-own-creators">Last episode</a> I told you about emergence: <strong>how foundation models develop abilities nobody programmed</strong>. Arithmetic appearing from nowhere. Cardiac anatomy discovered without labels.</p><p>But I skipped the question that makes all of that possible: <strong>if nobody labeled the data, how did the model learn?</strong></p><p>The answer is called <strong>self-supervised learning</strong>. It&#8217;s arguably the single most important idea in modern AI. And it starts with something your brain has been doing since the day you were born.</p><h2>You already do this</h2><p>Right now, as you read this sentence, your brain is predicting. Before your eyes reach each word, your brain has already generated a probabilistic guess about what&#8217;s coming next. When the word matches your prediction - you read faster, barely registering it. When it doesn&#8217;t match, you slow down. You notice. You update your understanding.</p><p>That mismatch - prediction versus reality - is the <strong>learning signal</strong>.</p><p>You do this constantly. Walking through a doorway, your brain predicts the room on the other side. Listening to music, it predicts the next note. Having a conversation, it predicts how the sentence ends. You are a prediction machine, and the errors are how you learn about the world.</p><p>Self-supervised learning takes this principle and formalizes it for AI. Take your data. Hide part of it. Ask the model to predict the hidden part. <strong>If it gets good at predicting, it must have learned something real.</strong></p><p>The data provides its own supervision. No external labels needed. No humans in the loop. Just: hide, predict, learn.</p>
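<p>Here is the smallest concrete version of &#8220;hide, predict, learn&#8221; - next-word prediction on text, which we will meet again in a moment. A minimal sketch, assuming some model that maps token IDs to next-token logits:</p><pre><code>import torch.nn.functional as F

def next_token_loss(model, token_ids):
    """token_ids: (batch, seq_len) tensor of tokenized text."""
    inputs = token_ids[:, :-1]    # everything except the last token
    targets = token_ids[:, 1:]    # the same sequence, shifted left by one
    logits = model(inputs)        # (batch, seq_len - 1, vocab_size)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
</code></pre><p>No labels anywhere - the target column is just the input column moved over by one position.</p>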
<h2>Three flavors, one principle</h2><p>The beauty is that this works for ANY type of data. The format changes. The principle never does.</p><p><strong>Text: predict the next word.</strong></p><p>&#8220;The capital of France is ___.&#8221; Paris. &#8220;After the rain, the streets were ___.&#8221; Wet. &#8220;E equals MC ___.&#8221; Squared.</p><p>GPT is trained this way. Every large language model is. <strong>Hide the next word, predict it, learn from billions of sentences</strong>. By getting good at this prediction task, the model learns grammar, facts, reasoning, common sense, humor, cultural references - none of which were in the training objective. The only objective was: <strong>predict the next word</strong>.</p><p>I wrote about this in an earlier article: <a href="https://teodoracoach.substack.com/p/why-your-phones-autocomplete-is-terrible?utm_source=publication-search">the &#8220;sophisticated autocomplete&#8221; framing</a>. And it&#8217;s still the most accurate description. But autocomplete at this scale, with this much data, produces something that looks astonishingly like understanding.</p><p><strong>Images: predict the missing patch.</strong></p><p>Take a photograph. Mask out 75% of it: random patches removed, like a jigsaw puzzle with most pieces missing. Ask the model to predict what was there. If it can fill in the missing eye in a face, the missing wheel on a car, the missing branch on a tree - it must have learned what those objects look like, how they&#8217;re structured, how they relate to each other.</p><p>This is how <a href="https://ai.meta.com/research/publications/masked-autoencoders-as-spatiotemporal-learners/">Meta&#8217;s MAE (Masked Autoencoder)</a> works. Mask aggressively. Predict. Learn visual representations from millions of unlabeled images.</p><p><strong>Audio: predict the missing chunk.</strong></p><p>Remove a segment from a speech recording or a song. Ask the model to predict what goes in the gap. If it can reconstruct the missing melody, the missing phonemes, the missing rhythm, it learned the structure of sound.</p><p>This is how speech models like <a href="https://ai.meta.com/research/impact/wav2vec/">wav2vec</a> learn. Same principle, different modality.</p><p><strong>Video: predict across space AND time.</strong></p><p>This is our domain. Take a heart ultrasound video. Mask out small space-time regions - cubes spanning a few pixels across a few frames. Ask: what&#8217;s missing?</p><p>If the model can predict what a masked region of a beating heart looks like in frame 15 based on the surrounding spatial context and the temporal progression from frames 1 through 14 - it must have learned how hearts look, how they move, what&#8217;s normal, and what&#8217;s not.</p><p>One principle. Every modality. <strong>HIDE, PREDICT, LEARN</strong>.</p>
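<p>The masking step itself is almost embarrassingly simple. A sketch of MAE-style random masking over embedded image patches (illustrative shapes; real models also attach positional information):</p><pre><code>import torch

patches = torch.randn(196, 768)   # e.g. a 14x14 grid of embedded patches
mask_ratio = 0.75                 # hide three quarters of the image

num_keep = int(patches.size(0) * (1 - mask_ratio))
perm = torch.randperm(patches.size(0))
visible_idx = perm[:num_keep]     # patches the encoder is allowed to see
masked_idx = perm[num_keep:]      # patches the model must predict

visible = patches[visible_idx]    # encoder input: 49 of 196 patches
# The training target is the content of patches[masked_idx]:
# raw pixels for MAE, latent representations for JEPA (see below).
</code></pre>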
<h2>Why this single idea changed everything</h2><p>Before self-supervised learning, AI hit a wall. The wall wasn&#8217;t data: data was everywhere. The wall was labels.</p><p>Training a supervised model to detect heart disease requires a cardiologist to annotate each image. At scale - 18 million videos - that means thousands of doctors working for years. Millions of dollars. The bottleneck was never the data itself. It was the human effort required to make the data usable.</p><p>Self-supervised learning removed the bottleneck entirely.</p><p>Instead of requiring humans to label every example, the model creates its own training signal from the raw data. <strong>And suddenly, ALL data becomes training data</strong>. Every book ever written. Every medical image in every hospital archive. Every video ever recorded. Every audio file, every satellite image, every code repository.</p><p>This is why foundation models exploded. Not because of a new architecture - <a href="https://teodoracoach.substack.com/p/what-is-a-transformer-and-why-you">Transformers</a> had existed since 2017. Not because of new hardware - GPUs had been improving steadily. But because <strong>self-supervised learning gave us a way to USE all the data</strong>. The data was always there. The method to learn from it changed.</p><p>The internet wasn&#8217;t a dataset until self-supervised learning made it one. Hospital archives weren&#8217;t training data until we learned to train without labels.</p><h2>The version we built: what happens in latent space</h2><p>Now, there&#8217;s a nuance here that sets up next episode&#8217;s topic.</p><p>When you do self-supervised learning on images, you have a choice. You can predict the exact pixel values of the masked region (<a href="https://arxiv.org/abs/2111.06377">the MAE approach</a>). Or you can predict something more abstract: a compressed representation in latent space (<a href="https://ai.meta.com/blog/yann-lecun-ai-model-i-jepa/">the JEPA approach</a>).</p><figure><img src="https://substackcdn.com/image/fetch/$s_!Wiil!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6106224a-b296-48d0-915c-043502e9e3c3_1920x1080.png" alt=""><figcaption>JEPA&#8217;s architect, Meta&#8217;s Chief AI Scientist, Yann LeCun.</figcaption></figure><p>For our echocardiography model, this choice mattered enormously. Ultrasound has speckle noise - random grainy texture that&#8217;s different every frame. Predicting exact pixels means predicting noise. Predicting in latent space means the noise gets averaged away and the model focuses on consistent structure - the actual anatomy.</p><p>We will go deep on this next episode. For now, just know: <strong>WHAT you predict in self-supervised learning matters as much as the prediction task itself.</strong></p>
<h2>What the model discovered</h2><p>Here&#8217;s the part that still gives me chills.</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;95cea153-e707-4f4a-8c8b-b8c8729a1fef&quot;,&quot;duration&quot;:null}"></div><p>We trained <a href="https://echojepa.com/">EchoJEPA</a> on 18 million echocardiography videos. The only task: predict masked space-time regions in latent space. We never told it what a ventricle is. Never showed it an anatomy textbook. Never labeled a single frame.</p><p>After training, we visualized where the model&#8217;s attention focuses. It had organized itself around the left ventricle, the right ventricle, the valve positions, the wall boundaries. It discovered the functional anatomy of the heart - from raw video alone.</p><p>Then we tested it on pediatric hearts - completely different from the adult hearts it trained on. Different size, different proportions, faster rates. Zero additional training. And it worked. The foundation transferred.</p><p>The model taught itself hearts. By predicting missing pieces. From 18 million unlabeled videos.</p><p><strong>Hide. Predict.</strong> <strong>Learn</strong>. That&#8217;s it. And from that simple idea - everything in modern AI follows.</p><figure><img src="https://substackcdn.com/image/fetch/$s_!ndYl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F429e3f6e-f125-4022-88e1-ca5e1a6366a4_945x517.png" alt=""></figure><h2>The one thing to remember</h2><p><strong>Self-supervised learning: your data is the teacher.</strong>
Hide part of it, predict it back, and understanding emerges.</p><figure><img src="https://substackcdn.com/image/fetch/$s_!YDLL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed4c8e45-3f3e-4948-a341-f6bf8d662a96_945x516.png" alt=""></figure><p>It&#8217;s how GPT learned language. It&#8217;s how Vision Transformers learned to see. It&#8217;s how our model learned cardiac anatomy. And it&#8217;s the reason foundation models exist - because self-supervised learning unlocked all the data that was sitting there, waiting to become knowledge.</p><div><hr></div><p><em>Next up: when the model predicts the missing piece, <strong>WHAT should it predict? Exact pixels - or something deeper?</strong> That choice is the most important design decision we made in <a href="https://echojepa.com/">EchoJEPA</a>. And it changed everything.</em></p><p><em>I&#8217;m Teodora: AI/ML scientist. I build foundation models that teach themselves from unlabeled clinical data. Follow Standout Systems for more AI in about 60 seconds.</em></p>]]></content:encoded></item><item><title><![CDATA[When AI Surprises Its Own Creators]]></title><description><![CDATA[AI in 60 Seconds - Episode 5]]></description><link>https://teodoracoach.substack.com/p/when-ai-surprises-its-own-creators</link><guid isPermaLink="false">https://teodoracoach.substack.com/p/when-ai-surprises-its-own-creators</guid><dc:creator><![CDATA[Dr Teodora Szasz]]></dc:creator><pubDate>Thu, 26 Feb 2026 18:05:19 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/188945885/0af3c8602e9c940db5fca1fcafbccd21.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>Watch the video above. <br>Below: the part of AI that genuinely keeps researchers up at night - in a good way and a bad way.<br><br>Last episode I explained the three steps of LLM training: pretraining on massive data, fine-tuning to be helpful, and RLHF to learn values. Straightforward enough.</p><p>But here&#8217;s what I left out: something happens at scale that nobody fully understands. And it&#8217;s the reason the term &#8220;foundation model&#8221; exists - and the reason some very smart people are both thrilled and worried.</p><h2>The arithmetic mystery</h2><p>In 2022, researchers at Google published a paper that documented something strange.</p><p>They took language models of increasing size - same architecture, same training data, just more parameters - and tested them on a bunch of tasks. For most tasks, performance improved gradually.
A little bigger, a little better. Expected.</p><p>But for some tasks, performance did something else entirely.</p><p>At one billion parameters: near-zero accuracy on three-digit arithmetic. The model can&#8217;t add 472 + 385. At ten billion: still near-zero. Still can&#8217;t do it. At sixty billion parameters: suddenly 80%+ accuracy. The model can now do arithmetic. Not a little. A lot.</p><p>The jump wasn&#8217;t gradual. It was a phase transition - like water going from liquid to gas. One moment the ability isn&#8217;t there. The next moment it is.</p><p>This happened with task after task. Multi-step logical reasoning: nothing, nothing, nothing, suddenly yes. Translating between language pairs that weren&#8217;t explicitly in the training data: nothing, nothing, suddenly yes. Understanding sarcasm and irony: same pattern.</p><p>Nobody programmed these abilities. The training objective was always the same: predict the next word. But at sufficient scale, capabilities EMERGED that weren&#8217;t in the training signal.</p><p>Researchers call these emergent capabilities. And they are the defining feature of foundation models.</p><figure><img src="https://substackcdn.com/image/fetch/$s_!9yS8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee45a5fe-17a7-4eed-a903-5afca7e9c3c5_981x513.png" alt=""></figure><h2>What actually IS a foundation model?</h2><p>The term was coined in a 2021 Stanford paper, and the definition is specific. A foundation model has two properties:</p><p><strong>Scale.</strong> Not trained on a curated dataset for a single task. Trained on an enormous, broad dataset - the internet, millions of images, millions of videos. The scale is what enables emergence. You can&#8217;t get emergent capabilities from a small model; there aren&#8217;t enough parameters for the complex internal representations to form.</p><p><strong>Generality.</strong> Because it&#8217;s trained on broad data with a general objective (predict the next word, predict the missing patch), the model develops general-purpose understanding that can be adapted to virtually any downstream task. One model, many applications.
That&#8217;s the &#8220;foundation&#8221; - the base everything else is built on.</p><p>GPT-4 is a foundation model. So is Gemini. Apple&#8217;s AFM. Claude. Llama. And - in a different domain - EchoJEPA.</p><h2>&#8220;But wait - emergence might be a mirage&#8221;</h2><p>I want to be honest here because this is an active debate in the field, and I think you deserve the real picture.</p><p>In 2023, a group of Stanford researchers published a paper arguing that emergence might be partially an illusion of measurement. Their argument: if you measure model performance using metrics that have sharp thresholds (like exact-match accuracy - either the answer is exactly right or it&#8217;s wrong), you&#8217;ll see sudden jumps. But if you use smoother metrics (like how close the model gets to the right answer), the improvement looks more gradual.</p><p>In other words: maybe the model IS getting better at arithmetic at 10 billion parameters - just not well enough to get the exact right answer. And at 60 billion, it crosses the threshold of &#8220;close enough to be exactly right.&#8221; The phase transition might be in the metric, not the model.</p><p>This is a legitimate critique. And it&#8217;s probably partially true for some capabilities.</p><p>But not all. Some capabilities - like spontaneously performing tasks in languages the model saw very little of during training, or learning to use tools it was never explicitly trained to use - are harder to explain as measurement artifacts. Something IS happening at scale that goes beyond gradual improvement. We just don&#8217;t fully understand what.</p><p>And that honesty - &#8220;we don&#8217;t fully understand this&#8221; - is important. Because it&#8217;s the reason responsible AI work matters so much. If we can&#8217;t predict what capabilities will emerge at the next scale, we also can&#8217;t predict what FAILURES will emerge.</p>
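<p>The measurement side of that debate is easy to demonstrate with toy numbers. Suppose per-token accuracy on an arithmetic answer improves smoothly with scale (the figures below are invented for illustration, not taken from the paper):</p><pre><code># A smooth ability, viewed through two different metrics.
per_token = [0.5, 0.7, 0.9, 0.97]   # made-up per-token accuracy at 4 model sizes
answer_len = 10                     # tokens that must ALL be right for exact match

exact_match = [p ** answer_len for p in per_token]
print([round(x, 3) for x in exact_match])
# [0.001, 0.028, 0.349, 0.737]  -> looks like a sudden jump
print(per_token)
# [0.5, 0.7, 0.9, 0.97]         -> looks gradual
</code></pre><p>Same underlying improvement; the &#8220;phase transition&#8221; appears only in the all-or-nothing metric. That&#8217;s the mirage argument in four lines - and, as noted above, it likely explains some of the reported jumps, though not all.</p>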
<h2>I saw emergence in hearts</h2><p>This isn&#8217;t just a language model phenomenon. I saw it firsthand with EchoJEPA.</p><p>We trained a Vision Transformer on 18 million echocardiography videos. The training objective: predict masked regions in latent space. We never taught it anatomy. Never labeled a heart structure. Never told it what a ventricle is, where valves are, what normal motion looks like.</p><p>But when we visualized the model&#8217;s attention patterns - where it focuses when processing a video - it had learned to localize on cardiac chambers. On valve positions. On wall motion. The internal representations organized themselves around anatomically meaningful structures.</p><p>That&#8217;s emergence. The model discovered cardiac anatomy because understanding anatomy is useful for predicting what a missing region of heart video looks like. Nobody programmed that understanding. It emerged from the task and the scale.</p><p>And then something even more unexpected. We trained exclusively on adult echocardiography. Adults. Full-grown hearts with adult proportions. Then we tested on pediatric hearts - smaller, different chamber proportions, faster heart rates. Zero additional training. Zero pediatric data.</p><p>It worked. Mean absolute error of 4.32 compared to 5.10 for the best baseline. The foundation model had learned something general enough about cardiac anatomy and motion that it transferred across patient populations it had never seen.</p><p>That&#8217;s the power of a foundation. At sufficient scale, it doesn&#8217;t just memorize - it understands. And understanding transfers.</p><figure><img src="https://substackcdn.com/image/fetch/$s_!6CGZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2c8db26-68f7-4f77-9c91-d6983fe6e4dd_970x468.png" alt=""></figure><h2>The honest picture: exciting and concerning</h2><p>I want to be transparent about both sides because this is where a lot of AI commentary goes wrong - people either say &#8220;emergence is magical and AI is about to solve everything&#8221; or &#8220;emergence is terrifying and we should stop.&#8221;</p><p><strong>What&#8217;s exciting:</strong> Foundation models develop capabilities nobody explicitly designed, which means they can solve problems we haven&#8217;t thought to train them for. EchoJEPA&#8217;s zero-shot transfer to pediatric hearts wasn&#8217;t planned - it was discovered. This kind of unexpected generalization is how foundation models could accelerate medical research, scientific discovery, and creative work in ways we can&#8217;t fully anticipate.</p><p><strong>What&#8217;s concerning:</strong> If we can&#8217;t predict what capabilities emerge at the next scale, we also can&#8217;t predict what failure modes emerge. A model might develop the ability to produce convincing misinformation as an emergent capability of getting really good at language. It might develop biases that only appear at scale because they require complex reasoning to manifest. This is why responsible AI evaluation can&#8217;t be a one-time checklist - it needs to be continuous and proactive, testing for capabilities and failures we haven&#8217;t seen yet.</p><p>In my world - healthcare AI - this tension is very real. The same emergence that lets EchoJEPA discover cardiac anatomy without labels could, in theory, also learn subtle biases in the data - like patterns correlated with which patients get referred for imaging in the first place. That&#8217;s why our evaluation includes subgroup analysis across demographics, multi-site validation, and deliberate stress testing. You can&#8217;t trust emergence blindly. You have to verify what it learned.</p><h2>The one thing to remember</h2><p>A foundation model isn&#8217;t just a big model. It&#8217;s a model where scale produces something qualitatively new - capabilities that weren&#8217;t programmed, weren&#8217;t predicted, and aren&#8217;t fully understood.</p><p>That&#8217;s what makes them powerful. And that&#8217;s what makes Responsible AI work essential.</p><div><hr></div><p><em>Next up: Foundation models learn from massive data.
But nobody labeled that data. How does a model teach itself? Self-supervised learning - the trick that makes everything we discussed today possible. Stay tuned. </em></p><p><em>I&#8217;m Teodora - AI/ML scientist. I build foundation models for healthcare where emergence is both the breakthrough and the thing that keeps me carefully checking everything twice. Subscribe to Standout Systems for more.</em></p>]]></content:encoded></item><item><title><![CDATA[How Is an LLM Actually Trained? (It's Not Just "Read the Internet")]]></title><description><![CDATA[AI in 60 Seconds - Episode 4]]></description><link>https://teodoracoach.substack.com/p/how-is-an-llm-actually-trained-its</link><guid isPermaLink="false">https://teodoracoach.substack.com/p/how-is-an-llm-actually-trained-its</guid><dc:creator><![CDATA[Dr Teodora Szasz]]></dc:creator><pubDate>Tue, 24 Feb 2026 18:12:25 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/188865308/94339210daf328dee3e22dbc971b73a5.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>Watch the video above. Below: the full story - and why Step 3 is the one that keeps me up at night.</p><p>I saw a post this week that got a lot of engagement. <br>The gist: AI was trained on the internet, the same internet full of misinformation and conspiracy theories, and now we&#8217;re asking it to help us make decisions.</p><p>It&#8217;s a fair point. But it&#8217;s incomplete.
And the incomplete version is actually more dangerous than the full picture - because it makes people either dismiss AI entirely or trust it blindly, when neither is appropriate.</p><p>So here&#8217;s the full picture. Three steps. Each one fundamentally changes what the model is.</p><h2>Step 1: Pretraining - &#8220;Read everything, predict the next word&#8221;</h2><p>This is the step that post was describing. And yes - it&#8217;s exactly as wild as it sounds.</p><p>You take an enormous amount of text. Books. Wikipedia. News articles. Academic papers. Reddit threads. Code repositories. Forum posts. Some filtered, some not. Trillions of tokens.</p><p>Then you give the model a single task: given the beginning of a sentence, predict the next word.</p><p>&#8220;The capital of France is ___.&#8221; &#8594; Paris. &#8220;Photosynthesis converts sunlight into ___.&#8221; &#8594; energy. &#8220;The earth is ___.&#8221; &#8594; ... well, it depends on context.</p><p>If the model gets good at prediction across trillions of examples, something remarkable happens. It doesn&#8217;t just memorize sequences - it learns grammar, facts, reasoning patterns, even rudimentary common sense. All from the pressure of predicting what comes next.</p><p>I wrote about this in a <a href="https://teodoracoach.substack.com/p/why-your-phones-autocomplete-is-terrible?utm_source=publication-search">previous article</a> &#8212; the &#8220;sophisticated autocomplete&#8221; framing. And I still think it&#8217;s the most honest description of what&#8217;s happening. The model is a prediction engine. But prediction at this scale produces something that looks a lot like understanding.</p><p>Here&#8217;s the critical thing about this step though: <strong>after pretraining, you don&#8217;t have ChatGPT.</strong> You have a base model. And a base model will happily write you a conspiracy theory with the same fluency as a physics textbook. It learned the patterns of ALL the text - the knowledge and the nonsense. It doesn&#8217;t know which is which.</p><p>The base model is not the model you talk to.</p>
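<p>For the curious: the core of the pretraining objective fits in a few lines. A minimal sketch in PyTorch (illustrative only - the model itself and the data pipeline are omitted), showing that the &#8220;label&#8221; is just the input shifted by one token:</p><pre><code>import torch.nn.functional as F

# tokens: a batch of token IDs, shape (batch, seq_len)
# logits: next-token scores from the model, shape (batch, seq_len, vocab_size)
def pretraining_loss(logits, tokens):
    pred = logits[:, :-1, :]   # predictions at positions 0 .. n-2
    targets = tokens[:, 1:]    # the actual next tokens - labels from the data itself
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), targets.reshape(-1))
</code></pre><p>Every optimization step nudges the model toward assigning higher probability to the word that actually came next - repeated across trillions of tokens.</p>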
<h2>Step 2: Fine-tuning - &#8220;Learn to be helpful&#8221;</h2><p>The base model is brilliant but useless as an assistant. Ask it &#8220;What&#8217;s the capital of France?&#8221; and it might continue writing a Wikipedia paragraph instead of answering your question. It doesn&#8217;t know it&#8217;s supposed to be <em>helping</em> you. It just predicts text.</p><p>Fine-tuning fixes this. Human trainers write thousands of example conversations:</p><p><em>User: What&#8217;s the capital of France?</em> <em>Good response: The capital of France is Paris.</em></p><p><em>User: Can you help me write an email?</em> <em>Good response: Of course! What&#8217;s the context and tone you&#8217;re going for?</em></p><p>The model learns from these examples. It learns the FORMAT of being an assistant - answer questions directly, be concise, ask for clarification when needed, stay helpful.</p><p>This step transforms the model from a general text predictor into something that actually tries to respond usefully. But it still has a problem: it doesn&#8217;t have strong judgment about what it SHOULD and SHOULDN&#8217;T say.</p><h2>Step 3: RLHF - &#8220;Learn right from wrong&#8221;</h2><p>This is the step most people don&#8217;t know about. And it&#8217;s the most important one for responsible AI.</p><p>RLHF stands for Reinforcement Learning from Human Feedback. Here&#8217;s how it works:</p><p>The model generates two different responses to the same prompt. Human evaluators look at both and choose: which one is better? More helpful? More accurate? Safer? Thousands of these comparisons.</p><p>From all those human judgments, you train a reward model - a separate AI that learns to predict which responses humans would prefer. Then you use that reward model as a coach: the main model generates a response, the reward model says &#8220;good&#8221; or &#8220;not good,&#8221; and over millions of iterations, the model learns to consistently produce preferred outputs.</p>
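<p>If you like seeing the machinery, the reward model&#8217;s core training objective is remarkably small. A minimal sketch (hypothetical names; real systems add plenty of engineering around this core):</p><pre><code>import torch.nn.functional as F

# reward_model: maps (prompt, response) to a single scalar score.
# chosen / rejected: the response the human preferred, and the other one.
def preference_loss(reward_model, prompt, chosen, rejected):
    score_chosen = reward_model(prompt, chosen)
    score_rejected = reward_model(prompt, rejected)
    # Pairwise (Bradley-Terry) objective: push the preferred response
    # to score higher than the rejected one.
    return -F.logsigmoid(score_chosen - score_rejected).mean()
</code></pre><p>Every human comparison becomes one training pair, and thousands of such pairs teach the reward model what &#8220;better&#8221; means - which is exactly why who the evaluators are matters so much.</p>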
<p>This is where the model learns: don&#8217;t repeat conspiracy theories as fact. Don&#8217;t generate instructions for harmful activities. Acknowledge when you&#8217;re uncertain. Refuse requests that could cause harm.</p><p>Is it perfect? No. The quality of RLHF depends entirely on the quality of the human evaluation - who the evaluators are, what guidelines they follow, what biases they bring. If the evaluation only covers English, the model won&#8217;t be equally safe in Japanese. If the evaluators have blind spots, the model inherits them.</p><p>This is exactly the kind of problem I work on. Building robust evaluation pipelines, ensuring diverse representation in the evaluation process, measuring fairness across demographics and languages. The alignment step is where responsible AI engineering matters most.</p><h2>Why the three-step picture matters</h2><p><strong>That viral post implied: AI learned from garbage, therefore AI outputs garbage.</strong></p><p><strong>The reality is more nuanced. The base model learned from everything - knowledge and nonsense alike. Fine-tuning taught it to be an assistant. RLHF taught it values and judgment.</strong></p><p><strong>The model that read the internet is not the model you talk to.</strong> Three transformations happened in between.</p><figure><img src="https://substackcdn.com/image/fetch/$s_!jaEd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1076710b-7b2c-445e-8854-8cf1cae06032_928x341.png" alt=""></figure><p>Does this mean current LLMs are perfectly safe? Absolutely not. RLHF can be gamed. Safety training can be bypassed. The models still hallucinate - confidently stating things that aren&#8217;t true. The evaluation pipelines have gaps.</p><p>But understanding the full training process matters because it tells you WHERE the problems live:</p><p><strong>Hallucination?</strong> Mostly a pretraining problem - the model learned to always sound confident because confident text was in the training data.</p><p><strong>Bias?</strong> Could be in any of the three steps - biased training data, biased fine-tuning examples, or biased human evaluators in RLHF.</p><p><strong>Safety failures?</strong> Usually a gap in Step 3 - an adversarial prompt the safety training didn&#8217;t anticipate.</p><p>When you understand where problems come from, you stop saying &#8220;AI is garbage&#8221; or &#8220;AI is magic&#8221; and start asking the right question: <strong>where specifically is this model likely to fail, and how do I account for that?</strong></p><p>That&#8217;s AI literacy. And that&#8217;s a career superpower.</p><h2>The one thing to remember</h2><p>Three steps. Pretraining gives the model knowledge (and nonsense). Fine-tuning gives it manners. RLHF gives it judgment.</p><p>The internet trained the base model. Humans trained the assistant.</p><div><hr></div><p><em>I&#8217;m Teodora - AI/ML scientist. I build AI systems where the gap between Steps 1 and 3 can mean the difference between a correct diagnosis and a missed one.
Subscribe to Standout Systems for more.</em></p>]]></content:encoded></item><item><title><![CDATA[Same Trick, But for Images: Vision Transformers Explained]]></title><description><![CDATA[Watch now | AI in 60 Seconds - Episode 3]]></description><link>https://teodoracoach.substack.com/p/same-trick-but-for-images-vision</link><guid isPermaLink="false">https://teodoracoach.substack.com/p/same-trick-but-for-images-vision</guid><dc:creator><![CDATA[Dr Teodora Szasz]]></dc:creator><pubDate>Sun, 22 Feb 2026 18:03:15 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/188689871/cdbf8ba9352badc06056f2b95d78af92.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>Watch the video above for the 120+ second version. Below: the part that didn't fit.<br><br>Here&#8217;s the thing that makes the Transformer architecture genuinely beautiful: it doesn&#8217;t care what you feed it.</p><p>Words? Sure. Image patches? Also works. Audio chunks? Yep. Video frames? That too.</p><p>The last two episodes covered how <a href="https://teodoracoach.substack.com/p/what-is-a-transformer-and-why-you">Transformers</a> work on text - <a href="https://teodoracoach.substack.com/p/attention-the-engine-inside-every">self-attention</a> letting every word figure out which other words matter. Today: how the exact same idea works on images.</p><h2>The patch trick</h2><p>A Transformer expects a sequence. Text is already a sequence - words in a row. Images aren&#8217;t. They&#8217;re a 2D grid of pixels.</p><p>So you make them a sequence. Take an image, divide it into a grid of small patches - say 16&#215;16 pixels each. Flatten the grid into a row.
Now you have a sequence of patches, just like a sequence of words.</p><figure><img src="https://substackcdn.com/image/fetch/$s_!Sfg_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F529015ae-801d-421b-a3ec-753b2cb6e24d_997x310.png" alt=""></figure>
tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Each patch gets embedded - converted into a numerical representation the model can work with. Then you feed the whole sequence into a standard Transformer and let self-attention run.</p><p>That&#8217;s it. That&#8217;s a Vision Transformer - ViT. Published by Google in 2020. Same attention mechanism from the 2017 &#8220;Attention Is All You Need&#8221; paper, applied to image patches instead of words.</p><h2>What attention sees in an image</h2><p>This is where it gets interesting. When attention runs across patches of an image, each patch learns to look at the patches that are most relevant to it.</p><p>A patch containing a dog&#8217;s ear attends to the patch containing the other ear and the face. A patch showing sky doesn&#8217;t attend much to a patch showing a shoe. The model builds an understanding of the image by figuring out which parts relate to which other parts - no one programs these relationships. 
They emerge from the data.</p><figure><img src="https://substackcdn.com/image/fetch/$s_!9tZb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01f788ab-788f-4fa5-8bfe-125d2ba43c36_1015x444.png" width="1015" height="444" alt=""></figure>
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>And because attention is global - every patch can attend to every other patch regardless of distance - the model handles long-range relationships naturally. The top-left corner of an image can directly attend to the bottom-right. In a traditional convolutional network, that kind of long-range connection requires many stacked layers. In a Vision Transformer, it&#8217;s one attention computation.</p><h2>From images to video: space-time patches</h2><p>Here&#8217;s where it clicked for me personally.</p><p>If you can cut a 2D image into patches, you can cut a 3D video into small cubes - patches that span both space and time. These are sometimes called tubelets. Each tubelet captures a small region of a few consecutive frames.</p><p>Feed those into a Transformer, run attention, and the model can track how things move and change across time. A patch showing the left ventricle wall in frame 1 attends to the same region in frame 10 - learning motion patterns.</p><p>This is exactly what we built in <a href="https://substack.com/@teodora485587/note/p-187210316">EchoJEPA</a>. A Vision Transformer with 1.1 Billion parameters (ViT-G), processing heart ultrasound videos as space-time patches. Self-attention across all patches, across all frames. The model learned cardiac anatomy and motion from 18 million videos without any human labels telling it what a heart looks like.</p><p>And here&#8217;s the part that still amazes me: when we visualize where the attention focuses, it localizes on actual cardiac structures. Not image noise, not artifacts - the heart. The model discovered anatomy through attention alone.</p><h2>Why one architecture for everything matters</h2><p>This is the deeper point. The fact that the same Transformer architecture &#8212; same attention, same scaling - works for text AND images AND video means you can build models that understand multiple modalities at once.</p><p>Feed an image and a question into the same model. The text tokens attend to the image patches. The image patches attend to the text. That&#8217;s how Vision-Language Models work - the kind Apple uses for Apple Intelligence features that understand both what you see and what you type.</p><p>One architecture. Any input. That&#8217;s not just convenient engineering. 
It&#8217;s a conceptual breakthrough.</p><h2>The one thing to remember</h2><p>Cut it into patches. Run attention. Whether it&#8217;s a sentence, a photograph, or a heart ultrasound video - the Transformer doesn&#8217;t care. It just learns what to pay attention to.</p><div><hr></div><p><em>I&#8217;m Teodora - AI/ML scientist. I co-built a foundation model for cardiac imaging using the exact architecture I just described. Subscribe to Standout Systems for more.</em></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://teodoracoach.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Standout Systems by Teodora is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[Attention - The Engine Inside Every AI You Use]]></title><description><![CDATA[AI in 60 Seconds - Episode 2. Your Transformer has one trick. It's called attention. Here's how it works.]]></description><link>https://teodoracoach.substack.com/p/attention-the-engine-inside-every</link><guid isPermaLink="false">https://teodoracoach.substack.com/p/attention-the-engine-inside-every</guid><dc:creator><![CDATA[Dr Teodora Szasz]]></dc:creator><pubDate>Fri, 20 Feb 2026 18:03:19 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/188436525/eef1ec8c556d9ec28311abb67ad2428b.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>Watch the video above for the 150-second version. Below: the stuff I couldn't fit.<br><br>Last episode I explained what a Transformer is - an architecture that sees the entire input at once instead of reading one word at a time.</p><p>But I glossed over the most important part. <em>How</em> does it know what to pay attention to?</p><p>The answer is a single mechanism. It&#8217;s called self-attention. 
And once you understand it, you understand the engine inside every major AI system today.</p><h2>The crowded room</h2><p>Here&#8217;s how I think about it.</p><figure><img src="https://substackcdn.com/image/fetch/$s_!mR7Z!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38ebbb25-608f-43b8-8f10-03e9ecfc435f_480x270.gif" width="480" height="270" alt=""></figure>
type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>You walk into a crowded room with a specific question in mind. Everyone&#8217;s wearing a name tag describing what they know. You scan the tags, find the people who match your question, and go listen to what they have to say.</p><p>That&#8217;s self-attention in three steps:</p><p><strong>Query</strong> - your question. Each word generates one. The word &#8220;bank&#8221; might ask: &#8220;am I about money or about rivers?&#8221;</p><p><strong>Key</strong> - the name tag. Every other word advertises what it contains. The word &#8220;deposit&#8221; holds up a key that says &#8220;finance.&#8221; The word &#8220;river&#8221; holds up a key that says &#8220;nature.&#8221;</p><p><strong>Value</strong> - the actual information. Once &#8220;bank&#8221; finds that &#8220;deposit&#8221; is highly relevant (high attention score), it collects the information &#8220;deposit&#8221; carries. Now &#8220;bank&#8221; has been enriched with financial context.</p><p>Every word does this with every other word, simultaneously. That&#8217;s the &#8220;self&#8221; in self-attention - the sentence attending to itself.</p><h2>Multiple spotlights, not just one</h2><p>If this happened once, it would be useful but limited. One question per word, one type of relationship.</p><p>So Transformers run multiple attention computations in parallel. Each one is called an <strong>attention head</strong>, and each head learns to look for a different kind of relationship:</p><p>One head might learn grammar - which verb belongs to which subject, even when they&#8217;re far apart.</p><p>Another might learn meaning - grouping words by topic, connecting &#8220;climate&#8221; in paragraph one to &#8220;emissions&#8221; in paragraph three.</p><p>Another might track references - figuring out that &#8220;she&#8221; in sentence five refers to &#8220;Dr. Chen&#8221; in sentence two.</p><p>Together, it&#8217;s like having multiple spotlights in that crowded room, each one illuminating a different dimension of relationships. That&#8217;s <strong>multi-head attention</strong> - and it&#8217;s why Transformers understand language so much better than anything that came before.</p><h2>Why this matters outside of text</h2><p>Here&#8217;s what excites me as someone who builds AI for healthcare. 
<h2>Why this matters outside of text</h2><p>Here&#8217;s what excites me as someone who builds AI for healthcare. The attention mechanism doesn&#8217;t care whether the input is words, image patches, or frames of a video.</p><p>When we built <a href="http://echojepa.com/">EchoJEPA</a> - our foundation model for reading heart ultrasounds - the model uses attention to figure out which regions of the heart are relevant to each other across frames. The left ventricle wall in frame 10 &#8220;attends to&#8221; its position in frame 1 to track motion. Same mechanism, completely different domain.</p><p>And here&#8217;s a bonus: those attention patterns are interpretable. When we visualize where the model focuses, we can see it localizing on actual cardiac anatomy - not image artifacts. That&#8217;s not just cool engineering. For clinical AI, it&#8217;s evidence that the model learned the right thing.</p><h2>The one-liner</h2><p>Every word asks &#8220;who matters to me?&#8221;, finds the answer, and collects the information. Multiply that by several parallel spotlights. That&#8217;s attention.</p><p>Or as the paper that started it all said: Attention Is All You Need.</p><div><hr></div><p><em>I&#8217;m Teodora - AI/ML scientist, building AI systems in healthcare. Subscribe to Standout Systems for more.</em></p><p>Read more about why I created this series and my background as an AI scientist here: <a href="https://substack.com/@teodora485587/note/c-215967569">https://substack.com/@teodora485587/note/c-215967569</a></p>
]]></content:encoded></item><item><title><![CDATA[What IS a Transformer? (And Why You Should Care)]]></title><description><![CDATA[Watch the video above for the 60+ second version.
Below: the stuff I couldn't fit in 60 seconds.]]></description><link>https://teodoracoach.substack.com/p/what-is-a-transformer-and-why-you</link><guid isPermaLink="false">https://teodoracoach.substack.com/p/what-is-a-transformer-and-why-you</guid><dc:creator><![CDATA[Dr Teodora Szasz]]></dc:creator><pubDate>Wed, 18 Feb 2026 18:05:46 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/188343249/fb67446f00da45bd21e85aac1d953bd7.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>Welcome to the first episode of <strong>AI in 60 Seconds</strong>. Today, we are tackling the backbone of modern AI: <strong>Transformers</strong>.</p><p>I started this series because while AI terms are everywhere, clear explanations are rare. I want to cut through the jargon and offer real understanding from the perspective of someone who actually builds these systems.</p><p>Read more about why I created this series and my background as an AI scientist here: <a href="https://substack.com/@teodora485587/note/c-215967569">https://substack.com/@teodora485587/note/c-215967569</a><br><br>OK so here&#8217;s a fun fact. The architecture behind ChatGPT, Gemini, Claude, Apple Intelligence, and basically every AI you&#8217;ve touched this week - it all comes from one 2017 paper called &#8220;<a href="https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf">Attention Is All You Need</a>.&#8221;</p><p>Eight authors. Eight pages. Changed everything.</p><p>So what did they actually figure out?</p><h2>The old way was painfully slow</h2><p>Before Transformers, AI read text the way you&#8217;d read a book if you could only see one word at a time and your memory faded with every step.</p><p>That&#8217;s what Recurrent Neural Networks (RNNs) did. Word by word. Sequentially. By the time the model reached the end of a paragraph, it had mostly forgotten the beginning.</p><p>People tried to fix this with Long Short-Term Memory networks (LSTMs) - basically giving the model a notebook to jot down important things. Better memory, same problem: still sequential. Still slow. Still couldn&#8217;t look at two distant words at the same time.</p><h2>The breakthrough: just look at everything at once</h2><p>The Transformer said: what if we stop reading one word at a time entirely?</p><p>Instead, it sees the entire input simultaneously and uses something called <strong>self-attention</strong> to figure out which words are relevant to which other words. Every word can attend to every other word, in parallel.</p><p>The word &#8220;bank&#8221; doesn&#8217;t stay generic. By the time it passes through the Transformer, it <em>knows</em> whether it means a river bank or a financial bank - because it looked at every other word in the sentence to figure that out.</p><p><strong>The analogy I keep coming back to:</strong> going from reading one word at a time with a fading memory... to seeing the entire page at once.</p><h2>Why did this win?</h2><p>Three reasons (with a small sketch after the list):</p><p><strong>It&#8217;s parallel.</strong> Every word processed at the same time = you can throw massive GPU clusters at it. RNNs couldn&#8217;t do this. That sequential bottleneck was a dealbreaker at scale.</p><p><strong>It scales.</strong> More data + more parameters = better results, in a way that keeps working. That&#8217;s why we went from millions to billions to trillions of parameters. The architecture doesn&#8217;t break.</p><p><strong>It works for everything.</strong> Text, images, audio, video. The same attention mechanism that helps a language model understand sentences helps a Vision Transformer understand images. I know this firsthand - when we built EchoJEPA, a foundation model for reading heart ultrasounds trained on 18 million videos, we used a Vision Transformer as the backbone. Same core idea, completely different domain.</p>
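<p>The promised sketch - a toy NumPy comparison, not a real RNN or Transformer, to make the parallelism point concrete. The recurrent update has to run step by step; the all-pairs attention scores fall out of a single matrix product:</p><pre><code>import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 8                           # 6 tokens, 8-dim states
x = rng.normal(size=(T, d))           # token embeddings

# RNN-style: step t cannot start until step t-1 has finished
W = rng.normal(size=(d, d)) * 0.1
h = np.zeros(d)
for t in range(T):                    # inherently sequential
    h = np.tanh(W @ h + x[t])

# Attention-style: every token scored against every other token at once -
# one matrix product, trivially parallel across a GPU
scores = x @ x.T / np.sqrt(d)         # (T, T) all-pairs relevance
</code></pre>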
<h2>The one thing to remember</h2><p>A Transformer sees everything at once and learns what to pay attention to.</p><p>That&#8217;s the whole insight. And it&#8217;s why every major AI system in 2025 is built on it.</p><div><hr></div><p><em>I&#8217;m Teodora - AI/ML scientist, building AI systems in healthcare. Follow Standout Systems for more.</em></p>]]></content:encoded></item></channel></rss>