Rendered at 23:41:39 GMT+0000 (Coordinated Universal Time) with Cloudflare Workers.
aabhay 1 days ago [-]
In Gemini at least, if you look at how they process PDFs, they do an OCR and then feed the text + image to the model, without charging you for the text tokens (I believe).
So my guess is that Claude’s backend is doing the same — so this hack is probably more of a loophole in token accounting that might get closed if Claude is doing what Gemini does
hn_throwaway_99 1 days ago [-]
This is really fascinating to me. I was reading this article and originally agreed with you, "I mean, under the covers it's got to be converting to text tokens at some point, so there is no way it's actually cheaper for Claude itself to execute."
But then there is a comment below talking about how DeepSeek was able to get a huge improvement in compression by using visual tokens, https://news.ycombinator.com/item?id=48777848. I don't fully understand all of the underlying technical details so I am still fundamentally baffled about how going the OCR route could actually result in overall electricity/computational savings.
yorwba 1 days ago [-]
LLMs have a very bloated in-memory representation for text, on the order of megabytes of KV cache per byte of text. Meanwhile, for images a lossy representation is considered acceptable and it only takes up maybe a kilobyte of KV cache per byte of image. So if you can render your text into a hundred bytes of image per byte of text and then lossily expand it into 100 kB of KV cache per byte of text, you come out ahead!
Whether such lossy compression is acceptable for your use case is up to you.
Taek 1 days ago [-]
I don't think it's that bad, if I recall correctly it's about 8 kilobytes per token, and a token can be 3-4 characters so you're talking ~2 kilobytes per character.
An image token I recall is something like 16x16, so you get 32 bytes of overhead per pixel. And a character is minimally like 20 pixels including the whitespace, so you've jumped from 4 characters per token to maybe 12.
So 3x savings... which actually maps pretty closely to 60% savings.
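For what it's worth, the back-of-envelope numbers above can be checked in a few lines of Python. The 16x16 patch, the 20 pixels per character, and the 4 characters per text token are just the rough figures quoted in this comment, not measurements of any particular model.

```python
# Rough figures from the comment above; all three are assumptions, not measurements.
PATCH_SIDE = 16            # pixels per side of one image token
PIXELS_PER_CHAR = 20       # a small glyph including surrounding whitespace
CHARS_PER_TEXT_TOKEN = 4   # typical for English text


def chars_per_image_token(patch_side=PATCH_SIDE, pixels_per_char=PIXELS_PER_CHAR):
    """How many characters fit in one image patch at the assumed density."""
    return (patch_side * patch_side) / pixels_per_char


def token_reduction(chars_per_text_token=CHARS_PER_TEXT_TOKEN):
    """Fraction of tokens saved by sending the same text as an image."""
    ratio = chars_per_image_token() / chars_per_text_token
    return 1 - 1 / ratio


print(chars_per_image_token())      # 12.8
print(round(token_reduction(), 4))  # 0.6875
```

Under these assumptions a roughly 3x gain in characters per token is about a two-thirds reduction in token count, which is in the same ballpark as the 60% figure.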
esafak 1 days ago [-]
Can we get some refs for this number? If true it sounds like poor design.
Tuna-Fish 1 days ago [-]
It's not quite as bad as the parent made it out to be, the largest I've seen is 32kB per token (where sometimes, a token represents a byte, but usually it represents more than one.)
It's forced by the nature of how LLMs use vector embeddings for language.
Basically, a single token in a LLM is represented as a n-element vector, where n is the "hidden dimension", also known as model dimension. In order for the model to be smart, the hidden dimension needs to be large, on the order of 2^16 on top-tier models. Elements of this vector are typically quantized to 2-byte floats, or sometimes smaller. Every possible fact is embedded as a direction in this very many dimensional vector space, and a token is related to a fact if the vector representing that token points into a similar direction as that fact. You can do vector math about these things, famously for most trained models, if you find the vector embedding for king, man, woman and queen, and calculate king - man + woman, the result is very close to queen.
(Does that mean that there are 2^16 possible different kinds of facts about things in this model? No, because high-dimensional geometry is very unintuitively powerful. The facts are not axis-aligned, and they don't need to be perfectly orthogonal. This matters, because the number of individual vectors you can fit into a single 2^16 dimensional space that are orthogonal with each other (all angles 90 degrees) is of course 2^16. But if you allow for almost orthogonal vectors, the number is larger than the number of atoms in the universe. If this sounds wacky, for people with a CS background it can help to think of it as working a bit like a Bloom filter, in that collisions are possible. Although in practice they are only theoretical, because 2^16 is a very large number.)
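The "almost orthogonal" point can be illustrated with a small stdlib-only sketch. It draws random directions and reports the largest pairwise overlap (absolute cosine similarity). The dimensions 8 and 4096 and the count of 20 vectors are arbitrary choices for the demo. This does not prove the counting claim; it only shows why random directions rarely collide once the space is large.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a random direction by normalising a Gaussian sample."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def max_abs_cosine(num_vectors, dim, seed=0):
    """Largest |cosine similarity| between any two of num_vectors random directions."""
    rng = random.Random(seed)
    vectors = [random_unit_vector(dim, rng) for _ in range(num_vectors)]
    worst = 0.0
    for i in range(num_vectors):
        for j in range(i + 1, num_vectors):
            cos = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            worst = max(worst, abs(cos))
    return worst


# In 8 dimensions random directions overlap a lot; in 4096 they barely do.
print(max_abs_cosine(20, 8))
print(max_abs_cosine(20, 4096))
```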
stingraycharles 1 days ago [-]
“I mean, under the covers it's got to be converting to text tokens at some point”
Multi-modal models do actually natively tokenize images, though. So it doesn’t have to be converted to text for it to work. They may do it anyway for accuracy, but it’s not at all required.
Effectively an image is scaled to a standard size, rasterized / cut up, and each cut is assigned a separate token, much in the same way text is tokenized. Train the model on this as well and you’ll end up having a model that can understand images.
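The "cut up" step can be sketched as follows. This is only the tiling; the vision encoder that turns each tile into a token embedding is not shown, and the 16x16 patch size is an assumption rather than any specific model's setting.

```python
def patchify(image, patch):
    """Split a 2D grid of pixel values into non-overlapping patch x patch tiles.

    Returns a list of flattened tiles in row-major order; each tile would be
    handed to a vision encoder to produce one image token.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image must already be resized to a multiple of the patch size")
    tiles = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = [image[y][x]
                    for y in range(top, top + patch)
                    for x in range(left, left + patch)]
            tiles.append(tile)
    return tiles


# A toy 64x48 grayscale "image" cut into 16x16 patches gives 4 * 3 = 12 tiles.
image = [[(x + y) % 256 for x in range(64)] for y in range(48)]
tiles = patchify(image, 16)
print(len(tiles), len(tiles[0]))  # 12 256
```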
qingcharles 1 days ago [-]
I am trying to get rough summaries of long PDFs of scanned pages of text. At first I was doing OCR and passing the (tens of thousands of) characters into the LLM, which works, but it's expensive.
I asked Gemini how to save costs and it said just send in all the images of the pages instead. Instinctively, as a developer, it's hard to fathom how sending 200 images is cheaper than sending the text, but it definitely works.
supern0va 1 days ago [-]
>This is really fascinating to me. I was reading this article and originally agreed with you, "I mean, under the covers it's got to be converting to text tokens at some point, so there is no way it's actually cheaper for Claude itself to execute."
It'd be weird if they were doing this, since it would mean the context window size was a lie and that the API would presumably reject requests whose expanded form went over the 1m limit. For someone using pxpipe with an effective context compression of 90% in some instances, it'd hit the limit at barely 100k.
nazgul17 14 hours ago [-]
That was DeepSeek OCR, not a DeepSeek lineage LLM. If the idea is introduced in LLMs, then you're right. But Gemini is not doing that, not yet. This is something I literally discussed with Claude last week, but took its word for it.
numeri 1 hours ago [-]
DeepSeek-OCR is an LLM, just one trained/post-trained specifically for OCR.
Exact details of text to image compression ratios are of course extremely dependent on the model architecture, training data, training objectives, etc., so there's probably not too much justification for generalizing to all models
can't explain with subsidies a model you host yourself (like deepseek)
measurablefunc 1 days ago [-]
Then you are paying for the electricity. It's not physically possible to do more computation & not use more energy b/c every arithmetic operation requires a minimum amount of energy so more operations = more energy.
michaelt 1 days ago [-]
Not necessarily. See the paper "DeepSeek-OCR: Contexts Optical Compression" [1]
One option, when an image is fed into an LLM, is to divide it into tiles, then those tiles pass through a 'vision encoder' neural network to make 'vision tokens' which are then input into the LLM much like text tokens are. Obviously you train the vision encoder and LLM to understand one another. This is known as an 'end-to-end OCR model'.
And it turns out, once you've trained a model to do this, you can vary the number of 'vision tokens' used to represent a given text document by scaling an image of a document up or down, and see what happens. You also get a load of other parameters like patch size and vision encoder complexity and so on.
Turns out it works really well; in some tests they used 90% fewer input tokens, but still got 97% output performance.
Claude Science has a tool to extract the PDF but not sure if it's OCR'ing it.
lpellis 1 days ago [-]
I tried the same thing last year (with openai models), back then it worked to reduce prompt tokens, but you needed way more completion tokens, ultimately more expensive (and slower)
https://pagewatch.ai/blog/post/llm-text-as-image-tokens/
aabhay 1 days ago [-]
Ahhh my eyes the vibe coded readme
sebmellen 1 days ago [-]
It’s so painful to read the LLM-compressed explanations. I can’t exactly identify what it is, but it’s an immediate tell and literally requires twice the effort to comprehend.
For example:
> Honest caveat, visible in the clip: the pxpipe arm answered the count first and needed one follow-up nudge to also print the ledger balance in the requested one-line format; the plain arm followed the format on the first try. Legibility is solved on Fable — single-reply format compliance is the remaining rough edge.
If I reread this four times, I can sort of interpolate what happened, but it’s mostly pointless and confusing information.
In my experience all models do this to an extent, but Claude seems to be the worst at this. GPT 5.5 is a bit more terse but seems to compress more valuable information.
Yokohiii 1 days ago [-]
My thought on this is that LLMs probably mimic writing patterns and structures from quality resources. But they don't construct a plausible thought hierarchy like an average human does, so their train of thought turns into a rollercoaster of thought. So the order of information is for humans completely out of order.
My guess is that it's a known problem, which steered the frontier models into bullet point preference.
quantummagic 1 days ago [-]
Here's one rewrite that would have helped:
To be fair, as you can see in the clip, the two models handled the prompt slightly differently. The pxpipe variant gave the right count initially but needed a quick follow-up to output the ledger balance in a single line. The standard model, on the other hand, nailed the formatting on its first try. We've completely solved readability here on Fable; our only real hurdle left is getting the models to follow formatting constraints perfectly on the very first reply.
Of course, this was just rewritten by another LLM.
sebmellen 1 days ago [-]
Yeah, but as you say, this is another LLM rewriting it. The amount of noncontextual information is nauseating and destroys the point of a README (in my humble opinion).
A human might have written a disclaimer like this:
> When not using Fable, pxpipe may require additional follow-ups to precisely follow your formatting instructions.
This kind of garbled information dump is very inconsiderate of the reader, and all good writing is considerate of the audience consuming it.
quantummagic 1 days ago [-]
To be fair to the rewriting LLM, it was given only that single paragraph, with no other context and asked only to make it more comprehensible. But your point still stands. Here's what it said about _your_ rewrite:
That human rewrite is excellent. It ruthlessly cuts out the "narrative" of the test case (the transaction counts, the video clip, the "first try vs second try" details) and extracts the only piece of information an actual user reading a README cares about: what to expect when using the software.
Which suggests some ideas that should have been included in the prompt, to get closer to your ideal rewrite.
rtgfhyuj 1 days ago [-]
[dead]
georgemcbay 1 days ago [-]
Reads a little bit better, but still reads like a writer getting paid by the word, which I guess is fitting.
cwmoore 1 days ago [-]
Or maybe: like a copywriter, paid by the dopamine hit
hashmap 1 days ago [-]
the foamy hedging makes me ill and hurts my eyes
adam_arthur 1 days ago [-]
It's about information density.
Most LLMs by default seem to write both text and code with low information density.
You can kind of get around it by prompting them to optimize for compactness, but most just let it run with a more generic prompt.
razodactyl 18 hours ago [-]
Concise responses come from people with writing skills who can get to the point without adding more words to sound smart.
"Elegant prose instantiated through remarkably tailored execution of written word may allow an author's desired intention to flow in a certain way to achieve a precise effect whilst simultaneously allowing said author to sound of much higher mind and thought to the reader." - I probably butchered it but my point is that AI slop seems to be the average of the outputs.
Images that look similar to others because they're average of all the current outputs. Same with music and video. We're noticing when something is AI because it has this signature that's average to other outputs.
Original content is crafted even though inspired by other works.
We're at a weird point where AI is capable but constrained.
As compute increases and AI becomes more personalised I feel the current implosion will explode again into variety.
lubujackson 1 days ago [-]
[dead]
mpalmer 1 days ago [-]
What, you don't like your caveats to be honest?
trueno 1 days ago [-]
dead giveaway for me that something someone made and wants to share = they dont understand what they put together enough to like speak to it with some level of authority.
people can make some really useful stuff with AI especially when its a domain they're already an expert in, and it would go a long ways for them to just sit and explain that 1. they used AI to help 2. their own words to explain what the heck they put together, especially if they can speak to some of the limitations AI has working with it. just goes a long way to demonstrate this guys stuff is worth tinkering with because he has a good grasp on what was created
for 99% of the stuff out there now people are literally operating in domains they don't understand at all, i just close my tab when i see the damn vibe coded readmes
genxy 1 days ago [-]
This seems like a pricing hack that burns resources, that when the loophole gets closed the price of OCR will have to rise?
ricardobeat 1 days ago [-]
It’s not a loophole, it just happens that encoding information as optical tokens is much more efficient than text.
geor9e 1 days ago [-]
Step back and think about it another way - ask which scenario is more likely:
Some random person discovered a 60% across the board gain in all LLMs, using an extremely simple trick that none of the labs noticed in all these years. That trick being to rasterize 8bit characters into 8x8 pixels in a big image. 60% in a market worth trillions of dollars.
or
Anthropic's marketing team arbitrarily prices tokens to drive growth, according to vibes and feelings, and didn't think they needed to price images on par with text in their rush to burn cash & drive growth. Some folks take advantage of the trick during the first few days of the model's availability before Anthropic corrects their pricing, to align more proportionally with actual compute costs.
calebkaiser 1 days ago [-]
Nah, optical compression is a thing. You see it in a lot of different areas in ML. In this case, the "trick" has been known for a while, and belongs to a whole world of compression research. But I think where you're maybe getting mixed up is in where that 60% gain is coming from.
It's not a 60% reduction in cost for 100% of the same output. If you have a model and input text A, and you fix the seed etc. and run text A through the model as text tokens and as compressed image tokens, you will not get identical outputs. You're specifically reducing the number of tensors needed to represent your input, which saves you on raw compute, but also by definition gives you less room to represent the information in your input. It's lossy, in other words.
Put another way, if you're using a model like Fable because you need the absolute frontier of capability and cheaper models cannot solve your tasks, then there is a very real chance that a compression strategy like this drops Fable's accuracy such that it's no longer suitable for your task. Which defeats the point of you paying for the most expensive model in the first place.
So, it's cool research. Might be useful for some people. Probably isn't something that has incredible utility in real use cases.
rightbyte 1 days ago [-]
> a compression strategy
To me compression implies smaller size? However, newline chars seem to be removed in the pic so I guess it could be expressed in fewer bytes than the original text with further compression ...
yorwba 1 days ago [-]
The size is indeed smaller, because text tokens and image tokens are embedded as vectors of the same size, but text tokens typically only cover a few characters, while image tokens typically cover many pixels, so many that you can fit more characters in there. So the same text takes up fewer tokens as an image, and hence requires less time and memory to process.
You could also imagine models where text tokens cover many characters and image tokens just a few pixels, which would invert the relationship, but this is typically suboptimal for the applications people have in mind when they train a model.
jayd16 1 days ago [-]
So split the difference and start encoding input at the words or phrases level?
calebkaiser 1 days ago [-]
Lots of researchers have done just this! There's a really rich history of research + lots of contemporary work on different encoding/representation strategies. This might be interesting to you: https://sbert.net/
What makes the DeepSeek-OCR and related results exciting to some researchers is less about the fact that you could devise a tokenization scheme that has fewer tokens, and more about how well it works.
vineyardmike 1 days ago [-]
> Some random person discovered a 60% across the board gain in all LLMs, using an extremely simple trick that none of the labs noticed in all these years of multi-trillion dollar growth
DeepSeek published a pretty well circulated paper on exactly this many months ago. It just hasn’t been attempted and shared publicly, as a retrofit, AFAIK.
Also, it’s no free lunch, the readme indicates that this “use images” hack is lossy and reduces success rates alongside the reduced cost. Most labs would focus on success increases regardless of price.
geor9e 1 days ago [-]
If the trick were genuinely useful, and was well circulated months ago, the resource-starved inference providers would have squeezed this trick dry already, instead of wasting 60% of their tokens, waiting for users to implement it themselves in 5 minutes of effort.
Klathmon 1 days ago [-]
That's like saying quantization isn't real because the frontier labs aren't using it in their production inference.
This is a lossy process, it produces worse results. It might be worth it for some situations, but applying it to everything would just be making your SOTA model worse
3abiton 2 hours ago [-]
The "trick" is well documented in their DeepSeek-OCR paper, which builds on plenty of other work. It's just not simple to switch a commonly used LLM architecture to a new one, but I don't doubt most frontier labs are already experimenting with it. This is by itself a very active field of research.
ptx 1 days ago [-]
Isn't this just quantization with extra steps? Can converting the text to an image really be a better way to lossily compress it? (Not that I have any idea what I'm talking about on this topic.)
numeri 1 hours ago [-]
No, quantization is applied to model weights or the KV cache (the model activations of all past tokens), and is just storing everything with lower precision (carefully, so that it doesn't hurt performance much).
Sending an image of text instead of text reduces the number of input tokens, but they're still being processed by the model at the same precision. This probably also hurts performance in some way – the question is by how much.
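A rough sketch of the difference, using the usual KV cache size formula. The model shape (32 layers, 8 KV heads, head dimension 128) and the token counts are hypothetical, picked only to make the arithmetic concrete. Quantization shrinks the bytes stored per value; sending an image shrinks the number of tokens.

```python
def kv_cache_bytes(num_tokens, layers, kv_heads, head_dim, bytes_per_value):
    """Keys and values are both cached, hence the factor of 2."""
    return 2 * num_tokens * layers * kv_heads * head_dim * bytes_per_value


# Hypothetical model shape, chosen only for illustration.
shape = dict(layers=32, kv_heads=8, head_dim=128)

baseline = kv_cache_bytes(10_000, bytes_per_value=2, **shape)   # text, 16-bit cache
quantized = kv_cache_bytes(10_000, bytes_per_value=1, **shape)  # same tokens, 8-bit cache
as_image = kv_cache_bytes(4_000, bytes_per_value=2, **shape)    # 60% fewer tokens, 16-bit cache

print(baseline, quantized, as_image)
```

The two savings are independent, so in principle they stack. The quality cost of each is a separate question.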
Klathmon 1 days ago [-]
I also have no idea what I'm talking about, but to me this seems closer to the "caveman mode" that some people use to compress info into fewer tokens. Going through the image tokenizer allows you to leave the source text untouched while still gaining (some of?) the benefits
solenoid0937 1 days ago [-]
[flagged]
satvikpendem 1 days ago [-]
An economist walks past a hundred dollar bill on the ground because someone would've picked it up already if it were real.
Aurornis 1 days ago [-]
I think you missed the part where this is a lossy technique that reduces performance.
The image trick reduces context because it’s lossy. The README says you can’t use it for anything needing exact recall. It produces a gist of the input.
You could achieve something similar by using a small, cheap model to pre-summarize information for the expensive LLM. This is what many people do already and it’s a much better way to do it for most situations.
jug 1 days ago [-]
Alternative 1 isn’t all that unlikely given Opus 4.8 couldn’t do this. So it’s a recently possible hack. Not something LLM corps have been blindsided by for years. I also strongly recommend RTFA in this case, namely ”The honest part, read before relying on it”
stevenhuang 1 days ago [-]
This has been known since VLMs were a thing, that more information can be encoded visually and token efficiency is increased. But it came with performance issues (more hallucinations, etc).
Also I don't think you realize how much dumb stuff is still left on the table. That the market is worth trillions is quite irrelevant here given the dynamism of the field.
guardiangod 1 days ago [-]
Truly a picture is worth a thousand words.
Salgat 1 days ago [-]
That's not what is happening. Claude isn't charging for the tokens it generates from the OCR on its side, but it's still processing the same number of tokens as if you had sent the text, just with the extra step of OCR on Claude's side. This is 100% a loophole that's burning extra resources.
ricardobeat 4 hours ago [-]
There is no OCR, in the traditional sense, involved.
DaiPlusPlus 1 days ago [-]
> encoding information as optical tokens
Educate me: what is an "optical token" when dealing with LLMs?
TZubiri 1 days ago [-]
Of course it isn't
A text encoding uses 8bits per character on average, tokenization further compresses that
An image font would be 25 bits if 5x5, and most fonts are 12 pixels high
Of course it isn't efficient, this is a pricing inefficiency and a hack to exploit it (even the author describes it as an exploit)
legel 1 days ago [-]
You are wrong.
Text tokens are high-dimensional vectors, not 8 bits per character. Every token has a deep embedding, e.g. 1024 float values per text token.
DeepSeek-OCR proved 10x+ compression from visual embedding of text, which was a groundbreaking result. [1]
Very cool to see OP's project hacking on this principle. It's still not lossless, as noted in the github, but is a promising research direction.
I kinda wonder if it's extracting usable context from 2D proximity between lines? Normal text input wouldn't have that kind of information (though it could, and it's arguably just a lookahead/behind of N characters on average).
deburo 1 days ago [-]
A token is probably not a single char, and an image is probably decomposed into tokens as well (and god knows how many tokens an image is decomposed into) which probably map to similar float-hungry vectors. Your counterargument could use a bit more flesh.
And we're talking about images of texts, not images that represent complex imagery such as a very detailed scene or what have you.
nextaccountic 1 days ago [-]
Well, then we could presumably also add lossy compression to texts, without passing through images first
gamblor956 1 days ago [-]
People really need to read their cites and not just the summaries.
The paper notes two things:
1) While the compression ratio for visual text is better than it is for regular text, the absolute space required is still higher for the images. OPs were talking about the space required, not the ratio.
2) The results of the OCR must still be fed into a text-based LLM for linguistic processing. Otherwise, all you have achieved is turning an image into a bunch of text.
TZubiri 1 days ago [-]
>Text tokens are high-dimensional vectors,
You are conflating tokens with embeddings.
Tokens fit in a single word, modern gpt uses a vocabulary with 200k possible values, which would fit into 18 bits.
Have a good one
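Both sides of this exchange describe real quantities at different stages, which a short sketch makes concrete. The 200k vocabulary and the 1024-wide embedding are the figures quoted in the thread; the 2 bytes per value is an assumption (16-bit floats).

```python
def bits_for_vocab(vocab_size):
    """Smallest number of bits that can index every token ID in the vocabulary."""
    return (vocab_size - 1).bit_length()


def embedding_bytes(dim, bytes_per_value=2):
    """Size of the vector a token ID is looked up into (16-bit values assumed)."""
    return dim * bytes_per_value


print(bits_for_vocab(200_000))  # 18
print(embedding_bytes(1024))    # 2048
```

The token ID is what goes over the wire; the embedding is what the model actually computes on, and that is where the per-token cost comes from.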
netsharc 1 days ago [-]
huh, what if the image encoding is 8 bits per R, G, B values of the pixel, then one can encode the same amount of text in less pixel dimensions (3 letters would need 1 pixel instead of three 12x12 pixels)
The top line can be the OCR-able instruction on how to decode the rest of the image, and the rest of the image would be random-looking colourful palette. It might not even need to use 8 bits per character, since ANSI is 7 bits/character.
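A minimal sketch of the packing idea, three bytes per RGB pixel. The length header in the first pixel is my own addition so the padding can be stripped on decode. This only shows the byte packing; whether a vision model could read such an image is a separate question.

```python
def text_to_pixels(text):
    """Pack UTF-8 bytes three at a time into (R, G, B) tuples."""
    data = text.encode("utf-8")
    header = len(data).to_bytes(3, "big")  # first pixel stores the byte length
    padded = header + data + b"\x00" * (-len(data) % 3)
    return [tuple(padded[i:i + 3]) for i in range(0, len(padded), 3)]


def pixels_to_text(pixels):
    """Inverse of text_to_pixels."""
    raw = bytes(channel for pixel in pixels for channel in pixel)
    length = int.from_bytes(raw[:3], "big")
    return raw[3:3 + length].decode("utf-8")


pixels = text_to_pixels("hello, world")
print(len(pixels))             # 5
print(pixels_to_text(pixels))  # hello, world
```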
TZubiri 1 days ago [-]
then it's no longer an image, as the one in the github repo, you would be encoding the text as characters and sending it as an image.
You can achieve this by changing the extension of an image file from .bmp to .txt
Guys, not to be mean, but maybe chill with the state of the art research and go back to studying fundamentals.
netsharc 3 hours ago [-]
If you want to be a know-it-all poseur, at least back it up with data.
vineyardmike 1 days ago [-]
[dead]
jrm4 1 days ago [-]
Anyone else laugh out loud when they read this? Like, okay so NO, that's entirely impossible. What's really going on?
samrus 1 days ago [-]
Not really. They arent actually using more resources this way either. This might be a fundamental inefficiency thats being removed
It kinda makes sense too. Because while people do read code word by word, we often "glance over" it and do roughly pattern recognition on it to know what it does. Only homing in on something when we need to answer a specific question. I think humans kinda naturally do this exploit anyway
So, just be careful with this: it is very likely switching to another, less capable model, hence the cost reduction. So it looks like Fable but isn't. You are doing extra work when you could just switch the model back to Opus 4.8 instead.
g42gregory 1 days ago [-]
I think Oh-My-Pi (OMP.sh) uses images for context compaction. OMP is built on top of the Pi coding agent.
Saw a Tweet a while ago from someone (maybe Carmack, maybe Geohot, maybe Karpathy?) wondering if images were just the better option.
Since then I've been using images with very simply worded prompts whenever I'm informing an agent of what is happening. Sometimes no text in the prompt at all.
It has been very very effective.
That being said, this isn't really what Karpathy was talking about. But it got me thinking a bit, and that got me to a much nicer workflow.
cindyllm 1 days ago [-]
[dead]
anigbrowl 1 days ago [-]
I'm sorry, but this is retarded. It works, and it's clever, but it's clearly a workaround for a pricing failure. Much like the bounty on poisonous snakes leading to people taking up snake-breeding, this just exploits and promotes waste. I think ultimately blame falls on Anthropic for the poor pricing system that enables such arbitrage. But I'm also disgusted by the inevitable tide of people exploiting this until it's fixed, and creating an entirely unnecessary extra tide of digital junk.
brumar 1 days ago [-]
Tangentially related: I don't think OCR is the right term and I am generally vocal about that. But seeing this unquestioned here, I am wondering if I am the one who is wrong here. Is it ok to call this OCR? To me ocr means text in the end, not visual tokens.
parsimo2010 1 days ago [-]
OCR means optical character recognition. The terms do not require a direct transcription, but that is mostly what OCR meant in the past. If you’re using an LLM’s vision capability to pass in text and the LLM actually understands it, then I would say that it recognized the characters, hence OCR seems okay to use.
TurdF3rguson 1 days ago [-]
It's not. OCR is not what the vision model is doing here. We're used to using OCR as a verb but it's more accurate to say the model "visioned" it.
Also, some models still do OCR and it's usually way more expensive that way.
devmor 1 days ago [-]
So if I OCR a document, edit it, and print it, OCR didn't happen?
__hugues 1 days ago [-]
seems really dumb and like it would need to violate basic information theory to work?
input tokens are cheaper than output tokens. seems like it would maybe reduce input tokens at the expense of many more output tokens if you're actually triggering OCR via thinking?
sachamorard 14 hours ago [-]
It's far from being a foolish idea, and it seems to me that the project correctly documents its own limitations. It's rare to see a README list so precisely where the tool falls short.
Then, yes, input tokens are cheaper than output. But when it comes to finding ways to reduce costs, you have to explore all the options.
What about: "Read this document online : [URL]" and you add your text/context to an online document?
Would that reduce the number of tokens used too?
mrbnprck 1 days ago [-]
Documents are processed as tokens as well, unless its bitmap is ocr'd.
Images tho are natively compatible with Multi-Modal LLMs, so theres no image->text translation layer in between.
It's that the unit of cost is different (e.g. "visual token" vs text token)
electrotype 1 days ago [-]
I see. I was thinking that it might be different if the document wasn't provided by you directly, but instead if the LLM fetched it itself online.
tru3_power 1 days ago [-]
This probably works with PDF parsing as well I’m sure, even if it’s just from not having to parse pdf format alone.
npn 18 hours ago [-]
it is funny because nobody ever bothers to point out that they overcharge you on the text input token price.
sure it was pretty resource intensive a few years ago, but with turbo quant, sparse attention and various other techniques, plus advances in hardware (dedicated prefill machines, memory pools for kv caching) the cost should be drastically reduced, and yet they still keep the same cost formula.
I can't help but laugh whenever someone proudly shares how many billion input tokens they spent in their coding sessions and how much they saved with the subscription, meanwhile it is pretty much just electricity cost for the providers.
chickensong 1 days ago [-]
Binary compression unpacked by OCR? This is the stuff of nightmares. So cursed, and yet...
puppycodes 1 days ago [-]
That is hilarious and an amazing find.
shinryuu 1 days ago [-]
Interesting approach, though that readme really needs a rewrite by a human...
dippogriff 1 days ago [-]
I want to see more text-free foundation models
odo1242 1 days ago [-]
If the model has to be able to output text it needs to be able to read text tho
nickpeterson 1 days ago [-]
a picture's worth a thousand tokens
lstroud 1 days ago [-]
Are we really re-discovering that compressed binary formats are more efficient data representations?
solid_fuel 1 days ago [-]
No?
The image is still getting run through OCR and turned back into text before being fed into the LLM. There is no efficiency gain here, rather we have learned that Anthropic is applying a discount to text fed in via OCR.
qingcharles 1 days ago [-]
I don't think this is what is happening, IMO. The models can genuinely "read" the text off the images, but usually at a less-than-perfect ratio, and it uses less tokens for the model on visual input than it does actually using OCR to convert them into text and then sending that in. I do not think there is any intermediate stage where they are applying a free OCR in this situation. (I realize that happens in some situations)
So my guess is that Claude’s backend is doing the same — so this hack is probably more of a loophole in token accounting that might get closed if Claude is doing what Gemini does
But then there is a comment below talking about how DeepSeek was able to get a huge improvement in compression by using visual tokens, https://news.ycombinator.com/item?id=48777848. I don't fully understand all of the underlying technical details so I am still fundamentally baffled about how going the OCR route could actually result in overall electricity/computational savings.
Whether such lossy compression is acceptable for your use case is up to you.
An image token I recall is something like 16x16, so you get 32 bytes of overhead per pixel. And a character is minimally like 20 pixels including the whitespace, so you've jumped from 4 characters per token to maybe 12.
So 3x savings... which actually maps pretty closely to 60% savings.
It's forced by the nature of how LLMs use vector embeddings for language.
Basically, a single token in an LLM is represented as an n-element vector, where n is the "hidden dimension", also known as the model dimension. In order for the model to be smart, the hidden dimension needs to be large, on the order of 2^16 on top-tier models. Elements of this vector are typically quantized to 2-byte floats, or sometimes smaller. Every possible fact is embedded as a direction in this very high-dimensional vector space, and a token is related to a fact if the vector representing that token points in a similar direction as that fact. You can do vector math on these things: famously, for most trained models, if you find the vector embeddings for king, man, woman and queen, and calculate king - man + woman, the result is very close to queen.
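To make the king - man + woman example concrete, here is a toy sketch. The 4-element vectors are hand-made for illustration (axes loosely meaning royalty, male, female, fruit). Real embeddings are learned, have thousands of dimensions, and only approximately show this behaviour.

```python
import math

# Hypothetical hand-crafted embeddings: [royalty, male, female, fruit].
EMBEDDINGS = {
    "king":  [1.0, 1.0, 0.0, 0.0],
    "queen": [1.0, 0.0, 1.0, 0.0],
    "man":   [0.0, 1.0, 0.0, 0.0],
    "woman": [0.0, 0.0, 1.0, 0.0],
    "apple": [0.0, 0.0, 0.0, 1.0],
}


def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c):
    """Return the word whose vector is closest to a - b + c, excluding the inputs."""
    target = [x - y + z for x, y, z in
              zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])]
    candidates = [w for w in EMBEDDINGS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, EMBEDDINGS[w]))


result = analogy("king", "man", "woman")
print(result)
```

With these toy vectors the nearest remaining word is "queen", because subtracting "man" removes the male component and adding "woman" adds the female one while the royalty component is untouched.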
(Does that mean that there are 2^16 possible different kinds of facts about things in this model? No, because high-dimensional geometry is very unintuitively powerful. The facts are not axis-aligned, and they don't need to be perfectly orthogonal. This matters, because the number of individual vectors you can fit into a single 2^16-dimensional space that are orthogonal to each other (all angles 90 degrees) is of course 2^16. But if you allow for almost-orthogonal vectors, the number is larger than the number of atoms in the universe. If this sounds wacky, for people with a CS background it can help to think of it as working a bit like a Bloom filter, in that collisions are possible. Although in actuality they are theoretical, because 2^16 is a very large number.)
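The "almost orthogonal" point can be seen in a small experiment. The sketch below draws random unit vectors and reports the worst-case overlap (largest absolute cosine) between any pair. The dimensions 8 and 2048, the count of 40 vectors and the seed are arbitrary choices for illustration, far smaller than the 2^16 mentioned above.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Sample a direction uniformly at random by normalizing Gaussian noise."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def max_abs_cosine(vectors):
    """Largest |cosine| over all pairs; 0.0 would mean perfectly orthogonal."""
    worst = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            worst = max(worst, abs(dot))
    return worst


rng = random.Random(0)  # fixed seed so the run is repeatable
low_dim = max_abs_cosine([random_unit_vector(8, rng) for _ in range(40)])
high_dim = max_abs_cosine([random_unit_vector(2048, rng) for _ in range(40)])

print(f"worst overlap in 8 dims:    {low_dim:.3f}")
print(f"worst overlap in 2048 dims: {high_dim:.3f}")
```

In the low-dimensional case some pair is bound to overlap heavily, while in the high-dimensional case even the worst pair stays close to a right angle. That is the intuition behind packing far more "nearly independent" directions than there are axes.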
Multi-modal models do actually natively tokenize images, though. So it doesn’t have to be converted to text for it to work. They may do it anyway for accuracy, but it’s not at all required.
Effectively an image is scaled to a standard size, rasterized / cut up, and each cut is assigned a separate token, much in the same way text is tokenized. Train the model on this as well and you’ll end up having a model that can understand images.
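A rough sketch of the "cut up" step, assuming a grayscale image stored as a list of rows and a hypothetical 16-pixel patch size. Real vision encoders then turn each tile into an embedding vector with a learned network, which is not shown here.

```python
def patchify(image, patch):
    """Split a 2D grid (list of rows) into non-overlapping patch x patch tiles.

    Tiles are returned in row-major order, like reading text left to right.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image must be scaled to a multiple of the patch size")
    tiles = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tiles.append([row[left:left + patch] for row in image[top:top + patch]])
    return tiles


# A made-up 64x32 "image" of pixel intensities.
image = [[(y * 64 + x) % 256 for x in range(64)] for y in range(32)]
tiles = patchify(image, 16)
print(len(tiles), "tiles, each", len(tiles[0]), "x", len(tiles[0][0]))
```

Each tile would become one input position for the model, much like one text token does.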
I asked Gemini how to save costs and it said just send in all the images of the pages instead. Instinctively, as a developer, it's hard to fathom how sending 200 images is cheaper than sending the text, but it definitely works.
It'd be weird if they were doing this, since it would mean the context window size was a lie and that the API would presumably reject requests whose expanded form went over the 1m limit. For someone using pxpipe with an effective context compression of 90% in some instances, it'd hit the limit at barely 100k.
Exact details of text to image compression ratios are of course extremely dependent on the model architecture, training data, training objectives, etc., so there's probably not too much justification for generalizing to all models
Edit: didn’t realize this occurred on local models(!!),
this is smarter https://news.ycombinator.com/item?id=48779884
One option, when an image is fed into an LLM, is to divide it into tiles, then those tiles pass through a 'vision encoder' neural network to make 'vision tokens' which are then input into the LLM much like text tokens are. Obviously you train the vision encoder and LLM to understand one another. This is known as an 'end-to-end OCR model'.
And it turns out, once you've trained a model to do this, you can vary the number of 'vision tokens' used to represent a given text document by scaling an image of a document up or down, and see what happens. You also get a load of other parameters like patch size and vision encoder complexity and so on.
Turns out it works really well; in some tests they used 90% fewer input tokens, but still got 97% output performance.
[1] https://arxiv.org/abs/2510.18234
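To see why scaling the image changes the token count, here is a simplified sketch that assumes one vision token per 16x16 patch. That is an assumption for illustration only; the encoder in the paper may compress patches further. The page sizes and the 10,000-token document are made-up numbers.

```python
import math


def vision_tokens(width, height, patch=16):
    """Tokens for an image, assuming one token per patch (a simplification)."""
    return math.ceil(width / patch) * math.ceil(height / patch)


def compression_ratio(text_tokens, width, height, patch=16):
    """How many text tokens each vision token stands in for."""
    return text_tokens / vision_tokens(width, height, patch)


full = vision_tokens(1024, 1024)   # 64 * 64 patches
half = vision_tokens(512, 512)     # 32 * 32 patches, a quarter of the tokens

# Hypothetical document that would cost 10,000 text tokens.
ratio = compression_ratio(10_000, 512, 512)
print(full, half, f"{ratio:.1f}x")
```

Halving each side of the rendered page cuts the token count to a quarter, so the knob is quite coarse. A ratio near 10x corresponds to the "90% fewer input tokens" regime mentioned above, and whether the text is still legible at that scale is the empirical question.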
For example:
> Honest caveat, visible in the clip: the pxpipe arm answered the count first and needed one follow-up nudge to also print the ledger balance in the requested one-line format; the plain arm followed the format on the first try. Legibility is solved on Fable — single-reply format compliance is the remaining rough edge.
If I reread this four times, I can sort of interpolate what happened, but it’s mostly pointless and confusing information.
In my experience all models do this to an extent, but Claude seems to be the worst at this. GPT 5.5 is a bit more terse but seems to compress more valuable information.
My guess is that it's a known problem, which steered the frontier models into bullet point preference.
To be fair, as you can see in the clip, the two models handled the prompt slightly differently. The pxpipe variant gave the right count initially but needed a quick follow-up to output the ledger balance in a single line. The standard model, on the other hand, nailed the formatting on its first try. We've completely solved readability here on Fable; our only real hurdle left is getting the models to follow formatting constraints perfectly on the very first reply.
Of course, this was just rewritten by another LLM.
A human might have written a disclaimer like this:
> When not using Fable, pxpipe may require additional follow-ups to precisely follow your formatting instructions.
This kind of garbled information dump is very inconsiderate of the reader, and all good writing is considerate of the audience consuming it.
That human rewrite is excellent. It ruthlessly cuts out the "narrative" of the test case (the transaction counts, the video clip, the "first try vs second try" details) and extracts the only piece of information an actual user reading a README cares about: what to expect when using the software.
Which suggests some ideas that should have been included in the prompt, to get closer to your ideal rewrite.
Most LLMs by default seem to write both text and code with low information density.
You can kind of get around it by prompting them to optimize for compactness, but most just let it run with a more generic prompt.
"Elegant prose instantiated through remarkably tailored execution of written word may allow an author's desired intention to flow in a certain way to achieve a precise effect whilst simultaneously allowing said author to sound of much higher mind and thought to the reader." - I probably butchered it but my point is that AI slop seems to be the average of the outputs.
Images that look similar to others because they're average of all the current outputs. Same with music and video. We're noticing when something is AI because it has this signature that's average to other outputs.
Original content is crafted even though inspired by other works.
We're at a weird point where AI is capable but constrained.
As compute increases and AI becomes more personalised I feel the current implosion will explode again into variety.
people can make some really useful stuff with AI especially when it's a domain they're already an expert in, and it would go a long way for them to just sit and explain that 1. they used AI to help 2. their own words to explain what the heck they put together, especially if they can speak to some of the limitations AI has working with it. just goes a long way to demonstrate this guy's stuff is worth tinkering with because he has a good grasp on what was created
for 99% of the stuff out there now people are literally operating in domains they don't understand at all, i just close my tab when i see the damn vibe coded readmes
Some random person discovered a 60% across the board gain in all LLMs, using an extremely simple trick that none of the labs noticed in all these years. That trick being to rasterize 8bit characters into 8x8 pixels in a big image. 60% in a market worth trillions of dollars.
or
Anthropic's marketing team arbitrarily prices tokens to drive growth, according to vibes and feelings, and didn't think they needed to price images on par with text in their rush to burn cash & drive growth. Some folks take advantage of the trick during the first few days of the model's availability before Anthropic corrects their pricing, to align more proportionally with actual compute costs.
It's not a 60% reduction in cost for 100% of the same output. If you have a model and input text A, and you fix the seed etc. and run Text A through the model as text tokens and as compressed image tokens, you will not get identical outputs. You're specifically reducing the number of tensors needed to represent your input, which saves you on raw compute, but also by definition gives you less room to represent the information in your input. It's lossy, in other words.
Put another way, if you're using a model like Fable because you need the absolute frontier of capability and cheaper models cannot solve your tasks, then there is a very real chance that a compression strategy like this drops Fable's accuracy such that it's no longer suitable for your task. Which defeats the point of you paying for the most expensive model in the first place.
So, it's cool research. Might be useful for some people. Probably isn't something that has incredible utility in real use cases.
To me compression implies smaller size? However, newline chars seem to be removed in the pic, so I guess it could be expressed in fewer bytes than the original text with further compression ...
You could also imagine models where text tokens cover many characters and image tokens just a few pixels, which would invert the relationship, but this is typically suboptimal for the applications people have in mind when they train a model.
What makes the DeepSeek-OCR and related results exciting to some researchers is less about the fact that you could devise a tokenization scheme that has fewer tokens, and more about how well it works.
DeepSeek published a pretty well circulated paper on exactly this many months ago. It just hasn’t been attempted and shared publicly, as a retrofit, AFAIK.
Also, it’s no free lunch, the readme indicates that this “use images” hack is lossy and reduces success rates alongside the reduced cost. Most labs would focus on success increases regardless of price.
This is a lossy process, it produces worse results. It might be worth it for some situations, but applying it to everything would just be making your SOTA model worse
Sending an image of text instead of text reduces the number of input tokens, but they're still being processed by the model at the same precision. This probably also hurts performance in some way – the question is by how much.
The image trick reduces context because it’s lossy. The README says you can’t use it for anything needing exact recall. It produces a gist of the input.
You could achieve something similar by using a small, cheap model to pre-summarize information for the expensive LLM. This is what many people do already and it’s a much better way to do it for most situations.
Also I don't think you realize how much dumb stuff is still left on the table. That the market is worth trillions is quite irrelevant here given the dynamism of the field.
Educate me: what is an "optical token" when dealing with LLMs?
A text encoding uses 8bits per character on average, tokenization further compresses that
An image font would be 25 bits if 5x5, and most fonts are 12 pixels high
Of course it isn't efficient, this is a pricing inefficiency and a hack to exploit it (even the author describes it as an exploit)
Text tokens are high-dimensional vectors, not 8 bits per character. Every token has a deep embedding, e.g. 1024 float values per text token.
DeepSeek-OCR proved 10x+ compression from visual embedding of text, which was a groundbreaking result. [1]
Very cool to see OP's project hacking on this principle. It's still not lossless, as noted in the github, but is a promising research direction.
[1] https://github.com/deepseek-ai/DeepSeek-OCR/blob/main/DeepSe...
And we're talking about images of texts, not images that represent complex imagery such as a very detailed scene or what have you.
The paper notes two things:
1) While the compression ratio for visual text is better than it is for regular text, the absolute space required is still higher for the images. OPs were talking about the space required, not the ratio.
2) The results of the OCR must still be fed into a text-based LLM for linguistic processing. Otherwise, all you have achieved is turning an image into a bunch of text.
You are conflating tokens with embeddings.
Tokens fit in a single word, modern gpt uses a vocabulary with 200k possible values, which would fit into 18 bits.
Have a good one
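The token-vs-embedding distinction comes down to two different sizes, which can be worked out directly. The 200k vocabulary and 1024-element embedding are the figures quoted in this subthread. The 2 bytes per value is an assumption (16-bit floats), and the helper names are made up.

```python
def bits_for_vocab(vocab_size: int) -> int:
    """Bits needed to store one token ID from a vocabulary of this size."""
    return (vocab_size - 1).bit_length()


def embedding_bytes(hidden_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes for one token's embedding vector (assumes 16-bit floats by default)."""
    return hidden_dim * bytes_per_value


id_bits = bits_for_vocab(200_000)     # size of the token ID on the wire
vector_bytes = embedding_bytes(1024)  # size of the vector the model computes on

print(f"token ID: {id_bits} bits, embedding: {vector_bytes} bytes")
```

Both comments are describing real things: the ID is what gets counted and billed, while the embedding is what the model actually does arithmetic on.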
The top line can be the OCR-able instruction on how to decode the rest of the image, and the rest of the image would be random-looking colourful palette. It might not even need to use 8 bits per character, since ANSI is 7 bits/character.
You can achieve this by changing the extension of an image file from .bmp to .txt
Guys, not to be mean, but maybe chill with the state of the art research and go back to studying fundamentals.
It kinda makes sense too. Because while people do read code word by word, we often "glance over" it and do roughly pattern recognition on it to know what it does. Only homing in on something when we need to answer a specific question. I think humans kinda naturally do this exploit anyway
Since then I've been using images with very simply worded prompts whenever I'm informing an agent of what is happening. Sometimes no text in the prompt at all.
It has been very very effective.
That being said, this isn't really what Karpathy was talking about. But it got me thinking a bit, and that got me to a much nicer workflow.
Also, some models still do OCR and it's usually way more expensive that way.
input tokens are cheaper than output tokens. seems like it would maybe reduce input tokens at the expense of many more output tokens if you're actually triggering OCR via thinking?
Would that reduce the number of tokens used too?
Images tho are natively compatible with multi-modal LLMs, so there's no image->text translation layer in between. It's that the unit of cost is different (e.g. "visual token" vs text token)
sure, it was pretty resource-intensive a few years ago, but with turbo quant, sparse attention and various other techniques, plus advances in hardware (dedicated prefill machines, memory pools for KV caching), the cost should be drastically reduced, and yet they still keep the same cost formula.
I can't help but laugh whenever someone proudly shares how many billion input tokens they spent in their code sections and how much they saved with the subscription, meanwhile it is pretty much just electricity cost for the providers.