Perplexity AI Open-Sources Unigram Tokenizer That Achieves 5x Lower p50 Latency Than Hugging Face tokenizers Crate
Perplexity AI open-sources a rewritten Unigram tokenizer that reduces reranker latency and cuts production CPU utilization by 5-6x.
I recently came across a statistic that really hit home: 82.6% of phishing emails now use AI in some form (VIPRE/Keepnet, 2025). As a developer who's constantly sharing code snippets, assets, and documentation, this instantly made me think about one of our most common daily activities: file sharing. It's a key attack vector, and the rise of AI makes it more insidious than ever.

I've spent countless hours building tools and systems, and like many of you, I've had my share of "oops" moments when it comes to security. This isn't just a theoretical problem; it's a very real and present danger in our development workflows. We're often caught between the need for speed and convenience, and the imperative of robust security. But with AI in the mix, the stakes have just gotten a lot higher. I want to share some insights from my perspective on navigating this new, more hostile landscape.

The days of easily spotted grammatical errors and generic "Dear Sir/Madam" phishing emails are rapidly fading. AI has revolutionized the sophistication of these attacks. We're talking about:

- Hyper-personalization: AI can scour public data, social media, and even leaked databases to craft highly convincing, personalized emails and messages. They know who you are, who you work with, and what your projects might be.
- Flawless language: Gone are the linguistic tells. AI-generated phishing content is often grammatically perfect, contextually relevant, and indistinguishable from legitimate communication.
- Deepfakes & voice mimicry: Beyond text, AI is enabling convincing deepfake videos and audio, making it harder to verify the identity of someone requesting a file or access.

For developers, this means the shared-design.zip or project-specs-update.pdf you receive could be a Trojan horse, carefully crafted to appear legitimate. The embedded script, the malicious macro, or even just the metadata could be the entry point for an attacker. It's no longer just about clicking a dodgy link; it's about the fi
We run an AI companion bot. Every chat turn, the model sees the same ~5K-token prefix (character persona, content-tier rules, formatting guardrails, a memory blob) plus one new user line. Without caching, we pay for those 5K input tokens every single turn. So we turned on prompt caching across the providers we route through, measured it, and the spread was bigger than any of the marketing pages prepared us for.

Here's the table that survived four weeks in production, plus the one gotcha that ate two weeks before we figured it out.

| Provider / model                 | Hit rate | Latency Δ | Notes                               |
|----------------------------------|----------|-----------|-------------------------------------|
| Cydonia (via OpenRouter)         | 91 %     | −43 %     | Just works, no marker needed        |
| Gemini 3.1 Flash Lite            | 75 %     | −49 %     | Requires cache_control marker       |
| Grok (xAI)                       | 51 %     | −40 %     | "Sticky": best on active sessions   |
| Same code, 600-token test prompt | 0 %      | 0 %       | Methodology bug (see below)         |

Same exact 5K-token system prefix across all rows. Same 10 follow-up turns. Wildly different cache behaviour.

Most OpenAI-compat examples skip any cache hint and assume the provider figures it out from prefix repetition. Some do. Anthropic-style routes (and anything going through OpenRouter that supports cache_control) don't:

    messages = [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": SYSTEM_PROMPT,  # the long, stable prefix
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
        {"role": "user", "content": user_msg},  # the only volatile part
    ]

Cydonia caches without it. Grok caches without it. Gemini 3.1 Flash Lite caches at exactly 0 % without it. The same model jumps to 75 % with one extra field on the last cacheable content block.

We had Gemini 3.1 routed in production for a week showing zero cache reads in usage. Concluded the model "just didn't support caching." It does; we were calling the API the way every other model wanted to be called.

Cost of including the marker on providers that ignore it: zero. Cost of skipping it on a provider that needs it: your entire spend on that route.

Before we caught the marker thing, we'd already wro
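The excerpt does not say how its hit rate was computed, so the following is only a minimal sketch under one assumption: hit rate is the share of input tokens that the provider reports as cache reads, summed over a session. `TurnUsage`, `build_messages`, and `cache_hit_rate` are hypothetical names introduced for illustration, and real providers expose cached-token counts under differing field names. The payload shape mirrors the `cache_control` marker shown above.

```python
from dataclasses import dataclass


@dataclass
class TurnUsage:
    """Token counts for one chat turn (hypothetical field names)."""
    prompt_tokens: int  # total input tokens for the turn
    cached_tokens: int  # input tokens reported as served from cache


def build_messages(system_prompt: str, user_msg: str) -> list[dict]:
    """Build a chat payload with the cache marker on the stable prefix."""
    return [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": system_prompt,
                    # Always send the marker: per the text above, providers
                    # that do not need it ignore it at no cost.
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
        {"role": "user", "content": user_msg},
    ]


def cache_hit_rate(turns: list[TurnUsage]) -> float:
    """Token-weighted share of input tokens served from cache (assumed metric)."""
    total = sum(t.prompt_tokens for t in turns)
    if total == 0:
        return 0.0
    return sum(t.cached_tokens for t in turns) / total
```

Logging one `TurnUsage` per turn and calling `cache_hit_rate` per route would surface a route sitting at zero cache reads within a single session rather than after a week in production.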