Robots Finally Learn to Improvise
The Embodied Intelligence Breakthrough
Robots finally learn to improvise—and the companies that solve safety constraints will capture the entire market.
The robotics stack just got rewritten from scratch.
Not iteratively.
Not gradually.
Overnight.
February 2025: Figure AI unveils Helix, the first Vision-Language-Action model controlling the entire upper body of a humanoid robot at 200Hz. Ask it to “fold that laundry,” point at a pile of clothes it’s never seen, and watch it execute. No task-specific programming. No hardcoded trajectories. Just multimodal reasoning converting natural language into continuous motor commands.
November 2024: Physical Intelligence releases π0, demonstrating robots folding laundry, assembling boxes, and bussing tables using a single generalist model trained on diverse data from 7 robot configurations. The model outputs actions at 50Hz through flow-matching architecture instead of autoregressive tokenization. June 2025: Google DeepMind ships Gemini Robotics On-Device, enabling robots to fold origami and play cards with dexterity approaching human performance.
The inflection point arrived faster than anyone predicted. Vision-Language-Action models aren’t incremental improvements to robotic control. They’re architectural replacements that make the previous decade of robotics research obsolete.
Welcome to the world where robots see, understand, and improvise—and the only thing standing between trillion-dollar markets and catastrophic failures is middleware nobody’s built yet.
The 45-Model Evolution From Science Fiction to Production Hardware
From 2022 to 2025, researchers released 45 specialized VLA systems, evolving from foundation models to dual-system architectures with 10ms latency for low-level control. The progression wasn’t linear—it was exponential.
2022-2023: Foundation Era

Google’s RT-2 coined the term “VLA,” demonstrating that robots could treat actions as another language. Train a vision-language model on internet-scale data, fine-tune with robot demonstrations, represent actions as text tokens. Simple. Elegant. Limited to 5Hz control frequencies and parallel grippers.
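The action-as-text idea is easy to sketch: discretize each continuous motor command into an integer bin the language model can emit as a token, then map tokens back to commands at execution time. The 256-bin uniform scheme below is illustrative, not RT-2’s exact recipe:

```python
import numpy as np

def actions_to_tokens(action, low=-1.0, high=1.0, bins=256):
    # Clip each command into the normalized range, then map it
    # onto one of `bins` integer ids the language model can emit.
    clipped = np.clip(np.asarray(action, dtype=float), low, high)
    return np.round((clipped - low) / (high - low) * (bins - 1)).astype(int)

def tokens_to_actions(ids, low=-1.0, high=1.0, bins=256):
    # Inverse map: token ids back to continuous motor commands.
    return low + np.asarray(ids, dtype=float) / (bins - 1) * (high - low)

delta = np.array([0.10, -0.42, 0.90])   # e.g. end-effector deltas
recovered = tokens_to_actions(actions_to_tokens(delta))
assert np.allclose(delta, recovered, atol=1 / 255)  # half-bin quantization error
```

The quantization error is bounded by half a bin width, which is why coarse tokenization works at all; the latency problem comes from emitting those tokens one at a time, autoregressively, at every control step.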
The limitations were crushing. Text tokenization caps action frequency. Autoregressive generation creates latency bottlenecks. Low degrees-of-freedom work for industrial pick-and-place but fail spectacularly for anything requiring dexterity.
2023-2024: The Architecture Split

Two paradigms emerged: early fusion and dual-system designs.
UC Berkeley’s Octo model introduced an open-source approach with 93M parameters and diffusion decoders, trained on 800,000 robot demonstrations from the OpenX-Embodiment Dataset. Compact. Community-driven. Still too slow for real-time dexterous manipulation.
ICLR 2025’s EF-VLA model demonstrated that fusing vision and language representations early—before action prediction—preserves semantic consistency from CLIP pretraining, yielding 20% performance improvement on compositional manipulation tasks and 85% success on unseen goal descriptions.
2025: The Breakthrough Year

Dual-system architectures solved the speed-generalization tradeoff that plagued every previous approach.
NVIDIA’s GR00T N1 and Figure AI’s Helix implement System 1 (fast diffusion policies at 10ms latency for low-level control) paired with System 2 (LLM-based planners for high-level task decomposition). Think fast, act faster.
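The dual-system pattern decouples a slow planner from a fast controller through a shared latent goal: System 2 refreshes the latent at low frequency, System 1 consumes whatever latent is latest at every control tick and never blocks on planning. A toy sketch of the control loop; the class, rates, and the hash-based "planner" stand-in are all illustrative, not any vendor's architecture:

```python
import numpy as np

class DualSystemController:
    """Toy System 1 / System 2 split: a slow planner refreshes a goal
    latent every `plan_every` ticks while a fast reactive policy
    consumes the most recent latent at every control tick."""

    def __init__(self, plan_every=100):
        self.plan_every = plan_every   # fast ticks per slow planning step
        self.latent = np.zeros(8)      # shared goal representation

    def system2_plan(self, instruction):
        # Stand-in for an LLM/VLM planner: derive a goal latent
        # from the instruction (illustrative only).
        rng = np.random.default_rng(abs(hash(instruction)) % 2**32)
        self.latent = rng.normal(size=8)

    def system1_act(self, observation):
        # Stand-in for a fast learned policy: a cheap function of
        # the current observation and the latest goal latent.
        return np.tanh(observation[:8] + self.latent)

    def run(self, instruction, observations):
        actions = []
        for t, obs in enumerate(observations):
            if t % self.plan_every == 0:          # slow loop
                self.system2_plan(instruction)
            actions.append(self.system1_act(obs))  # fast loop, every tick
        return actions
```

The key property is in `run`: the fast loop emits an action on every tick regardless of when the planner last updated, which is how a 200Hz controller can sit under a planner running orders of magnitude slower.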
Helix represents the first VLA outputting high-rate continuous control of the entire humanoid upper body including wrists, torso, head, and individual fingers, operating simultaneously on two robots to enable collaborative manipulation. It runs entirely onboard embedded low-power-consumption GPUs, making it immediately ready for commercial deployment.
Stanford’s OpenVLA democratized the technology with a 7B-parameter open-source model trained on 970,000 real-world robot demonstrations, outperforming closed models like RT-2-X (55B parameters) by 16.5% absolute task success rate with 7x fewer parameters.
The architectural convergence happened fast. Every competitive VLA now uses pre-trained vision-language backbones, flow-matching or diffusion for continuous actions, and cross-embodiment training data. The recipe works. The question shifts from “can we build generalist robots?” to “can we deploy them safely?”
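The flow-matching ingredient in that recipe reduces to a regression loss: sample a point on the straight-line path from noise to an expert action, and train the model to predict the constant velocity along that path. A minimal numpy sketch with a toy linear "model"; all names and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(model, actions):
    """Conditional flow-matching loss on a batch of expert actions.
    x_t interpolates noise -> data; the regression target is the
    constant velocity (data - noise) along that straight path."""
    noise = rng.normal(size=actions.shape)
    t = rng.uniform(size=(actions.shape[0], 1))
    x_t = (1 - t) * noise + t * actions   # point on the noise-to-data path
    target_v = actions - noise            # straight-line velocity
    pred_v = model(x_t, t)
    return np.mean((pred_v - target_v) ** 2)

# Toy "model": predicts velocity as a linear function of (x_t, t).
W = rng.normal(scale=0.1, size=(4 + 1, 4))
model = lambda x, t: np.concatenate([x, t], axis=1) @ W

batch = rng.normal(size=(32, 4))          # 32 expert action chunks
loss = flow_matching_loss(model, batch)
assert loss >= 0.0
```

At inference the learned velocity field is integrated from noise to an action in a handful of steps, which is why flow-matching policies can emit continuous 50Hz commands where token-by-token autoregressive decoding cannot keep up.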
The Market Reality: $280 Billion By 2034
The global advanced robotics market surpassed $44.74 billion in 2024 and is projected to reach $280.01 billion by 2034, growing at 20.13% CAGR. But drilling into segments reveals where VLA models create immediate value:
Industrial robotics: The incumbent disruption

Industrial robotics will reach $35 billion by 2030 from $17 billion in 2024, growing at 14% CAGR. Traditional programmable robots dominate manufacturing automation today. VLA models threaten this entire category.
Consider the implementation delta. Traditional approach: hire integrators, program task-specific trajectories, spend 6-18 months deploying a single use case. VLA approach: collect demonstrations, fine-tune foundation models, deploy across multiple tasks in weeks. The unit economics flip. The moat disappears.
China accounts for 38% of global robot sales, with 276,300 industrial robots installed in 2023—six times more than Japan and 7.3 times more than the United States. This manufacturing capacity advantage becomes exponentially larger if China’s VLA research (which produced 15 notable models in 2024) catches American performance.
Collaborative robots: The growth engine

The collaborative robot market will reach $12 billion by 2030 from $2 billion in 2024, growing at 35% CAGR. Cobots are expected to account for nearly 35% of all robot sales by 2027.
VLA models make cobots actually useful. Current generation collaborative robots require safety cages or reduced speeds. VLA models with learned safety constraints can predict human motion, adapt trajectories in real-time, and maintain productivity while ensuring safety. This unlocks small-batch manufacturing, direct human-robot collaboration, and deployment scenarios impossible with traditional systems.
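The "maintain productivity while ensuring safety" behavior is, in its simplest form, speed-and-separation monitoring: scale robot speed by distance to the nearest detected human. A minimal sketch; the thresholds are illustrative placeholders, not values from any safety standard such as ISO/TS 15066:

```python
def speed_scale(distance_m, stop_at=0.3, full_speed_at=1.5):
    """Linear speed scaling between a hard-stop radius and a
    full-speed radius around the nearest detected human.
    Thresholds here are illustrative, not certified values."""
    if distance_m <= stop_at:
        return 0.0          # inside stop zone: halt
    if distance_m >= full_speed_at:
        return 1.0          # clear workspace: full speed
    return (distance_m - stop_at) / (full_speed_at - stop_at)

assert speed_scale(0.2) == 0.0
assert speed_scale(2.0) == 1.0
assert round(speed_scale(0.9), 2) == 0.5   # halfway between radii
```

What VLA models add on top of this static rule is prediction: instead of reacting to where the human is, a learned model can slow or reroute based on where the human is about to be.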
Medical robotics: The premium segment

The global medical robotic systems market was estimated at $25.56 billion in 2023 and is projected to reach $76.45 billion by 2030, growing at 16.55% CAGR. Surgical robotics dominates today, with Intuitive Surgical’s da Vinci platform capturing majority market share.
VLA models enable a new category: autonomous clinical assistance. Not surgery (regulatory nightmares), but medication delivery, patient monitoring, sterile environment maintenance, sample transportation. Cleanroom robots are projected to grow fastest from 2024 to 2030, automating tasks in sterile environments while minimizing contamination risks.
The US medical robotics market stands at $8.8 billion (31% of global total) and is predicted to grow to $30 billion by 2033. Asia Pacific, led by China at $2 billion currently growing at 22.95% CAGR, will overtake North America by 2033, reaching $45.6 billion.
Logistics robotics: The immediate opportunity

The logistics robotics market will reach $44.56 billion by 2034 from $10.21 billion in 2024, growing at 15.88% CAGR. Asia Pacific held the largest market share of 35% in 2024, with China’s robust e-commerce sector driving demand.
DHL reported a 25% productivity increase after integrating robots into warehouses, with approximately 80% of US warehouses expected to adopt robotics and automation by 2025.
VLA models solve the warehouse manipulation problem that’s plagued logistics automation for decades. Current systems: excellent at moving standardized boxes, terrible at everything else. VLA models: handle arbitrary objects, adapt to packaging variations, understand natural language instructions for exception handling.
The hidden multiplier: Humanoids

The humanoid robot market will reach $18 billion by 2030 from $2 billion in 2024, growing at 40% CAGR, the fastest growth rate of any robotics segment.
Why humanoids? The world’s built for human form factors. VLA models finally make humanoids viable because they can generalize across the enormous action space humanoid platforms require. Figure plans to ship 100,000 humanoids over the next four years.
The Funding Frenzy: $41 Billion in Twelve Months
The capital flowing into embodied AI companies reveals conviction that VLA models work:
Physical Intelligence: From zero to unicorn in eight months

Physical Intelligence raised $470 million total: a $70 million seed in March 2024, then a $400 million Series A in November 2024 at a $2.4 billion valuation, led by Jeff Bezos, Thrive Capital, and Lux Capital. By September 2025, the company entered talks for additional funding at a $5 billion valuation, doubling again in ten months.
Founded by ex-Google robotics scientists and Berkeley professors, Physical Intelligence builds “brains for robots”—universal software running across multiple embodiments. Their π0 model demonstrates the value proposition: one foundation model, seven robot configurations, 68 tasks. The generalization promise investors are betting on.
Figure AI: The humanoid juggernaut

Figure AI exceeded $1 billion in committed capital through its Series C financing, reaching a $39 billion post-money valuation, a 15x increase from the $2.6 billion valuation that followed its $675 million Series B in February 2024.
Led by Parkway Venture Capital with participation from Brookfield Asset Management, NVIDIA, Macquarie Capital, Intel Capital, LG Technology Ventures, Salesforce, T-Mobile Ventures, and Qualcomm Ventures, the round signals institutional conviction that humanoid platforms plus VLA models equal deployable products.
Figure ended its OpenAI collaboration in 2025, stating large language models are “getting smarter yet more commoditized,” and built Helix entirely in-house. Strategic independence matters when your differentiation is embodied intelligence architecture, not foundation models.
The competitive landscape tightens

Skild AI raised a $300 million Series A in July 2024 at a $1.5 billion valuation, led by Coatue, Lightspeed Venture Partners, SoftBank Group, and Bezos Expeditions. Skild builds brain models for various robots and tasks, a direct competitor to Physical Intelligence.
Other significant raises: Collaborative Robotics ($100 million), Sanctuary AI (Accenture partnership plus Canadian funding), Intrinsic (using NVIDIA models), and Vayu (delivery robots). The pattern: every serious robotics company now either builds VLA models internally or partners with foundation model providers.
The funding velocity indicates we’ve crossed from research to commercialization. Investors aren’t betting on “will VLA models work?”—they’re betting on “who captures the market?”
The Safety Problem Nobody’s Solving
Here’s the uncomfortable reality destroying unit economics: VLA models demonstrate spectacular capabilities in controlled environments and catastrophic failures in edge cases. The gap between “works in demo” and “safe for deployment” is where companies die.
Trusting an LLM not to say bad words is one thing; trusting a robot not to hurt people or damage property in your own home is quite another. Language model guardrails use semantic filtering. Physical systems require kinematic constraints, collision avoidance, force limits, and fail-safe behaviors.
The architectural mismatch

VLA models trained end-to-end learn implicit safety behaviors from demonstrations. This works until it doesn’t. When the model encounters situations outside its training distribution, there’s no explicit safety layer preventing catastrophic actions.
Traditional robot control systems use layered architectures: high-level planning, motion primitives, trajectory optimization, low-level controllers with hard constraints. Safe by construction. Rigid. Unable to generalize.
VLA models: end-to-end learned policies that map observations directly to actions. Generalizable. Flexible. Completely opaque about internal reasoning and safety considerations.
Nobody’s shipped production systems bridging this gap at scale.
The middleware opportunity

Simple solutions exist: engineer robots with high backdrivability, passive compliance, or torque limits for safer execution, or define safety policies in software and control signals that restrict the robot’s actions so it avoids obstacles and itself.
But these approaches either constrain performance (torque limits reduce manipulation capability) or require task-specific programming (defeating the generalization advantage VLA models provide).
The technical challenge: build middleware that accepts natural language commands, translates them through VLA models, and enforces safety constraints without sacrificing the flexibility that makes VLA models valuable.
Required capabilities:
Real-time constraint verification: Monitor VLA model outputs, reject actions violating safety bounds, request alternative trajectories
Human intention estimation: Predict human motion in shared workspaces, adapt robot behavior to avoid collisions
Multi-model arbitration: Run multiple VLA policies, select safe actions, override when necessary
Graceful degradation: Detect when the robot encounters out-of-distribution scenarios, invoke human oversight, collect data for future training
Audit trails: Log all decisions, actions, safety violations for compliance and continuous improvement
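Stitched together, those capabilities amount to a runtime shim that sits between the VLA policy and the actuators: verify each proposed action, request alternatives on rejection, degrade gracefully, and log everything. A skeletal sketch; the class, method names, and the single force bound standing in for a full constraint set are all hypothetical:

```python
import time

class SafetyMiddleware:
    """Wrap any policy (a callable obs -> action dict) with
    constraint verification, retries, fallback, and an audit trail."""

    def __init__(self, policy, force_limit=30.0, max_retries=3):
        self.policy = policy
        self.force_limit = force_limit  # newtons, illustrative bound
        self.max_retries = max_retries
        self.audit = []                 # decision log for compliance

    def violates(self, action):
        # Real systems check kinematic limits, predicted collisions,
        # and human separation; a single force bound stands in here.
        return action.get("force", 0.0) > self.force_limit

    def step(self, obs):
        for _ in range(self.max_retries):
            action = self.policy(obs)   # request a (new) proposal
            if not self.violates(action):
                self.audit.append(("ok", time.time(), action))
                return action
            self.audit.append(("rejected", time.time(), action))
        # Graceful degradation: out of safe proposals, stop and
        # escalate to a human supervisor.
        self.audit.append(("fallback", time.time(), None))
        return {"force": 0.0, "halt": True}
```

The point of the design is that the policy stays a black box: the middleware never needs to understand the VLA’s internals, only to veto its outputs, which is what makes the layer portable across foundation models.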
Robot and automation systems carry out increasingly sophisticated tasks in complex environments, placing stringent expectations on autonomy stacks to enforce operational constraints and safety guarantees. Many current applications incorporate perception uncertainty or semantic concepts in task specifications, and safety requirements may be specified implicitly, abstractly, or incompletely—”intangible” safety constraints not specified explicitly as mathematical expressions.
To achieve high levels of safety and ultimately trust, the robotic co-worker must meet the innate expectations of the humans it works with. Human expectations aren’t programmable. They’re learned through interaction. VLA models excel at learning from demonstration. Safety middleware must preserve this learning capability while guaranteeing constraint satisfaction.
The Embodi Opportunity: Middleware Capturing the Market
Consider the market structure forming:
Layer 1: Foundation model providers

Physical Intelligence, Figure AI, Skild AI, and Sanctuary AI, building generalist VLA models. Capital-intensive. Winner-take-most dynamics. Three to five eventual survivors capturing billions in equity value.

Layer 2: Hardware manufacturers

Boston Dynamics, ABB, KUKA, and FANUC, producing robot platforms. Established players with manufacturing expertise and distribution. Incremental improvement businesses with modest margins.

Layer 3: System integrators

Traditional integrators adapting to VLA-powered systems. Fragmented. Local. Services businesses with low valuation multiples.

The missing layer: Safety middleware

Nobody’s building the infrastructure connecting VLA models to deployment-ready systems at scale. This is where Embodi slots in—not competing with foundation models or hardware, but enabling both to reach production.
The technical moat

Embodi’s differentiation comes from solving three hard problems simultaneously:

LLM-to-action translation with safety verification

Accept natural language commands, use VLA models for high-level reasoning, and enforce kinematic constraints, collision avoidance, and force limits in real time. The translation layer includes learned safety policies that understand both task objectives and physical constraints.

Human-in-the-loop override architecture

Deploy robots in supervised mode initially, collect intervention data when humans override unsafe actions, continuously fine-tune safety models, and gradually reduce supervision as confidence increases. This creates a flywheel: more deployments generate more safety data, improving models, enabling broader deployment.

Cross-embodiment safety certification

Build safety policies that generalize across robot platforms. The same middleware runs on Figure’s humanoids, ABB’s industrial arms, and Waymo’s delivery robots. Each deployment adds to the shared safety knowledge base.
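The human-in-the-loop flywheel described above is ultimately a data pipeline: every supervisor override is a labeled example for the next fine-tuning round. A sketch of the logging side; the class and record schema are hypothetical:

```python
import json
import time

class InterventionRecorder:
    """Log human overrides so each supervised deployment grows
    the fine-tuning dataset for the safety model."""

    def __init__(self):
        self.records = []

    def log(self, obs, proposed_action, human_action):
        self.records.append({
            "t": time.time(),
            "obs": obs,
            "proposed": proposed_action,   # what the policy wanted to do
            "corrected": human_action,     # what the supervisor did instead
            "label": "unsafe" if human_action != proposed_action else "safe",
        })

    def export(self):
        # Serialized records become (observation, safe-action) pairs
        # for the next fine-tuning round.
        return json.dumps(self.records)
```

The asymmetry is the moat: whoever deploys first accumulates override data competitors cannot reproduce without running their own supervised fleets.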
The business model

Enterprise SaaS with usage-based pricing. Integrators and manufacturers pay per robot-hour for certified safe operation. The incentive alignment works: customers pay only for robots actually working, and Embodi captures value proportional to utilization.
Revenue model: $50-200/robot/month depending on criticality level. At scale (100,000 deployed robots), that’s $60-240 million annual recurring revenue from a single middleware layer.
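The arithmetic behind that range:

```python
robots = 100_000
low, high = 50, 200            # dollars per robot per month
arr_low = robots * low * 12    # annual recurring revenue, low end
arr_high = robots * high * 12  # annual recurring revenue, high end
assert (arr_low, arr_high) == (60_000_000, 240_000_000)
```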
Go-to-market through partnerships

Don’t sell directly to end customers. Partner with Physical Intelligence, Figure AI, and established integrators. Become the default safety layer every VLA deployment requires.
The pitch to foundation model companies: “You build the intelligence. We make it safe to deploy. Together we capture the market faster than competitors can match our combined capabilities.”
The pitch to integrators: “Stop building custom safety systems for every deployment. Use our certified middleware, reduce implementation time from months to weeks, charge premium margins for guaranteed safe operation.”
Market timing: The narrow window

Here’s why Embodi must move now: foundation model companies will eventually build safety layers internally. But they’re focused on model performance, not deployment infrastructure. The window exists while they’re scaling research, before they recognize safety middleware as strategic.
Physical Intelligence just raised $400 million. They’re hiring researchers, collecting robot data, training bigger models. They’re not hiring safety certification engineers or building multi-robot deployment infrastructure.
Figure AI raised $1 billion focused on scaling BotQ manufacturing to produce 12,000 humanoids annually. They’re building robots, not safety middleware.
The 12-18 month gap between “foundation models work in research” and “companies build comprehensive safety infrastructure” is when Embodi captures the market.
The Three Deployment Scenarios Determining Winners
Fast-forward to 2028. How does this play out?
Scenario one: Consolidated vertical integration

Foundation model companies recognize safety middleware as strategic and build it internally or acquire it. Physical Intelligence launches “Pi-Safe,” Figure AI ships “Helix Guard.” Each platform has proprietary safety systems incompatible with competitors.
Result: fragmented market, high switching costs, slower overall adoption. Winners: established foundation model companies with capital for vertical integration. Losers: independent middleware startups, hardware manufacturers locked to specific platforms.
Scenario two: Open infrastructure layer

An industry consortium (backed by hardware manufacturers fearing platform lock-in) funds open-source safety middleware. It becomes the de facto standard through broad adoption, network effects, and continuous community improvement.
Result: commoditized safety infrastructure, differentiation shifts to foundation models and hardware. Winners: open-source contributors, hardware manufacturers regaining control. Losers: venture-backed middleware startups unable to monetize open solutions.
Scenario three: Enterprise middleware capture

Embodi (or a competitor) ships safety middleware that actually works across embodiments, establishes partnerships with major foundation model providers and integrators, and captures deployment at scale before others can match its safety certification database.
Result: middleware layer becomes valuable independent business, generates licensing revenue from every VLA deployment. Winners: successful middleware company, foundation model providers accessing broader deployment through partnership. Losers: late-moving competitors unable to overcome safety data network effects.
Which path we follow depends on the regulatory timeline. If governments mandate safety certification before widespread VLA deployment, the middleware layer becomes infrastructure-level valuable. If adoption proceeds unregulated until incidents force intervention, foundation model companies build safety internally to avoid external dependencies.
The Uncomfortable Prediction
VLA models work. The research questions are solved. OpenVLA outperformed state-of-the-art closed models with 7x fewer parameters, demonstrating that scale isn’t the constraint. Helix controls entire humanoid upper bodies at 200Hz while generalizing to thousands of novel objects. π0 uses flow-matching architecture enabling 50Hz continuous control for highly dexterous tasks.
The technical barriers fell. The capital arrived. The market exists.
What’s missing? The unglamorous infrastructure making deployment safe enough for insurance underwriters, risk-averse manufacturers, and regulatory compliance departments.
Safety middleware isn’t sexy. It’s not foundation models. It’s not humanoids walking through warehouses. It’s the boring layer that determines whether the $280 billion robotics market reaches $2 trillion or stalls at sub-scale deployment.
There will be nearly 13 million robots in circulation by 2030. Robotics and automation are estimated to save businesses $8 trillion in labor costs by 2030. This doesn’t happen without solving safety.
The winners in embodied AI won’t be the companies with the best VLA models. They’ll be the companies that make VLA models safe enough to deploy at scale.
Build accordingly.
The bottom line: VLA models solved the generalization problem. Nobody’s solved the deployment problem. The company that ships safety middleware capturing enterprise trust wins the entire market—not through better AI, but through boring infrastructure that makes AI deployable. Every dollar flowing into foundation models creates ten dollars of demand for safety infrastructure. The gold rush is here. The pickaxe business remains unbuilt.