By Enric Domingo Domènech (ERNI Spain)
Artificial intelligence has spent the last decade mastering language, images and data, but the next frontier is far more tangible. Physical AI refers to systems that can perceive, reason and act in the real world, understanding constraints like gravity, friction and causality. As NVIDIA defines it, it’s about bringing intelligence out of the screen and into the physical environment.
While robotics and automation have existed for years (at ERNI, we have developed several projects in which machine learning models control and assist machines and robots), what is changing now is the level of generalisation and accessibility. We are moving from rigid, pre-programmed machines to adaptable systems that can interpret instructions, learn from experience and execute tasks in dynamic environments. This shift marks a critical transition: AI is no longer just about generating answers and images in the digital space; it is increasingly about executing actions, with physical inputs and physical outputs.
Natural language as a universal interface
One of the key enablers of this transformation is the combination of voice-based AI agents with tools and robotic systems. Natural language is becoming a universal interface for controlling physical devices. Instead of programming robots through complex scripts, users can now interact with them conversationally. In practice, this means building systems where a voice agent understands intent, selects the appropriate tools, and orchestrates actions in the physical world.
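In code, that orchestration step can be as simple as a registry of callable tools plus a reasoning step that picks one. The sketch below is a minimal, hypothetical version: the ‘move_arm’ and ‘pen_down’ tools and the keyword-based ‘select_tool’ stub stand in for a real robot controller and a language model with function calling.

```python
from typing import Callable

# Registry of physical actions the agent is allowed to invoke.
TOOLS: dict[str, Callable[..., str]] = {}

def tool(name: str):
    """Decorator that registers a function as a callable tool."""
    def register(fn: Callable[..., str]) -> Callable[..., str]:
        TOOLS[name] = fn
        return fn
    return register

@tool("move_arm")
def move_arm(x: float, y: float) -> str:
    # A real implementation would send commands to the robot controller.
    return f"arm moved to ({x}, {y})"

@tool("pen_down")
def pen_down() -> str:
    return "pen lowered onto the canvas"

def select_tool(transcript: str) -> tuple[str, dict]:
    """Stand-in for the reasoning step: in production, a language model
    with function calling maps the transcript to a tool and arguments."""
    if "draw" in transcript.lower():
        return "pen_down", {}
    return "move_arm", {"x": 0.0, "y": 0.0}

def handle_utterance(transcript: str) -> str:
    name, args = select_tool(transcript)
    return TOOLS[name](**args)

print(handle_utterance("please draw a line"))  # -> pen lowered onto the canvas
```

A useful property of this pattern is that the tool registry, not the user, bounds what the agent can physically do, which keeps the system auditable and safe by construction.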
At ERNI, we have illustrated this in a demo with a voice-controlled drawing robot (‘Fulgencio’), where spoken instructions are translated into coordinated robotic movements. Architectures for voice AI agents, such as the so-called ‘voice sandwich’ (speech-to-text, language-model reasoning and text-to-speech stacked as separate stages) or real-time all-in-one pipelines, demonstrate how speech recognition, reasoning and actuation can be tightly integrated. The result is a more intuitive human-machine interaction model, one that lowers barriers and opens new possibilities across industries, from manufacturing to healthcare.
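A minimal sketch of the sandwich itself, with each stage reduced to a stub (a production system would plug a real speech-recognition model, a language model and a robot or text-to-speech backend into the marked points):

```python
def speech_to_text(audio: bytes) -> str:
    """Stage 1: transcription. A real system would call an ASR model here."""
    return "draw a circle of radius five"  # fixed transcript for the sketch

def reason(transcript: str) -> dict:
    """Stage 2: reasoning. In production, an LLM with tool calling turns
    the transcript into a structured command."""
    return {"action": "draw_circle", "radius": 5}

def act(command: dict) -> str:
    """Stage 3: actuation. Execute on the robot and produce a spoken reply."""
    return f"done: {command['action']} with radius {command['radius']}"

def voice_sandwich(audio: bytes) -> str:
    # Speech on the outside, reasoning in the middle: hence the name.
    return act(reason(speech_to_text(audio)))

print(voice_sandwich(b"\x00"))  # placeholder audio bytes
```

Real-time all-in-one pipelines collapse these stages into a single speech-native model, trading the sandwich’s modularity for lower latency.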
A new era of robot learning
At the same time, advances in robot learning are dramatically reducing the effort required to train these systems. Frameworks like Hugging Face’s LeRobot and NVIDIA Isaac aim to make robotics as accessible as natural language processing or computer vision. Instead of requiring massive datasets or complex engineering pipelines, developers can now train robots through imitation learning: demonstrating a task a few dozen times and allowing the system to learn from those examples. This end-to-end approach – teleoperating a robot, recording data, training a policy, and deploying it – can be executed with relatively low-cost hardware, such as open-source robotic arms that can be assembled and calibrated in hours.
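At its core, that training step is behavioural cloning: fitting a policy network to the recorded (observation, action) pairs. The sketch below shows the idea in generic PyTorch on synthetic data; it is not the actual API of LeRobot or Isaac, which wrap recording, training and deployment behind higher-level tooling.

```python
import torch
import torch.nn as nn

# Synthetic stand-in for recorded demonstrations: in practice these
# would be teleoperation logs of (observation, action) pairs.
obs_dim, act_dim, n_samples = 12, 6, 512
observations = torch.randn(n_samples, obs_dim)  # e.g. joint angles + gripper state
actions = torch.randn(n_samples, act_dim)       # e.g. demonstrated joint targets

# A small MLP policy: observation in, action out.
policy = nn.Sequential(
    nn.Linear(obs_dim, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, act_dim),
)

optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Behavioural cloning: regress the demonstrated actions from observations.
for epoch in range(200):
    optimizer.zero_grad()
    loss = loss_fn(policy(observations), actions)
    loss.backward()
    optimizer.step()

# Deployment: the trained policy maps a live observation to an action.
with torch.no_grad():
    action = policy(torch.randn(1, obs_dim))
```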
Even more importantly, a new generation of Vision-Language-Action (VLA) models is emerging. These models leverage pre-trained knowledge (similar to large language models) and adapt it to physical tasks, enabling better generalisation and reducing the need for extensive task-specific data.
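As a rough illustration of the interface such a model exposes (image plus instruction in, action out), the sketch below wires a toy vision encoder and a toy language encoder to an action head. Every component here is a placeholder; a real VLA starts from large pre-trained backbones rather than randomly initialised layers.

```python
import torch
import torch.nn as nn

class TinyVLA(nn.Module):
    """Interface sketch of a Vision-Language-Action model: camera image
    and tokenised instruction in, continuous action vector out."""

    def __init__(self, act_dim: int = 7, dim: int = 64):
        super().__init__()
        # Toy vision encoder: flatten a small image into an embedding.
        self.vision = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, dim))
        # Toy language encoder: mean-pooled token embeddings.
        self.tokens = nn.Embedding(1000, dim)
        # Action head: fuse both modalities and decode an action.
        self.head = nn.Linear(2 * dim, act_dim)

    def forward(self, image: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        v = self.vision(image)                  # (batch, dim)
        l = self.tokens(token_ids).mean(dim=1)  # (batch, dim)
        return self.head(torch.cat([v, l], dim=-1))

model = TinyVLA()
action = model(torch.randn(1, 3, 32, 32), torch.randint(0, 1000, (1, 8)))
print(action.shape)  # torch.Size([1, 7]), e.g. 6 joint targets + gripper
```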
The ‘ChatGPT moment’ for physical AI
All these trends point towards a broader inflection point. Physical AI is entering what could be described as its ‘ChatGPT moment’, where foundational models and accessible tooling converge to accelerate adoption. Voice agents combined with tools are enabling natural language control of physical systems, while learning-based approaches are making robots faster to train and more adaptable. The gap between ‘AI that talks’ and ‘AI that acts’ is closing rapidly.
What it means for businesses
For companies, this represents both an opportunity and a challenge: rethinking how automation, user interfaces and intelligent systems are designed. The future is not just digital; it is also physical, embodied in robots, autonomous cars, drones and other specialised machines. And as these technologies mature, the organisations that experiment early will be best positioned to turn intelligent conversations into real-world impact.