<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Robotics on Ndukwe Chidubem</title><link>https://duks31.github.io/tags/robotics/</link><description>Recent content in Robotics on Ndukwe Chidubem</description><generator>Hugo</generator><language>en</language><lastBuildDate>Thu, 26 Mar 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://duks31.github.io/tags/robotics/index.xml" rel="self" type="application/rss+xml"/><item><title>Vision Language Action (VLA) Models</title><link>https://duks31.github.io/posts/2026-03-26-vla/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://duks31.github.io/posts/2026-03-26-vla/</guid><description>&lt;p>&lt;img src="../../images/vla.png" alt="VLA">&lt;/p>
&lt;h3 id="from-pixels-to-poses">
 From Pixels to Poses
 &lt;a class="heading-link" href="#from-pixels-to-poses">
 &lt;i class="fa fa-link" aria-hidden="true" title="Link to heading">&lt;/i>
 &lt;span class="sr-only">Link to heading&lt;/span>
 &lt;/a>
&lt;/h3>
&lt;p>Vision Language Action (VLA) models are large transformer models that predict robot actions from observations gathered by multiple cameras, conditioned on a natural-language instruction. They are usually built on top of pretrained vision-language models, so a great deal of world knowledge comes baked in. VLA models bridge the gap between perception and physical movement: unlike a traditional modular pipeline with separate perception, planning, and control stages, a VLA is typically trained end to end, mapping pixels and text directly to motor commands.&lt;/p>
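&lt;p>To make the end-to-end structure concrete, here is a minimal sketch of a VLA-style policy, assuming PyTorch: camera images and a tokenized instruction are embedded, fused by a transformer trunk, and decoded into a continuous action. Every module name, size, and the action parameterization below is an illustrative placeholder, not the architecture of any particular VLA model.&lt;/p>
&lt;pre>&lt;code class="language-python"># Toy VLA-style policy sketch (illustrative only, assumes PyTorch).
import torch
import torch.nn as nn

class TinyVLA(nn.Module):
    def __init__(self, vocab_size=1000, dim=256, action_dim=7):
        super().__init__()
        # Stand-in for the vision tower of a pretrained VLM: a single
        # patchify convolution instead of a real ViT backbone.
        self.patchify = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        # Stand-in for the language tower: a plain token embedding.
        self.text_embed = nn.Embedding(vocab_size, dim)
        # Transformer trunk that fuses image tokens and text tokens.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=4)
        # Action head: one continuous command, e.g. a 6-DoF end-effector
        # delta pose plus a gripper channel (hypothetical choice).
        self.action_head = nn.Linear(dim, action_dim)

    def forward(self, images, instruction_ids):
        # images: (batch, cameras, 3, H, W); instruction_ids: (batch, seq_len)
        b = images.shape[0]
        x = self.patchify(images.flatten(0, 1))        # (b*cams, dim, h, w)
        x = x.flatten(2).transpose(1, 2)               # (b*cams, patches, dim)
        img_tokens = x.reshape(b, -1, x.shape[-1])     # (b, cams*patches, dim)
        txt_tokens = self.text_embed(instruction_ids)  # (b, seq_len, dim)
        fused = self.trunk(torch.cat([img_tokens, txt_tokens], dim=1))
        # Pool the fused sequence and decode one action for this timestep.
        return self.action_head(fused.mean(dim=1))

policy = TinyVLA()
obs = torch.randn(1, 2, 3, 224, 224)     # two camera views
instr = torch.randint(0, 1000, (1, 12))  # tokenized instruction
action = policy(obs, instr)              # shape (1, 7)
&lt;/code>&lt;/pre></description></item></channel></rss>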