---
license: mit
language:
- en
base_model:
- microsoft/Florence-2-large
pipeline_tag: robotics
tags:
- VLA
- LIBERO
- Robotics
- Flow
---
# FlowerVLA - Vision-Language-Action Flow Model pretrained on an OXE Split
This is a pretrained FlowerVLA model for robotic manipulation, trained on a subset of the Open X-Embodiment (OXE) dataset.
FLOWER is an efficient Vision-Language-Action flow policy for robot learning that contains only ~1B parameters and achieves state-of-the-art results on benchmarks such as CALVIN, while its low memory footprint enables efficient inference on low-budget hardware.
## Model Description
FlowerVLA is a novel architecture that:
- Uses half of Florence-2 for multi-modal vision-language encoding
- Employs a novel transformer-based flow matching architecture (see the sketch below)
- Provides an efficient, versatile VLA policy with only ~1B parameters
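To give a feel for the flow-matching idea, here is a toy sampler that integrates a learned velocity field from Gaussian noise to an action chunk with a few Euler steps. This is a minimal sketch, not FLOWER's actual decoder or sampler; the network, names, and sizes are invented for illustration:
```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Placeholder velocity field v(a_t, t) over an action chunk (illustrative only)."""
    def __init__(self, action_dim=7, horizon=10, hidden=64):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        self.net = nn.Sequential(
            nn.Linear(horizon * action_dim + 1, hidden),
            nn.GELU(),
            nn.Linear(hidden, horizon * action_dim),
        )

    def forward(self, a, t):
        # Condition the velocity prediction on the flow time t
        x = torch.cat([a.flatten(1), t[:, None]], dim=-1)
        return self.net(x).view(-1, self.horizon, self.action_dim)

def sample_actions(v_net, steps=4):
    """Integrate the velocity field from noise (t=0) to actions (t=1)."""
    a = torch.randn(1, v_net.horizon, v_net.action_dim)  # start from Gaussian noise
    ts = torch.linspace(0.0, 1.0, steps + 1)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        a = a + (t1 - t0) * v_net(a, t0.expand(1))  # one Euler integration step
    return a

actions = sample_actions(VelocityNet())  # (1, 10, 7) action chunk
```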
## Model Performance
Check out the fine-tuned FLOWER checkpoints for LIBERO, CALVIN, and more.
### Input/Output Specifications
#### Inputs
- RGB Static Camera: `(B, T, 3, H, W)` tensor
- RGB Gripper Camera: `(B, T, 3, H, W)` tensor
- Language Instructions: Text strings
#### Outputs
- Action Space: `(B, T, 7)` tensor representing delta EEF actions
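For a quick shape sanity check, dummy inputs matching these specs can be built as follows. Batch size, horizon, and resolution here are illustrative assumptions; the repo defines the actual preprocessing:
```python
import torch

B, T, H, W = 1, 1, 224, 224  # illustrative sizes, not the model's required ones
static_image = torch.zeros(B, T, 3, H, W)   # RGB static camera frames
gripper_image = torch.zeros(B, T, 3, H, W)  # RGB gripper camera frames
instruction = "pick up the blue cube"       # language goal
# The policy is expected to return a (B, T, 7) tensor of delta EEF actions.
```
The `static_image` and `gripper_image` tensors above are reused in the usage snippet below.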
## Usage
Check out our full model implementation on GitHub [todo]() and follow the instructions in the README to test the model in one of the environments.
```python
# Observations: batched image tensors from the two cameras
obs = {
    "rgb_obs": {
        "rgb_static": static_image,    # (B, T, 3, H, W)
        "rgb_gripper": gripper_image,  # (B, T, 3, H, W)
    }
}
# Goal: a natural-language instruction
goal = {"lang_text": "pick up the blue cube"}
action = model.step(obs, goal)  # (B, T, 7) delta EEF action tensor
```
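The 7 action dimensions can then be split out per timestep. Assuming the common delta end-effector convention of 3D translation, 3D rotation, and a gripper command (verify the exact layout against the repo):
```python
# Assumed layout (check the repo): xyz delta, rotation delta, gripper.
delta_pos = action[..., 0:3]  # end-effector translation delta
delta_rot = action[..., 3:6]  # end-effector rotation delta
gripper = action[..., 6:7]    # gripper open/close command
```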
## Training Details
### Configuration
- **Optimizer**: AdamW
- **Learning Rate**: 2e-5
- **Weight Decay**: 0.05
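In PyTorch terms, this configuration corresponds roughly to the following sketch; the placeholder module stands in for the FLOWER policy network:
```python
import torch

policy = torch.nn.Linear(8, 7)  # placeholder; stands in for the FLOWER network
optimizer = torch.optim.AdamW(policy.parameters(), lr=2e-5, weight_decay=0.05)
```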
## Citation
```bibtex
@inproceedings{
reuss2025flower,
% full citation to be added when available
}
```
## License
This model is released under the MIT license.