AI & ML interests
The Hugging Face Inference Endpoints Images repository allows AI builders to collaborate and engage in creating awesome inference deployments.
Recent Activity
Post
258
Before my vacation: Qwen releasing.
When I came back: Qwen still releasing
Respect!!🫡
Meet Qwen Image Edit 🔥 the image editing version of Qwen-Image by @Alibaba_Qwen
Qwen/Qwen-Image-Edit
✨ Apache 2.0
✨ Semantic + Appearance Editing: rotate, restyle, add/remove 🎨
✨ Precise Text Editing → edit CN/EN text, keep style (quick usage sketch below)
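If you want to poke at it locally, here's a minimal, hypothetical sketch using diffusers' generic pipeline loader. The concrete pipeline class is auto-resolved from the repo, but the call signature below (image + prompt kwargs) is an assumption; check the Qwen/Qwen-Image-Edit model card for the exact usage.

```python
# Hypothetical usage sketch - verify the call signature against the
# Qwen/Qwen-Image-Edit model card before relying on it.
import torch
from diffusers import DiffusionPipeline
from PIL import Image

# DiffusionPipeline auto-resolves the concrete pipeline class from the repo.
pipe = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image-Edit", torch_dtype=torch.bfloat16
).to("cuda")

source = Image.open("input.png").convert("RGB")

# Assumed kwargs: image-editing pipelines generally take the source image
# plus an edit instruction as the prompt.
edited = pipe(image=source, prompt="replace the sign text with 'OPEN'").images[0]
edited.save("edited.png")
```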
Post
3269
Thread to gossip during the openai GPT-5 livestream: https://www.youtube.com/watch?v=0Uu_VJeVVfo. Feel free to post your impressions below!
Post
2033
You've probably implemented the 3-loop matrix multiplication many times as an ML practitioner, but the naive implementation is terrible for GPU performance. Modern GPUs achieve peak performance through careful memory access patterns and by minimizing scheduling overhead.
In naive matmul (MxK . KxN), the computation happens in tiles - both for the output matrix and for how you read chunks from the input matrices. Each thread-block processes one output tile by loading corresponding tiles from input (for sum-reduction across K dimension), performing the computation, then terminating. The GPU launches many thread-blocks and schedules them across available streaming multiprocessors (SMs). When an SM finishes one tile, it gets assigned a new thread-block for the next uncomputed tile. This way, multiple output tiles are computed in parallel across the SMs, but we pay the cost for launching thread-blocks each time a new tile is computed.
Persistent matmul changes this approach. Instead of launching thread-blocks to compute some output tiles, computing the results on SMs in parallel, and repeating until all output tiles are computed, you launch only as many thread-blocks as you have SMs available (typically 80-132 on modern GPUs). These thread-blocks stay alive until all output tiles are computed, looping through multiple tiles sequentially. Each persistent thread-block may handle multiple output tiles.
The key benefit is the reduced thread-block launch latency. This persistence strategy, combined with other optimizations like coalesced memory loads/stores, block-tiling, warp-tiling, warp-specialization, double-buffering, ping-pong scheduling and other tricks, helps achieve peak performance. More on this in the future!
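To make the scheduling difference concrete, here's a tiny Python sketch (not a GPU kernel, and separate from the gist linked below): the "thread-blocks" are just Python loops and the tile sizes / SM count are made up, but the work assignment mirrors the two schemes described above.

```python
import numpy as np

# Toy model of tile scheduling: naive = one "thread-block launch" per output
# tile; persistent = NUM_SMS long-lived "thread-blocks" striding over tiles.
M, N, K = 256, 256, 128
TILE_M, TILE_N, TILE_K = 64, 64, 32
NUM_SMS = 4  # illustrative; modern GPUs have ~80-132 SMs

A = np.random.randn(M, K).astype(np.float32)
B = np.random.randn(K, N).astype(np.float32)

def compute_tile(C, tile_id):
    """One output tile: loop over K in chunks and accumulate (sum-reduction)."""
    tm, tn = divmod(tile_id, N // TILE_N)
    rm = slice(tm * TILE_M, (tm + 1) * TILE_M)
    rn = slice(tn * TILE_N, (tn + 1) * TILE_N)
    acc = np.zeros((TILE_M, TILE_N), dtype=np.float32)
    for k0 in range(0, K, TILE_K):
        acc += A[rm, k0:k0 + TILE_K] @ B[k0:k0 + TILE_K, rn]
    C[rm, rn] = acc

num_tiles = (M // TILE_M) * (N // TILE_N)

# Naive: every output tile costs one block launch.
C_naive = np.zeros((M, N), dtype=np.float32)
for tile_id in range(num_tiles):              # each iteration = one launch
    compute_tile(C_naive, tile_id)

# Persistent: launch only NUM_SMS blocks; each loops over tiles until done.
C_persistent = np.zeros((M, N), dtype=np.float32)
for block_id in range(NUM_SMS):               # one launch per SM
    for tile_id in range(block_id, num_tiles, NUM_SMS):
        compute_tile(C_persistent, tile_id)

assert np.allclose(C_naive, C_persistent)
assert np.allclose(C_naive, A @ B, rtol=1e-3, atol=1e-2)
```

On a real GPU, each call to compute_tile would be one thread-block's work; the persistent version pays the launch cost NUM_SMS times instead of once per output tile.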
Code snippet for testing: https://gist.github.com/a-r-r-o-w/28339b442d164084506c0967029968a8
(Bonus: Since I've wanted to learn Manim for a while, this was a great opportunity to make a visualization for Naive VS Persistent matmul. Enjoy ✨)
Post
1210
🔥 July highlights from the Chinese AI community
zh-ai-community/july-2025-open-works-from-the-chinese-community-686586f1a8840797e477ae5a
✨ Another "DeepSeek moment" - Kimi K2 🙌
✨ Qwen goes fully matrixed - Instruct / Thinking / Coder models across 30B - 480B 🤯
✨ The multimodal wave🌊
- GLM-4.1V-Thinking: Image+Text > Text
- Intern-S1: Image+Text > Text
- Wan 2.2: Text+Image > Video
- Skywork-R1V3: Image+Text > Text
- Skywork-UniPic: Text > Image / Image > Text
- Tar-7B: Any-to-Any
- Ming-Lite-Omni-1.5: Any-to-Any
- Step3: Image+Text > Text
- HunyuanWorld-1: Image > 3D
- ThinkSound: Video > Audio
- Neta-Lumina: Text > Image
✨ Tiny & deployable models 🤏
- SmallThinker runs on 1GB RAM
✨ Agentic coding goes mainstream 💻
- Qwen3-Coder: fully spec'd tool calling
- GLM-4.5: browser agents, IDE assistant
- Qwen3 WebDev demo: text-to-frontend code
✨ Domain-specific & utility models / tools / datasets
- ScienceOne S1: Scientific model
- Agentar DeepFinance: Finance dataset
- ObjectClear: Interactive Vision Tool
- Qwen3 MT Demo: Machine Translation Tool
✨ Big month not only for models, but for policy too🏛️
- Announced Global Action Plan for AI Governance
- Proposed to set up a World AI Cooperation Organization in Shanghai
- Released International AI Open Source Collaboration Initiative
- Published Risk Assessment Guidelines for Endpoint AI Agents
✨ Big event - WAIC
- 355K offline visitors
- 108 new releases in 4 days
- 145 sessions across key domains
I’ve been tracking things closely, but July’s open-source wave still blew me away. Can’t wait to see what’s coming next! 🚀
Post
1630
Qwen team did it again!!
They just released Qwen3-Coder-30B-A3B-Instruct on the hub🔥
Qwen/Qwen3-Coder-30B-A3B-Instruct
✨ Apache 2.0
✨ 30B total / 3.3B active (128 experts, top-k = 8)
✨ Native 256K context, extendable to 1M via YaRN (see the sketch below)
✨ Built for Agentic Coding
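As a rough sketch of how the YaRN extension is typically wired up with transformers (the rope_scaling keys, factor, and native length below are illustrative assumptions - take the exact values from the model card):

```python
# Illustrative sketch of enabling YaRN rope scaling via transformers;
# factor and original_max_position_embeddings here are assumptions,
# not values taken from the model card.
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-Coder-30B-A3B-Instruct"

config = AutoConfig.from_pretrained(model_id)
config.rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,                               # ~256K x 4 -> ~1M tokens (assumed)
    "original_max_position_embeddings": 262144,  # assumed native context length
}

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    config=config,
    torch_dtype="auto",
    device_map="auto",
)
```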
Post
231
GH200 cooking time 🧑🍳🔥!
We just updated GPU-fryer 🍳 to run on the Grace Hopper Superchip (GH200) - fully optimized for ARM-based systems!
With this release, we switched to cuBLASLt to support running FP8 benchmarks. You can monitor GPU throttling, TFLOPS outliers, and HBM memory health, and make sure you're getting the most out of your hardware setup.
Perfect for stress testing and tuning datacenter GPUs.
Check it out on GitHub 👉 https://github.com/huggingface/gpu-fryer
Post
359
It’s here! After the WAIC announcement, StepFun has just dropped Step 3 🔥 their latest multimodal reasoning model on the hub.
Paper: Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding (2507.19427)
Model: stepfun-ai/step3
✨ 321B total / 32B active - Apache 2.0
✨ MFA + AFD: cuts decoding cost by up to 70% vs. DeepSeek-V3
✨ 4T image-text pretraining: strong vision–language grounding
✨ Modular, efficient, deployable: runs on just 8×48GB GPUs
Post
2894
We've crossed 1 million repositories backed by Xet storage on Hugging Face! 🚀🚀🚀
You can follow our progress converting the Hub from Git LFS to Xet at jsulz/ready-xet-go
We have a lot of repos left to migrate, which means I have plenty of time to add more animations 🤪
Post
3520
Qwen3-30B-A3B-Thinking-2507 🔥 the latest step in scaling thinking capabilities from the Alibaba Qwen team.
Qwen/Qwen3-30B-A3B-Thinking-2507-FP8
✨ 30B total / 3B active - Apache 2.0
✨ Native 256K context
✨ SOTA coding, alignment, agentic reasoning
Post
2717
Skywork UniPic 🔥 a unified autoregressive multimodal model for image understanding, generation, & editing, by Skywork 天工
Skywork/skywork-unipic-6888c0789cdb82457b2acf32
✨ 1.5B - MIT License
✨ Runs on RTX 4090
✨ Truly unified architecture
Post
1720
Qwen just released Qwen3-30B-A3B-Instruct-2507 🔥 an upgrade to the non-thinking mode model
Qwen/Qwen3-30B-A3B-Instruct-2507
✨ 30B MoE / 3.3B active - Apache 2.0
✨ Strong gains in reasoning, math, coding, & multilingual tasks
✨ Native support for 256K long-context inputs
Post
436
Wan2.2 🔥 A video diffusion model with MoE just released by Alibaba_Wan
Wan-AI/Wan2.2-TI2V-5B
Wan-AI/Wan2.2-I2V-A14B-Diffusers
✨ 5B/14B - Apache 2.0
✨ Cinematic-level aesthetics (lighting, tone, composition)
✨ Massive training data (+83% videos) → smoother motion
✨ Supports image-only video generation, even without a prompt.
Post
367
GLM-4.5 🔥 The largest open models yet from Zhipu.
Built for intelligent agents with unified capabilities: reasoning, coding, tool use.
zai-org/glm-45-687c621d34bda8c9e4bf503b
✨ 355B total / 32B active - MIT license
✨ Hybrid reasoning modes: Thinking mode for complex tasks / Non-thinking mode for instant replies
Post
326
Panshi 磐石 🪨 Scientific Foundation Model by the Chinese Academy of Sciences
ScienceOne-AI/S1-Base-8B
ScienceOne-AI/S1-Base-32B
✨ 8B/32B - Apache 2.0
✨ Trained on scientific data & laws across math, physics, chemistry, bio, etc.
✨ Supports 300+ tools, 170M+ papers, autonomous scientific planning
Post
359
Tencent Hunyuan released their first 3D world model: Hunyuan World 1.0 🔥
tencent/HunyuanWorld-1
✨ From a single prompt to explorable 3D scenes in minutes
✨ Supports Immersive roaming / Semantic-level interactivity / Physics-ready simulation
Post
1708
Big respect to the Qwen team! They just dropped another model🔥
Qwen3-235B-A22B-Thinking-2507 🧠 a new reasoning model by Qwen
Qwen/Qwen3-235B-A22B-Thinking-2507
✨ 235B total / 22B active (8 experts)
✨ 256K context window
✨ Agent-ready with tool use & <think> reasoning mode
Hope the team gets some well-deserved rest this weekend after all the massive releases 🙌
Post
327
Ming-lite-omni v1.5 🔥 an upgraded version of Ming-lite-omni, by Ant Group.
inclusionAI/Ming-Lite-Omni-1.5
✨ 20.3B / 3B active - MoE
✨ SOTA video understanding via 3D MRoPE + curriculum learning
✨ Real time speech synthesis + dialect support
✨ Enhanced multimodal generation with ID & scene consistency
Post
1588
Qwen is on fire this week 🔥
They just released Qwen3-MT 🌍 a translation model supporting 92 languages.
A demo is available on the Hub.
Qwen/Qwen3-MT-Demo
✨ Highly Customizable: Supports custom terms, domain prompts, and translation memory for accurate, context-aware results.
✨ Fast and affordable: $0.5 per million tokens.
Post
3379
Qwen3-Coder 💻 an agentic code model by the Alibaba Qwen team 🚀
Qwen/Qwen3-Coder-480B-A35B-Instruct
✨ 480B total, 35B activated MoE
✨ Agentic Coding + Browser Use → Top code model performance
✨ 256K context (up to 1M via YaRN) for repo-scale understanding