Rakshit Aralimatti

RakshitAralimatti

AI & ML interests

Nvidia

Recent Activity

reacted to codelion's post with 🔥 about 22 hours ago

I wanted to share a technique that's been working really well for recovering performance after INT4 quantization. Quantizing an LLM to INT4 (unlike, say, INT8) for inference typically incurs some accuracy loss. Instead of accepting the quality loss, we used the FP16 model as a teacher to train a tiny LoRA adapter (rank=16) for the quantized model.

The cool part: the model generates its own training data using the Magpie technique, so no external datasets are needed. This is critical because we want to stay as close as possible to the distribution of the model's natural responses.

Last year Apple's foundation models paper (https://arxiv.org/pdf/2407.21075) proposed a similar technique and found "By using accuracy-recovery LoRA adapters with only rank 16, Alpaca win rate can be improved by 7-18%, GMS8K accuracy is boosted by 5-10%." (page 47).

We saw similar results on Qwen3-0.6B:

- Perplexity: 2.40 → 2.09 (only 5.7% degradation from FP16 baseline)
- Memory: Only 0.28GB vs 1.0GB for FP16 (75% reduction)
- Speed: 3.0x faster inference than FP16
- Quality: Generates correct, optimized code solutions

Resources:

- Pre-trained adapter: https://huggingface.co/codelion/Qwen3-0.6B-accuracy-recovery-lora
- GitHub repo: https://github.com/codelion/ellora

Happy to answer questions about the implementation or help anyone trying to replicate this. The key insight is that quantization errors are systematic and learnable: a small adapter can bridge the gap without negating the benefits of quantization.

Has anyone else experimented with self-distillation for quantization recovery? Would love to hear about different approaches!
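The teacher-student idea in the post can be illustrated with a toy sketch. This is not the ellora implementation: it assumes a single 6x6 linear layer instead of an LLM, a simple symmetric round-to-nearest INT4 scheme, random inputs instead of Magpie-generated text, and rank 2 instead of rank 16. All function names here (`quantize_int4`, `train_adapter`, and so on) are hypothetical. The full-precision weights act as the teacher, the quantized weights stay frozen, and only the low-rank factors `A` and `B` are trained to shrink the output gap.

```python
import random

def quantize_int4(w, scale):
    """Assumed scheme: symmetric round-to-nearest, integer code clamped to [-8, 7]."""
    code = max(-8, min(7, round(w / scale)))
    return code * scale  # dequantized value used at inference time

def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def student_forward(Wq, A, B, x):
    """Frozen quantized weights plus the low-rank correction B @ (A @ x)."""
    h = matvec(A, x)
    out = [b + c for b, c in zip(matvec(Wq, x), matvec(B, h))]
    return out, h

def distill_loss(W, Wq, A, B, inputs):
    """Mean squared gap between teacher (W) and student (Wq + B A) outputs."""
    total = 0.0
    for x in inputs:
        y_t = matvec(W, x)
        y_s, _ = student_forward(Wq, A, B, x)
        total += sum((s - t) ** 2 for s, t in zip(y_s, y_t))
    return total / len(inputs)

def train_adapter(W, Wq, inputs, rank=2, lr=0.1, steps=300, seed=0):
    """Full-batch gradient descent on half the distillation loss; W and Wq stay fixed."""
    rng = random.Random(seed)
    d_out, d_in, n = len(W), len(W[0]), len(inputs)
    A = [[rng.gauss(0, 0.5) for _ in range(d_in)] for _ in range(rank)]
    B = [[0.0] * rank for _ in range(d_out)]  # zero init: student starts as the plain quantized layer
    for _ in range(steps):
        gA = [[0.0] * d_in for _ in range(rank)]
        gB = [[0.0] * rank for _ in range(d_out)]
        for x in inputs:
            y_t = matvec(W, x)
            y_s, h = student_forward(Wq, A, B, x)
            e = [s - t for s, t in zip(y_s, y_t)]  # student minus teacher
            for i in range(d_out):
                for k in range(rank):
                    gB[i][k] += e[i] * h[k] / n
            for k in range(rank):
                back = sum(B[i][k] * e[i] for i in range(d_out))
                for j in range(d_in):
                    gA[k][j] += back * x[j] / n
        for i in range(d_out):
            for k in range(rank):
                B[i][k] -= lr * gB[i][k]
        for k in range(rank):
            for j in range(d_in):
                A[k][j] -= lr * gA[k][j]
    return A, B

rng = random.Random(42)
d = 6
W = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(d)]   # "FP16" teacher weights
scale = max(abs(w) for row in W for w in row) / 7
Wq = [[quantize_int4(w, scale) for w in row] for row in W]       # frozen INT4 student weights
inputs = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(32)]

# Loss of the quantized layer alone (a rank-1 all-zero adapter contributes nothing).
before = distill_loss(W, Wq, [[0.0] * d], [[0.0] for _ in range(d)], inputs)
A, B = train_adapter(W, Wq, inputs)
after = distill_loss(W, Wq, A, B, inputs)
print(f"teacher-student gap before adapter: {before:.6f}, after: {after:.6f}")
```

Because the rounding error `W - Wq` is a fixed matrix rather than random noise, a low-rank correction can absorb part of it. In this toy the adapter only captures what fits in a rank-2 subspace, so the gap shrinks but does not vanish.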

replied to their post 11 days ago

When you ask ChatGPT, Claude, or Gemini a really tough question, you might notice that little "thinking..." moment before it answers. But what does it actually mean when an LLM is “thinking”?

Imagine a chess player pausing before their next move, not because they don’t know how to play, but because they’re running through possibilities, weighing options, and choosing the best one. LLMs do something similar… except they’re not really thinking like us.

Here’s the surprising part: you might think these reasoning skills come from futuristic architectures or alien neural networks. In reality, most reasoning LLMs still use the same decoder-only transformer architecture as other models. The real magic? It’s in how they’re trained and what data they learn from.

Can AI actually think, or is it just insanely good at faking it? I broke it down in a simple, 4-minute Medium read. Bet you’ll walk away with at least one “aha!” moment. 🚀

Read here: https://lnkd.in/edZ8Ceyg

Organizations

liked 3 models about 1 year ago

liked 2 models over 1 year ago

meta-llama/Meta-Llama-3-8B

Text Generation • 8B • Updated Sep 27, 2024 • 1.37M • 6.29k

microsoft/phi-2

Text Generation • 3B • Updated Apr 29, 2024 • 762k • 3.39k

liked a Space over 1 year ago

1.05k

Open ASR Leaderboard

🏆

View and request speech recognition model benchmarks

liked a model over 1 year ago

mistralai/Mixtral-8x7B-v0.1

47B • Updated Jul 24 • 58.3k • 1.74k

liked a Space almost 2 years ago

13.5k

Open LLM Leaderboard

🏆

Track, rank and evaluate open LLMs and chatbots