The most popular setups, as well as the inference kernels they support, are:
| Kernel | Number of codebooks | Codebook size, bits | Notation | Accuracy | Speedup | Fast GPU inference | Fast CPU inference |
|---|---------------------|---------------------|----------|-------------|-------------|--------------------|--------------------|
| Triton | K | N | KxN | - | Up to ~0.7x | ✅ | ❌ |
| CUDA | 1 | 16 | 1x16 | Best | Up to ~1.3x | ✅ | ❌ |
| CUDA | 2 | 8 | 2x8 | OK | Up to ~3.0x | ✅ | ❌ |
| Numba | K | 8 | Kx8 | Good | Up to ~4.0x | ❌ | ✅ |
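
The `KxN` notation in the table can be read as "K codebooks, each indexed by an N-bit code". The sketch below unpacks that notation. `CodebookSetup` and `parse_notation` are hypothetical helper names for illustration only, not part of any library, and the per-group bit count assumes that each group of weights stores exactly one N-bit index per codebook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodebookSetup:
    """Hypothetical container for one row of the table (illustration only)."""

    num_codebooks: int  # K in the KxN notation
    codebook_bits: int  # N in the KxN notation

    @property
    def entries_per_codebook(self) -> int:
        # An N-bit index can address 2**N distinct codebook entries.
        return 2 ** self.codebook_bits

    @property
    def code_bits_per_group(self) -> int:
        # Assumption: each weight group stores one N-bit index per codebook.
        return self.num_codebooks * self.codebook_bits


def parse_notation(notation: str) -> CodebookSetup:
    """Parse a string such as "2x8" into its two integer components."""
    k, n = notation.lower().split("x")
    return CodebookSetup(num_codebooks=int(k), codebook_bits=int(n))


for name in ("1x16", "2x8"):
    setup = parse_notation(name)
    print(name, setup.entries_per_codebook, setup.code_bits_per_group)
```

Under this assumption, `1x16` and `2x8` spend the same number of index bits per weight group, but `1x16` draws from one large codebook while `2x8` combines two small ones.
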

AWQ
Try AWQ quantization with this notebook!