[Figures: forward peak memory/batch size; generate peak memory/batch size; generate throughput/batch size; forward latency/batch size]

The benchmarks indicate that AWQ quantization is the fastest for inference and text generation, and has the lowest peak memory for text generation.
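
To make the plotted quantities concrete, here is a minimal sketch of how latency and throughput per batch size could be measured. It is not the harness used for these benchmarks: `benchmark` and `fake_generate` are hypothetical names, and `fake_generate` is a placeholder workload standing in for a real model call. Peak memory is not covered; on a CUDA device it is commonly read with `torch.cuda.max_memory_allocated()`.

```python
import time
from typing import Callable, Dict, List


def benchmark(
    step: Callable[[int], int],
    batch_sizes: List[int],
    repeats: int = 3,
) -> Dict[int, Dict[str, float]]:
    """Time `step(batch_size)` and report mean latency and token throughput.

    `step` is a hypothetical stand-in for a model call. It must return the
    number of tokens it produced for the whole batch.
    """
    results: Dict[int, Dict[str, float]] = {}
    for batch_size in batch_sizes:
        latencies = []
        tokens = 0
        for _ in range(repeats):
            start = time.perf_counter()
            tokens = step(batch_size)
            latencies.append(time.perf_counter() - start)
        mean_latency = sum(latencies) / len(latencies)
        results[batch_size] = {
            "latency_s": mean_latency,
            "throughput_tokens_per_s": tokens / mean_latency,
        }
    return results


def fake_generate(batch_size: int, new_tokens: int = 8) -> int:
    """Placeholder workload: busy work that scales with the batch size."""
    total = 0
    for i in range(batch_size * new_tokens * 1000):
        total += i
    return batch_size * new_tokens


if __name__ == "__main__":
    for size, metrics in benchmark(fake_generate, [1, 4, 16]).items():
        print(size, metrics)
```

With a real model, `step` would wrap a forward pass or a generation call and return the number of tokens processed. GPU work is asynchronous, so a real harness would also need to synchronize the device before reading the timer.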