---
license: mit
base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
pipeline_tag: text-generation
tags:
- chat
---

# litert-community/DeepSeek-R1-Distill-Qwen-1.5B
This model provides a few variants of
[deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) that are ready for
deployment on Android using the
[LiteRT (formerly TFLite) stack](https://ai.google.dev/edge/litert) and the
[MediaPipe LLM Inference API](https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference).
## Use the models

### Colab
*Disclaimer: The target deployment surfaces for the LiteRT models are
Android, iOS, and the Web, and the stack has been optimized for performance on these
targets. Trying the models out in Colab is an easy way to familiarize yourself
with the LiteRT stack, with the caveat that performance (memory and latency)
on Colab can be much worse than on a local device.*

[Open In Colab](https://colab.research.google.com/#fileId=https://huggingface.co/litert-community/DeepSeek-R1-Distill-Qwen-1.5B/blob/main/notebook.ipynb)
### Android

* Download and install
  [the apk](https://github.com/google-ai-edge/mediapipe-samples/releases/latest/download/llm_inference-debug.apk).
* Follow the instructions in the app.

To build the demo app from source, please follow the
[instructions](https://github.com/google-ai-edge/mediapipe-samples/blob/main/examples/llm_inference/android/README.md)
from the GitHub repository.
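
You can also call the model directly from your own Android code through the
MediaPipe LLM Inference API instead of the demo app. The snippet below is a
minimal sketch, assuming a model bundle from this repo has already been pushed
to the device; the file name, path, and option values are illustrative, and the
exact option surface can differ between MediaPipe Tasks GenAI releases.

```kotlin
import android.content.Context
import com.google.mediapipe.tasks.genai.llminference.LlmInference

// Minimal sketch: load a LiteRT model bundle from local storage and run one
// synchronous generation. The path and option values below are illustrative
// assumptions, not settings shipped with this repository.
fun generateWithDeepSeek(context: Context, prompt: String): String {
    val options = LlmInference.LlmInferenceOptions.builder()
        .setModelPath("/data/local/tmp/llm/deepseek_r1_distill_qwen_1_5b.task") // hypothetical path
        .setMaxTokens(1280) // roughly in line with the KV cache size used for the benchmarks below
        .setTopK(40)
        .setTemperature(0.8f)
        .build()

    // Engine creation is expensive; a real app should create the engine once and reuse it.
    val llmInference = LlmInference.createFromOptions(context, options)
    return llmInference.generateResponse(prompt)
}
```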
## Performance

### Android

Note that all benchmark stats are from a Samsung S24 Ultra with a
1280-token KV cache size and multiple prefill signatures enabled. A sketch for
observing the latency metrics on your own device follows the notes below.

|                 | Backend | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | Memory (RSS in MB) | Model size (MB) |
| --------------- | ------- | -------------------: | ------------------: | ------------------------: | -----------------: | --------------: |
| fp32 (baseline) | cpu     |                39.56 |                1.43 |                     19.24 |              5,997 |           6,794 |
| dynamic_int8    | cpu     |               110.58 |               12.96 |                      6.81 |              3,598 |           1,774 |

* Model size: measured by the size of the .tflite flatbuffer (the serialization
  format for LiteRT models).
* Memory: indicator of peak RAM usage.
* CPU inference is accelerated via the LiteRT
  [XNNPACK](https://github.com/google/XNNPACK) delegate with 4 threads.
* Benchmarks were run with the XNNPACK cache enabled.
* dynamic_int8: a quantized model with int8 weights and float activations.
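
The time-to-first-token and decode numbers above are easiest to reproduce with
streaming generation. The sketch below uses the same MediaPipe LLM Inference
API as the Android example above and is only a rough guide: the model path is
an assumption, and the exact listener wiring can vary between MediaPipe Tasks
GenAI releases.

```kotlin
import android.content.Context
import android.os.SystemClock
import android.util.Log
import com.google.mediapipe.tasks.genai.llminference.LlmInference

// Rough sketch for eyeballing time-to-first-token and total decode time on a
// device. The result listener is invoked with partial results as tokens are decoded.
fun logLatency(context: Context, prompt: String) {
    var startMs = 0L
    var firstChunkSeen = false

    val options = LlmInference.LlmInferenceOptions.builder()
        .setModelPath("/data/local/tmp/llm/deepseek_r1_distill_qwen_1_5b.task") // hypothetical path
        .setMaxTokens(1280)
        .setResultListener { _, done ->
            val now = SystemClock.elapsedRealtime()
            if (!firstChunkSeen) {
                firstChunkSeen = true
                Log.i("LlmBenchmark", "Time-to-first-token: ${now - startMs} ms")
            }
            if (done) {
                Log.i("LlmBenchmark", "Total generation time: ${now - startMs} ms")
            }
        }
        .build()

    val llmInference = LlmInference.createFromOptions(context, options)
    startMs = SystemClock.elapsedRealtime()
    llmInference.generateResponseAsync(prompt)
}
```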