Update README.md
README.md
@@ -1,53 +1,19 @@
 ---
 license: mit
 pipeline_tag: text-generation
-tags:
 inference: false
 ---
-
-# Phi-3 Mini-4K-Instruct ONNX DirectML models
-
-<!-- Provide a quick summary of what the model is/does. -->
-This repository hosts the optimized versions of [Phi-3-mini-4k-instruct](https://aka.ms/phi3-mini-4k-instruct) to accelerate inference with ONNX Runtime.
-
-Phi-3 Mini is a lightweight, state-of-the-art open model built upon the datasets used for Phi-2 - synthetic data and filtered websites - with a focus on very high-quality, reasoning-dense data. The model belongs to the Phi-3 model family, and the mini version comes in two variants, 4K and 128K, which is the context length (in tokens) each can support. The model underwent a rigorous enhancement process, incorporating both supervised fine-tuning and direct preference optimization to ensure precise instruction adherence and robust safety measures.
-
-Optimized Phi-3 Mini models are published here in [ONNX](https://onnx.ai) format to run with [ONNX Runtime](https://onnxruntime.ai/) on CPU and GPU across devices, including server platforms, Windows, Linux and Mac desktops, and mobile CPUs, with the precision best suited to each of these targets.
-
-[DirectML](https://aka.ms/directml) support lets developers bring hardware acceleration to Windows devices at scale across AMD, Intel, and NVIDIA GPUs. Along with DirectML, ONNX Runtime provides cross-platform support for Phi-3 Mini across a range of devices for CPU, GPU, and mobile.
-
-To easily get started with Phi-3, you can use our newly introduced ONNX Runtime Generate() API. See [here](https://aka.ms/generate-tutorial) for instructions on how to run it.
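As a rough sketch of what driving this model through the Generate() API looks like (the `onnxruntime-genai` API surface has changed across releases, the model directory path is a placeholder, and the chat template below is an assumption based on the Phi-3 instruct format; consult the linked tutorial for the authoritative version):

```python
# Minimal sketch: running a Phi-3 ONNX model with the onnxruntime-genai
# package. The model directory is a placeholder; the og.* calls follow the
# Generate() API roughly as published around the Phi-3 release and may
# differ in newer package versions.
import os


def format_prompt(user_message: str) -> str:
    """Wrap a user message in the Phi-3 instruct chat template
    (assumed template; verify against the model's tokenizer config)."""
    return f"<|user|>\n{user_message}<|end|>\n<|assistant|>\n"


def generate(model_dir: str, user_message: str, max_length: int = 256) -> str:
    # Deferred import so the helper above stays usable without the package.
    import onnxruntime_genai as og  # e.g. pip install onnxruntime-genai-directml

    model = og.Model(model_dir)
    tokenizer = og.Tokenizer(model)
    params = og.GeneratorParams(model)
    params.set_search_options(max_length=max_length)
    params.input_ids = tokenizer.encode(format_prompt(user_message))
    generator = og.Generator(model, params)
    while not generator.is_done():
        generator.compute_logits()
        generator.generate_next_token()
    return tokenizer.decode(generator.get_sequence(0))


if __name__ == "__main__":
    model_dir = "Phi-3-mini-4k-instruct-onnx-directml"  # placeholder path
    if os.path.isdir(model_dir):
        print(generate(model_dir, "What is DirectML?"))
```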
-
-## ONNX Models
-
-The optimized configurations we have added:
-
-- ONNX model for int4 DML: ONNX model for AMD, Intel, and NVIDIA GPUs on Windows, quantized to int4 using [AWQ](https://arxiv.org/abs/2306.00978).
-
-
-## Hardware Supported
-
-The models are tested on:
-- GPU SKU: RTX 4090 (DirectML)
-
-Minimum Configuration Required:
-- Windows: DirectX 12-capable GPU and a minimum of 4GB of combined RAM
-- CUDA: NVIDIA GPU with [Compute Capability](https://developer.nvidia.com/cuda-gpus) >= 7.0
-
-### Model Description
-
-- **Developed by:** Microsoft
-- **Model type:** ONNX
-- **Language(s) (NLP):** Python, C, C++
-- **License:** MIT
-- **Model Description:** This is a conversion of the Phi-3 Mini-4K-Instruct model for ONNX Runtime inference.
-
-## Additional Details
-- [**ONNX Runtime Optimizations Blog Link**](https://aka.ms/phi3-optimizations)
-- [**Phi-3 Model Blog Link**](https://aka.ms/phi3blog-april)
-- [**Phi-3 Model Card**](https://aka.ms/phi3-mini-4k-instruct)
-- [**Phi-3 Technical Report**](https://aka.ms/phi3-tech-report)
-
 
 ## Performance Metrics
 
@@ -57,16 +23,16 @@ We measured the performance of DirectML on AMD Ryzen 9 7940HS /w Radeon 78
 
 | Prompt Length | Generation Length | Average Throughput (tps) |
 |---------------------------|-------------------|-----------------------------|
-| 128 | 128 |
-| 128 | 256 |
-| 128 | 512 |
-| 128 | 1024 |
-| 256 | 128 |
-| 256 | 256 |
-| 256 | 512 |
-| 256 | 1024 |
-| 512 | 128 |
-| 512 | 256 |
 | 512 | 512 | - |
 | 512 | 1024 | - |
 | 1024 | 128 | - |
 ---
 license: mit
 pipeline_tag: text-generation
+tags:
+- ONNX
+- DML
+- ONNXRuntime
+- phi3
+- nlp
+- conversational
+- custom_code
 inference: false
+language:
+- en
 ---
+# EmbeddedLLM/Phi-3-mini-4k-instruct-onnx-directml
 
 ## Performance Metrics
 
 
 | Prompt Length | Generation Length | Average Throughput (tps) |
 |---------------------------|-------------------|-----------------------------|
+| 128 | 128 | - |
+| 128 | 256 | - |
+| 128 | 512 | - |
+| 128 | 1024 | - |
+| 256 | 128 | - |
+| 256 | 256 | - |
+| 256 | 512 | - |
+| 256 | 1024 | - |
+| 512 | 128 | - |
+| 512 | 256 | - |
 | 512 | 512 | - |
 | 512 | 1024 | - |
 | 1024 | 128 | - |
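The throughput column above is tokens per second (tps); the entries are still unfilled. One simple way to fill them in, sketched under the assumption that you have some callable that generates a requested number of tokens, is to time a full generation and divide:

```python
# Sketch: measuring average generation throughput in tokens per second.
# `generate_tokens` is a hypothetical callable (not part of any library
# named in this card) that produces `num_tokens` tokens when invoked.
import time


def measure_tps(generate_tokens, num_tokens: int) -> float:
    """Return average generated tokens per second for one timed run."""
    start = time.perf_counter()
    generate_tokens(num_tokens)
    elapsed = time.perf_counter() - start
    return num_tokens / elapsed
```

In practice you would warm the model up first and average over several runs, since the first DirectML inference includes shader compilation overhead.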