Vinitha2004 committed on
Commit
88eee7d
·
verified ·
1 Parent(s): 064f953

Upload distilled Qwen2.5-Coder-3B model with knowledge distillation

.gitattributes CHANGED
@@ -33,3 +33,7 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+ model_f16.gguf filter=lfs diff=lfs merge=lfs -text
+ model_q4_0.gguf filter=lfs diff=lfs merge=lfs -text
+ model_q5_0.gguf filter=lfs diff=lfs merge=lfs -text
+ model_q8_0.gguf filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,73 @@
+ ---
+ library_name: peft
+ base_model: Qwen/Qwen2.5-Coder-3B-Instruct-AWQ
+ tags:
+ - knowledge-distillation
+ - code-generation
+ - qwen
+ - lora
+ - distilled
+ license: apache-2.0
+ ---
+
+ # Qwen2.5-Coder-3B Distilled Model
+
+ This is a **knowledge-distilled** version of Qwen2.5-Coder-3B-Instruct-AWQ, trained with Qwen2.5-Coder-7B-Instruct-AWQ as the teacher.
+
+ ## Model Details
+
+ - **Base Model**: Qwen/Qwen2.5-Coder-3B-Instruct-AWQ
+ - **Teacher Model**: Qwen/Qwen2.5-Coder-7B-Instruct-AWQ
+ - **Training Method**: Knowledge distillation with LoRA
+ - **Best Validation Loss**: 1.9286
+ - **Training Time**: ~5 minutes
+ - **Parameters Trained**: 14.9M (4.59% of the base model)
+
+ ## Training Configuration
+
+ - **Temperature**: 2.0 (optimal)
+ - **Alpha**: 0.95 (95% weight on the soft-target distillation term; see the sketch below)
+ - **LoRA Rank**: 8
+ - **Target Modules**: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
+
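+ The training script itself is not part of this commit, but with these hyperparameters a standard distillation objective looks roughly like the following sketch (the function name and tensor shapes are illustrative, not taken from the actual training code):
+
+ ```python
+ import torch.nn.functional as F
+
+ # Illustrative knowledge-distillation loss; the real training code is not in this repo.
+ def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.95):
+     # Soft-target term: KL divergence between temperature-softened distributions,
+     # scaled by T^2 to keep gradient magnitudes comparable across temperatures.
+     soft = F.kl_div(
+         F.log_softmax(student_logits / T, dim=-1),
+         F.softmax(teacher_logits / T, dim=-1),
+         reduction="batchmean",
+     ) * (T * T)
+     # Hard-target term: ordinary cross-entropy against the ground-truth tokens.
+     hard = F.cross_entropy(
+         student_logits.view(-1, student_logits.size(-1)), labels.view(-1)
+     )
+     # alpha = 0.95 puts 95% of the weight on matching the 7B teacher.
+     return alpha * soft + (1.0 - alpha) * hard
+ ```
+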
33
+ ## Usage
34
+
35
+ ```python
36
+ from transformers import AutoTokenizer, AutoModelForCausalLM
37
+ from peft import PeftModel
38
+
39
+ # Load base model and tokenizer
40
+ base_model = AutoModelForCausalLM.from_pretrained(
41
+ "Qwen/Qwen2.5-Coder-3B-Instruct-AWQ",
42
+ torch_dtype=torch.float16,
43
+ device_map="auto"
44
+ )
45
+ tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-3B-Instruct-AWQ")
46
+
47
+ # Load distilled adapter
48
+ model = PeftModel.from_pretrained(base_model, "Vinitha2004/qwen2.5-coder-3b-instruct-awq-gguf")
49
+
50
+ # Generate code
51
+ input_text = "Original Code:\ndef add(a, b):\n return a + b\n\nUpdate Snippet:\n// ... existing code ...\ndef add(a: int, b: int) -> int:\n// ... existing code ...\n\nUpdated Code:\n"
52
+ inputs = tokenizer(input_text, return_tensors="pt")
53
+ outputs = model.generate(**inputs, max_new_tokens=100)
54
+ result = tokenizer.decode(outputs[0], skip_special_tokens=True)
55
+ print(result)
56
+ ```
57
+
58
+ ## Performance
59
+
60
+ This distilled model retains the knowledge from the 7B teacher model while being significantly more efficient:
61
+ - **Faster inference** (3B vs 7B parameters)
62
+ - **Lower memory usage**
63
+ - **Maintained code generation quality**
64
+
65
+ ## Training Dataset
66
+
67
+ Trained on 5000 code editing examples from custom dataset.
68
+
69
+ ## Files
70
+
71
+ - `adapter_config.json`: LoRA configuration
72
+ - `adapter_model.safetensors`: Trained LoRA weights (59MB)
73
+ - Other standard tokenizer files
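+
+ This commit also uploads GGUF files (`model_f16.gguf`, `model_q8_0.gguf`, `model_q5_0.gguf`, `model_q4_0.gguf`). As a minimal sketch (assuming the GGUF files are standalone conversions of this model; the file name and settings below are illustrative), they can be loaded with `llama-cpp-python`, the same library used by `fast_inference.py` in this repository:
+
+ ```python
+ from llama_cpp import Llama
+
+ # Load a quantized variant; pick the file that fits your memory budget.
+ llm = Llama(model_path="model_q4_0.gguf", n_ctx=4096, n_threads=8)
+
+ out = llm("def fibonacci(n):", max_tokens=128, temperature=0.7)
+ print(out["choices"][0]["text"])
+ ```
+
+ Alternatively, `fast_inference.py` wraps the same settings in a small CLI with optional streaming and an interactive chat mode.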
fast_inference.py ADDED
@@ -0,0 +1,184 @@
+ #!/usr/bin/env python3
+ """
+ Optimized inference script for GGUF models
+ Supports llama-cpp-python for maximum speed
+ """
+
+ import argparse
+ import time
+ from pathlib import Path
+ import multiprocessing
+
+ try:
+     from llama_cpp import Llama
+     LLAMA_CPP_AVAILABLE = True
+ except ImportError:
+     LLAMA_CPP_AVAILABLE = False
+     print("llama-cpp-python not available.")
+     print("Install with: pip install llama-cpp-python")
+
+ class FastInference:
+     """Optimized inference class for GGUF models"""
+
+     def __init__(self, model_path: str, n_ctx: int = 4096, n_threads: int = -1):
+         self.model_path = model_path
+
+         if not LLAMA_CPP_AVAILABLE:
+             raise ImportError("llama-cpp-python required for GGUF inference")
+
+         # Use all CPU threads if not specified
+         if n_threads == -1:
+             n_threads = multiprocessing.cpu_count()
+
+         # Initialize model with optimized settings
+         self.model = Llama(
+             model_path=model_path,
+             n_ctx=n_ctx,
+             n_threads=n_threads,
+             n_batch=512,  # Batch size for prompt processing
+             n_gpu_layers=-1 if self._has_gpu() else 0,  # Offload all layers to GPU if available
+             use_mmap=True,  # Memory-map the model file
+             use_mlock=True,  # Lock model memory to avoid swapping
+             verbose=False
+         )
+
+         print(f"Model loaded: {model_path}")
+         print(f"Context length: {n_ctx}")
+         print(f"Threads: {n_threads}")
+         print(f"GPU layers: {-1 if self._has_gpu() else 0}")
+
+     def _has_gpu(self) -> bool:
+         """Check if a CUDA GPU is available"""
+         try:
+             import torch
+             return torch.cuda.is_available()
+         except ImportError:
+             return False
+
+     def generate(self, prompt: str, max_tokens: int = 512, temperature: float = 0.7) -> str:
+         """Generate text with optimized settings"""
+
+         start_time = time.time()
+
+         # Optimized generation parameters
+         response = self.model(
+             prompt,
+             max_tokens=max_tokens,
+             temperature=temperature,
+             top_p=0.9,
+             repeat_penalty=1.1,
+             stop=["</code>", "\n\n\n"],  # Stop sequences
+             stream=False
+         )
+
+         generation_time = time.time() - start_time
+         generated_text = response['choices'][0]['text']
+
+         # Estimate tokens per second (whitespace-split word count as a rough proxy)
+         estimated_tokens = len(generated_text.split())
+         tokens_per_sec = estimated_tokens / generation_time if generation_time > 0 else 0
+
+         print("\n📊 Performance:")
+         print(f"   Time: {generation_time:.2f}s")
+         print(f"   Speed: {tokens_per_sec:.1f} tokens/sec")
+         print(f"   Tokens: {estimated_tokens}")
+
+         return generated_text
+
+     def generate_stream(self, prompt: str, max_tokens: int = 512, temperature: float = 0.7):
+         """Generate text with streaming"""
+
+         print("\n🚀 Streaming response:")
+         start_time = time.time()
+         total_tokens = 0
+
+         stream = self.model(
+             prompt,
+             max_tokens=max_tokens,
+             temperature=temperature,
+             top_p=0.9,
+             repeat_penalty=1.1,
+             stop=["</code>", "\n\n\n"],
+             stream=True
+         )
+
+         for chunk in stream:
+             text = chunk['choices'][0]['text']
+             print(text, end='', flush=True)
+             total_tokens += 1
+
+         generation_time = time.time() - start_time
+         tokens_per_sec = total_tokens / generation_time if generation_time > 0 else 0
+
+         print("\n\n📊 Streaming Performance:")
+         print(f"   Time: {generation_time:.2f}s")
+         print(f"   Speed: {tokens_per_sec:.1f} tokens/sec")
+
+     def chat_mode(self):
+         """Interactive chat mode"""
+         print("\n🤖 Interactive Chat Mode")
+         print("Commands: 'exit' to quit, 'stream' to toggle streaming")
+         print("-" * 50)
+
+         use_streaming = False
+
+         while True:
+             try:
+                 prompt = input("\n👤 You: ")
+
+                 if prompt.lower() == 'exit':
+                     print("👋 Goodbye!")
+                     break
+                 elif prompt.lower() == 'stream':
+                     use_streaming = not use_streaming
+                     print(f"🔄 Streaming {'enabled' if use_streaming else 'disabled'}")
+                     continue
+
+                 print("🤖 Assistant:", end=" ")
+
+                 if use_streaming:
+                     self.generate_stream(prompt)
+                 else:
+                     response = self.generate(prompt)
+                     print(response)
+
+             except KeyboardInterrupt:
+                 print("\n\n👋 Goodbye!")
+                 break
+
+ def main():
+     parser = argparse.ArgumentParser(description="Fast GGUF Model Inference")
+     parser.add_argument("--model", required=True, help="Path to GGUF model file")
+     parser.add_argument("--prompt", help="Text prompt for generation")
+     parser.add_argument("--max-tokens", type=int, default=512, help="Maximum tokens to generate")
+     parser.add_argument("--temperature", type=float, default=0.7, help="Generation temperature")
+     parser.add_argument("--ctx-size", type=int, default=4096, help="Context size")
+     parser.add_argument("--threads", type=int, default=-1, help="Number of threads (-1 for auto)")
+     parser.add_argument("--interactive", action="store_true", help="Start interactive chat mode")
+     parser.add_argument("--stream", action="store_true", help="Use streaming generation")
+
+     args = parser.parse_args()
+
+     # Initialize inference
+     print(f"🚀 Loading model: {args.model}")
+     inferencer = FastInference(
+         args.model,
+         n_ctx=args.ctx_size,
+         n_threads=args.threads
+     )
+
+     if args.interactive:
+         inferencer.chat_mode()
+     elif args.prompt:
+         if args.stream:
+             inferencer.generate_stream(args.prompt, args.max_tokens, args.temperature)
+         else:
+             response = inferencer.generate(args.prompt, args.max_tokens, args.temperature)
+             print("\n🤖 Generated text:")
+             print(response)
+     else:
+         print("Please provide --prompt or use --interactive mode")
+         print("Example: python fast_inference.py --model model.gguf --prompt 'def hello():' --interactive")
+
+ if __name__ == "__main__":
+     main()
model_f16.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ababe61c1ed0823aec714131aa3e1080a709c91768d014bf9b5b6f2fb7c00003
+ size 6178314016
model_q4_0.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:09252b11853433b8af2440225ed7fdd1b2ff2e124f7baa26b67b10f11b1e6cbf
+ size 1822846752
model_q5_0.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:26cae77c7826aa7178a9f64fe873df12d2cc669d691facd524b20ca714b8f136
+ size 2169663264
model_q8_0.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:01d3985cc95e8b9496bee83a7b1a947191d93ca2057987585cdd9a001f339db7
+ size 3285473056
training_metadata.json ADDED
@@ -0,0 +1,12 @@
+ {
+   "training_completed": true,
+   "distillation_method": "knowledge_distillation",
+   "teacher_model": "Qwen/Qwen2.5-Coder-7B-Instruct-AWQ",
+   "student_model": "Qwen/Qwen2.5-Coder-3B-Instruct-AWQ",
+   "best_validation_loss": 1.9286,
+   "optimal_temperature": 2.0,
+   "optimal_alpha": 0.95,
+   "training_samples": 118,
+   "validation_samples": 23,
+   "test_samples": 100
+ }