4-bit Quantized Model: Mistral-7B-Instruct-v0.2

This is a 4-bit quantized variant of mistralai/Mistral-7B-Instruct-v0.2, optimized to reduce memory footprint and accelerate inference while maintaining high output similarity to the full-weight model.


Overview

Mistral-7B-Instruct-v0.2 is an instruction fine-tuned model derived from Mistral-7B-v0.2, featuring:

  • A 32,768-token context window (upgraded from 8k in v0.1).
  • rope_theta=1e6 to improve long-context performance.
  • No sliding-window attention.
  • An instruction format requiring prompts wrapped in [INST] ... [/INST] tokens (see the prompt-format sketch after this list).
  • Compatibility with the mistral_common reference tokenizer for exact reproducibility.
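
As a quick illustration of that instruction format, the snippet below builds a prompt with the Transformers chat template. This is a minimal sketch: the repo id is taken from the Usage section further down, and the rendered string is shown only approximately.

from transformers import AutoTokenizer

# Repo id as used in the Usage section below.
tokenizer = AutoTokenizer.from_pretrained("PJEDeveloper/Mistral-7B-Instruct-v0.2-4bit-20250716_010928")

messages = [{"role": "user", "content": "Explain reinforcement learning in simple terms."}]

# apply_chat_template renders the [INST] ... [/INST] wrapper (and the leading <s>) for you.
prompt = tokenizer.apply_chat_template(messages, tokenize=False)
print(prompt)  # roughly: <s>[INST] Explain reinforcement learning in simple terms. [/INST]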

This quantized checkpoint was produced with BitsAndBytes and evaluated using standard text similarity metrics.


Model Architecture

| Attribute | Value |
|---|---|
| Model class | MistralForCausalLM |
| Number of parameters | 3,752,071,168 |
| Hidden size | 4096 |
| Number of layers | 32 |
| Attention heads | 32 |
| Vocabulary size | 32000 |
| Compute dtype | torch.bfloat16 |
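
To cross-check these values against the published checkpoint, the configuration can be read directly, as in the minimal sketch below. Note that the parameter count above reflects the packed 4-bit (uint8) storage, which is roughly half of the ~7.2B logical parameters of the full-weight base model.

from transformers import AutoConfig

# Minimal sketch: read the architecture fields in the table above straight from the checkpoint config.
config = AutoConfig.from_pretrained("PJEDeveloper/Mistral-7B-Instruct-v0.2-4bit-20250716_010928")
print(config.hidden_size)          # 4096
print(config.num_hidden_layers)    # 32
print(config.num_attention_heads)  # 32
print(config.vocab_size)           # 32000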

Quantization Configuration

The following configuration dictionary was used during quantization:

{
    'quant_method': <QuantizationMethod.BITS_AND_BYTES: 'bitsandbytes'>,
    '_load_in_8bit': False,
    '_load_in_4bit': True,
    'llm_int8_threshold': 6.0,
    'llm_int8_skip_modules': None,
    'llm_int8_enable_fp32_cpu_offload': False,
    'llm_int8_has_fp16_weight': False,
    'bnb_4bit_quant_type': 'fp4',
    'bnb_4bit_use_double_quant': False,
    'bnb_4bit_compute_dtype': 'bfloat16',
    'bnb_4bit_quant_storage': 'uint8',
    'load_in_4bit': True,
    'load_in_8bit': False
}
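
For reference, a roughly equivalent BitsAndBytesConfig can be reconstructed as below. This is a sketch based on the dictionary above (FP4 quantization, no double quantization, bfloat16 compute), not the exact script used to produce this checkpoint.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Mirrors the relevant fields of the quantization dictionary above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="fp4",
    bnb_4bit_use_double_quant=False,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Quantizing the base model on the fly would then look something like:
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    quantization_config=bnb_config,
    device_map="auto",
)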

Intended Use

  • Research and experimentation with instruction-following tasks.
  • Demonstrations of quantized model capabilities in resource-constrained environments.
  • Prototyping workflows requiring extended 32k-token context and instruction formatting.

Limitations

  • May reproduce biases and factual inaccuracies present in the original model.
  • This instruct variant does not include any moderation or safety guardrails by default.
  • Quantization can reduce generation diversity and precision.
  • Not intended for production without thorough evaluation and alignment testing.
  • The Transformers tokenizer may not exactly match the mistral_common reference tokenizer.

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("PJEDeveloper/Mistral-7B-Instruct-v0.2-4bit-20250716_010928")
model = AutoModelForCausalLM.from_pretrained("PJEDeveloper/Mistral-7B-Instruct-v0.2-4bit-20250716_010928", device_map="auto")

# The tokenizer prepends the <s> BOS token automatically, so the prompt only needs the [INST] ... [/INST] wrapper.
prompt = "[INST] Explain reinforcement learning in simple terms. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Example Inference with mistral_common (Reference Implementation)

from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_inference.transformer import Transformer
from mistral_inference.generate import generate

mistral_models_path = "MISTRAL_MODELS_PATH"
tokenizer = MistralTokenizer.v1()
model = Transformer.from_folder(mistral_models_path)

completion_request = ChatCompletionRequest(
    messages=[UserMessage(content="Explain reinforcement learning in simple terms.")]
)
tokens = tokenizer.encode_chat_completion(completion_request).tokens
out_tokens, _ = generate(
    [tokens],
    model,
    max_tokens=64,
    temperature=0.0,
    eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id,
)
result = tokenizer.decode(out_tokens[0])
print(result)

Prompt Used for Evaluation

Explain the concept of reinforcement learning, including an example of how it works in real-world applications. Please format the response in a clear and concise manner suitable for someone with a basic understanding of machine learning.

Reference Output (Full-weight Model)

Reinforcement Learning (RL) is a type of machine learning where an agent learns to make decisions by interacting with its environment. The goal of the agent is to maximize a reward signal over time by selecting actions that lead to positive outcomes and avoiding those that result in negative consequences.

In RL, the agent learns by receiving feedback in the form of rewards or penalties for each action it takes. The agent's policy is the rule it uses to select actions based on its current state and past experiences. The value function of a state represents the expected future rewards the agent can obtain by taking actions from that state.

The agent learns through iterative interactions with the environment, using techniques such as Q-learning or Deep Q-Networks (DQN) to update its policy and value function. Q-learning is a popular table-based method for RL where the agent learns a Q-table that stores the expected future rewards for each state-action pair. DQN, on the other hand, is a deep learning variant of Q-learning that learns a neural network approximator to estimate the Q-values.

Let's consider a real-world example of RL in action: playing the video game, "Breakout." In Breakout, the agent's goal is to bounce a ball off its paddle to destroy bricks without losing the ball. The agent receives a reward of +1 for each brick it destroys and a penalty of -1 for each time it loses the ball. The agent's state would comprise information about the positions of the ball and paddle, as well as the layout of the remaining bricks.

The agent uses Q-learning to determine the best action (i.e., the angle and force to apply to the paddle to hit the ball) given its current state. Over time, as the agent plays the game, it updates its Q-table based on the rewards it receives from each action. Eventually, the agent learns an optimal policy that leads to the maximum reward, i.e., destroying as many bricks as possible without losing the ball.

In summary, RL enables agents to learn optimal policies by trial-and-error interaction with their environment, guided by rewards and punishments. Real-world applications of RL include robotics, gaming, and optimizing complex systems like power grids or traffic control systems.

Quantized Model Output

Reinforcement Learning (RL) is a type of machine learning where an agent learns to make decisions by interacting with its environment. The agent aims to maximize a reward signal, which is provided by the environment after each action. The agent's goal is to learn a policy, which is a mapping from states to actions that maximizes the expected future reward.

The RL process can be broken down into several steps:

1. **Initialization**: The agent is placed in an initial state, and the learning process begins.

2. **Observation**: The agent observes the current state of the environment.

3. **Action Selection**: Based on the current state and its learned policy, the agent selects an action to take.

4. **Environment Transition**: The environment transitions to a new state based on the agent's action.

5. **Reward Collection**: The agent receives a reward from the environment based on the new state and the previous action.

6. **Learning**: The agent updates its policy based on the new information gained from the environment interaction.

7. **Repeat**: The process is repeated from step 2.

A classic example of RL is the Q-Learning algorithm, which learns to play a game like chess or checkers. In this example, the agent's goal is to learn a policy that maximizes the reward of winning the game. The agent observes the current state of the board, selects an action (e.g., moving a piece), receives a reward based on the outcome of the move (e.g., winning or losing a piece), and updates its policy based on the new information. Over time, the agent learns to make optimal moves that maximize the expected future reward of winning the game.

Another real-world application of RL is in robotics, where an agent learns to navigate an environment and perform tasks. For example, an RL agent could learn to navigate a maze to find a goal, receiving a reward when it reaches the goal. The agent would learn to maximize the expected future reward of reaching the goal by learning a policy that maps states to actions that lead to the goal. This could be useful in applications such as autonomous vehicles or industrial robots.

Evaluation Metrics

| Metric | Value |
|---|---|
| ROUGE-L F1 | 0.6483 |
| BLEU | 0.3498 |
| Cosine Similarity | 0.8984 |
| BERTScore F1 | 0.7208 |

  • Higher ROUGE and BLEU scores indicate closer alignment with the original output.

Interpretation: The quantized model output maintains substantial similarity to the full-weight model.

Warning: The quantized output has 29 sentences, while the reference has 50. This may indicate structural divergence.
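
The exact evaluation script is not included in this card. The sketch below shows one way comparable metrics could be computed with the evaluate and sentence-transformers libraries; the embedding model used here for cosine similarity is an illustrative assumption, not necessarily the one behind the score above.

import evaluate
from sentence_transformers import SentenceTransformer, util

reference = "..."  # full-weight model output (see Reference Output above)
candidate = "..."  # quantized model output (see Quantized Model Output above)

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")
bertscore = evaluate.load("bertscore")

rouge_l = rouge.compute(predictions=[candidate], references=[reference])["rougeL"]
bleu_score = bleu.compute(predictions=[candidate], references=[reference])["bleu"]
bert_f1 = bertscore.compute(predictions=[candidate], references=[reference], lang="en")["f1"][0]

# Cosine similarity over sentence embeddings; the embedding model is an assumed choice.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
cos_sim = util.cos_sim(embedder.encode(reference), embedder.encode(candidate)).item()

print(f"ROUGE-L F1: {rouge_l:.4f}  BLEU: {bleu_score:.4f}  BERTScore F1: {bert_f1:.4f}  Cosine: {cos_sim:.4f}")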

Generation Settings

This model produces its best results with the following generation settings:

max_new_tokens=2048,
do_sample=False,
temperature=0.3,
top_p=0.9,
pad_token_id=tokenizer.eos_token_id
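
Putting this together with the Usage example above, a full generation call with these settings might look like the sketch below. Note that with do_sample=False, Transformers performs greedy decoding, so temperature and top_p have no effect (and may trigger a warning).

from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "PJEDeveloper/Mistral-7B-Instruct-v0.2-4bit-20250716_010928"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")

prompt = "[INST] Explain reinforcement learning in simple terms. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    do_sample=False,
    temperature=0.3,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))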

Model Files Metadata

| Filename | Size (bytes) | SHA-256 |
|---|---|---|
| quant_config.txt | 446 | f7a08f6dc4b46a4803dce152c536ceed2ee802755840db11231fb5a895b2e022 |

Notes

  • Produced on 2025-07-16T01:15:59.380625.
  • Quantized automatically using BitsAndBytes.
  • Base model: mistralai/Mistral-7B-Instruct-v0.2 with 32k context window and rope_theta=1e6.

Intended primarily for research and experimentation.

Citation

mistralai/Mistral-7B-Instruct-v0.2

Mistral 7B Announcement

License

This model is distributed under the Apache 2.0 license, consistent with the original Mistral-7B-Instruct-v0.2.

Model Card Authors

This quantized model was prepared by PJEDeveloper.
