2048 context length?
In LM Studio, the context window is set to a max of 2048.
Is that expected? It seems quite low compared to all the other recent models I've worked with.
Yeah, I know it's weird, but this model, like the others in the series, actually has a context length of 32768.
Edit: Seems this doesn't work anymore.
@Goekdeniz-Guelmez any idea how to override this max in LM Studio?
In the config.json file in the model folder, change the "max_position_embeddings" line to "max_position_embeddings": 32768,
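If you'd rather script that edit than open the file by hand, here's a minimal sketch; the path is a placeholder, so point it at wherever LM Studio keeps the model folder on your machine:

```ts
// One-off patch: raise the advertised max context in the model's config.json.
// The path below is a placeholder, not LM Studio's actual default location.
import { readFileSync, writeFileSync } from "node:fs";

const configPath = "/path/to/lm-studio/models/mlx-community/GLM-4-32B-0414-4bit/config.json";
const config = JSON.parse(readFileSync(configPath, "utf8"));
config.max_position_embeddings = 32768; // the value LM Studio reads as the max context
writeFileSync(configPath, JSON.stringify(config, null, 2) + "\n");
```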
tl;dr
The model is fine; LM Studio just guesses 2048 for MLX builds. Set the Context Length manually (gear icon ▶ 32768).
Why “Max context 2048”?
- LM Studio’s MLX backend (the one that loads the Apple-Silicon-friendly .npz/4-bit weights) relies on mlx-lm, and in some versions the indexer can’t read long-context hints from GLM configs, so it falls back to 2048 and prints that in the UI. The same issue was reported for other MLX conversions (Gemma-3, etc.): Gemma 3 Context Window capped at 4096 · Issue #48 · ml-explore/mlx-lm · GitHub.
Good news is that it seems to be just cosmetic; you can override it at load time.
The quick fix with no file editing
- In My Models ▸ GLM-4-32B-0414-4bit click ⚙︎ Load settings.
- Change Context length from 2048 → 32768 (or whatever your memory allows). 32 k of context on a 4-bit 32 B model works out to roughly 18 GiB once you add the KV-cache on top of the weights (rough math in the sketch after this list), so start lower if you’re on an M-series with <64 GB unified memory.
- Press Save as default → Load model. It might still say “max 2048” in some places, but generation runs past that. I had it generate a bunch of scripts for me and it didn't even come close to filling the context window.
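For the memory estimate in the second step, here's the back-of-the-envelope math, assuming an fp16 KV-cache and the layer/head counts from the config.json posted further down:

```ts
// Rough KV-cache + weight memory estimate for GLM-4-32B-0414 at 32 k context.
// Numbers are taken from the config.json below; the fp16 cache dtype is an assumption.
const numLayers = 61;      // num_hidden_layers
const numKvHeads = 2;      // num_key_value_heads (GQA)
const headDim = 128;       // head_dim
const contextLen = 32768;
const bytesPerElem = 2;    // fp16 K/V entries

// K and V caches: 2 tensors per layer, each [numKvHeads, contextLen, headDim].
const kvCacheBytes = 2 * numLayers * numKvHeads * headDim * contextLen * bytesPerElem;
console.log("KV-cache:", (kvCacheBytes / 2 ** 30).toFixed(1), "GiB"); // ≈ 1.9 GiB

// 4-bit weights for ~32 B parameters: ~0.5 bytes per weight, plus group scales/biases.
const weightBytes = 32e9 * 0.5;
console.log("Weights:", (weightBytes / 2 ** 30).toFixed(1), "GiB");   // ≈ 14.9 GiB

// Together that lands around 17-18 GiB before runtime overhead, which is where
// the "~18 GiB" figure above comes from.
```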
If you're loading through the REST SDK, just add the parameters there as well:
```ts
{
  model: "mlx-community/GLM-4-32B-0414-4bit",
  loadConfig: {
    contextLength: 32768,
    ropeFrequencyBase: 1_000_000, // optional but helps with >8 k
    ropeFrequencyScale: 1.0
  }
}
```
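For completeness, here's roughly what that looks like as a full load call with the @lmstudio/sdk TypeScript client. Treat it as a sketch: I'm assuming client.llm.load() with a config object using these field names, and the option shape (config vs. loadConfig) has shifted between SDK versions, so check the docs for the version you have installed.

```ts
import { LMStudioClient } from "@lmstudio/sdk";

async function main() {
  const client = new LMStudioClient();

  // Load the MLX build with an explicit context length instead of the guessed 2048.
  // Field names mirror the snippet above; verify them against your SDK version.
  const model = await client.llm.load("mlx-community/GLM-4-32B-0414-4bit", {
    config: {
      contextLength: 32768,
      ropeFrequencyBase: 1_000_000,
      ropeFrequencyScale: 1.0,
    },
  });

  // Quick sanity check that generation works with the larger window.
  const result = await model.respond("Say hi in one sentence.");
  console.log(result.content);
}

main();
```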
I had done some other stuff too, but the above seems to be what fixed it for me. The config didn't have a bos_token, the eos_token was two values, and there was some double-quant stuff, so I changed those as well. I'm including my hacked-up config below in case the above doesn't work by itself, but I believe the context-length change is what actually got it working.
```json
{
  "architectures": ["Glm4ForCausalLM"],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 151329,
  "eos_token_id": 151336,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 6144,
  "initializer_range": 0.02,
  "intermediate_size": 23040,
  "max_position_embeddings": 32768,
  "model_type": "glm4",
  "num_attention_heads": 48,
  "num_hidden_layers": 61,
  "num_key_value_heads": 2,
  "pad_token_id": 151329,
  "partial_rotary_factor": 0.5,
  "quantization": {
    "group_size": 64,
    "bits": 4
  },
  "rms_norm_eps": 1e-05,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.0",
  "use_cache": true,
  "vocab_size": 151552,
  "additional_eos_token_ids": [151329, 151338]
}
```