QwQ-32B vs DeepSeek-R1-671B

Ahamed Musthafa R S (amrs-tech) - Mar 6

Qwen is a series of LLMs released and maintained by Alibaba Cloud, and QwQ is the reasoning-focused model in the Qwen series. A while ago, the team released a preview version of this model, and now they've released the full QwQ-32B model. It is available on Hugging Face and in the Ollama model repository.

Image generated by ChatGPT

Links

https://huggingface.co/Qwen/QwQ-32B
https://ollama.com/library/qwq
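
Since the model is published in the Ollama library, you can also run it locally. Below is a minimal sketch of my own (not from the Qwen team) using the official ollama Python client, assuming the Ollama server is running and you have already pulled the model with ollama pull qwq:

# Minimal sketch: chatting with a locally pulled QwQ model via the ollama Python client.
# Assumes `pip install ollama` and `ollama pull qwq` have been done already.
import ollama

response = ollama.chat(
    model="qwq",
    messages=[
        {"role": "user", "content": "How many r's are in the word \"strawberry\"?"}
    ],
)

# The reply text (including the model's reasoning) is in the message content
print(response["message"]["content"])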

They've used a reinforcement learning (RL) scaling approach driven by outcome-based rewards. As mentioned in their blog post, instead of a traditional reward model, an accuracy verifier is used in training this model; it is trained with rewards from a general reward model and from some rule-based verifiers. You can use QwQ-32B via Hugging Face Transformers and the Alibaba Cloud DashScope API.
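
Before getting to the usage examples, here is a purely illustrative toy sketch (my own example, not Qwen's training code) of what an outcome-based, rule-based accuracy verifier can look like: the reward depends only on whether the final answer is correct, not on how the reasoning reads.

import re

def extract_final_answer(completion: str) -> str:
    """Toy heuristic: take the last \\boxed{...} expression as the final answer."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    return matches[-1].strip() if matches else ""

def outcome_reward(completion: str, reference_answer: str) -> float:
    """Rule-based accuracy verifier: 1.0 if the final answer matches the reference, else 0.0."""
    return 1.0 if extract_final_answer(completion) == reference_answer.strip() else 0.0

# Only the outcome is rewarded, regardless of the reasoning path taken
print(outcome_reward("... so the answer is \\boxed{3}", "3"))  # 1.0
print(outcome_reward("... so the answer is \\boxed{2}", "3"))  # 0.0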

Example Code with Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/QwQ-32B"

# Load the checkpoint in its native precision and shard it across available devices
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "How many r's are in the word \"strawberry\""
messages = [
    {"role": "user", "content": prompt}
]
# Apply QwQ's chat template and append the generation prompt for the assistant turn
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Tokenize the formatted prompt and move it to the model's device
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Reasoning models can produce long outputs, so allow a generous token budget
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
# Strip the prompt tokens so only the newly generated text is decoded
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
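
QwQ emits its chain of thought before the final answer. Assuming the decoded output separates the reasoning from the answer with a </think> marker (as the reasoning models in this family do; verify against your own runs), a small helper of my own like the one below can split the two parts:

def split_reasoning(response: str) -> tuple[str, str]:
    """Split QwQ output into (reasoning, final answer) using the </think> marker, if present."""
    if "</think>" in response:
        thinking, _, answer = response.partition("</think>")
        return thinking.strip(), answer.strip()
    return "", response.strip()

reasoning, answer = split_reasoning(response)  # `response` from the snippet above
print("Final answer:", answer)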

Example Code with DashScope API

from openai import OpenAI
import os

# Initialize OpenAI client
client = OpenAI(
    # If the environment variable is not configured, replace with your API Key: api_key="sk-xxx"
    # How to get an API Key: https://help.aliyun.com/zh/model-studio/developer-reference/get-api-key
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"
)

reasoning_content = ""
content = ""

is_answering = False

completion = client.chat.completions.create(
    model="qwq-32b",
    messages=[
        {"role": "user", "content": "Which is larger, 9.9 or 9.11?"}
    ],
    stream=True,
    # Uncomment the following line to return token usage in the last chunk
    # stream_options={
    #     "include_usage": True
    # }
)

print("\n" + "=" * 20 + "reasoning content" + "=" * 20 + "\n")

for chunk in completion:
    # If chunk.choices is empty, print usage
    if not chunk.choices:
        print("\nUsage:")
        print(chunk.usage)
    else:
        delta = chunk.choices[0].delta
        # Print reasoning content
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content is not None:
            print(delta.reasoning_content, end='', flush=True)
            reasoning_content += delta.reasoning_content
        else:
            if delta.content != "" and is_answering is False:
                print("\n" + "=" * 20 + "content" + "=" * 20 + "\n")
                is_answering = True
            # Print content
            print(delta.content, end='', flush=True)
            content += delta.content
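
If you want to continue the conversation afterwards, the usual pattern with these reasoning APIs is to feed back only the collected content as the assistant's previous turn, not the reasoning_content. A sketch of a follow-up call, reusing the client and the content variable from above (streaming kept on, as in the example):

# Follow-up turn (sketch): pass back only the final answer, not the reasoning content
follow_up = client.chat.completions.create(
    model="qwq-32b",
    messages=[
        {"role": "user", "content": "Which is larger, 9.9 or 9.11?"},
        {"role": "assistant", "content": content},   # `content` collected in the loop above
        {"role": "user", "content": "Explain your answer in one sentence."}
    ],
    stream=True
)
for chunk in follow_up:
    # Print only the answer tokens, skipping empty/reasoning-only chunks
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end='', flush=True)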

Performance Evaluation

Below is the evaluation chart showing how this 32B model competes against other reasoning models, especially DeepSeek-R1-671B.

Image from blog post by Qwen

It competes closely with the DeepSeek-R1-671B model across all five benchmarks and outperforms OpenAI-o1-mini (except on IFEval).

I was wondering what OpenAI’s ChatGPT would think about it 🤣

Here's a small insight from the comparison between QwQ-32B and DeepSeek-R1-671B, generated by ChatGPT from the provided chart.
(NOTE: For some reason, ChatGPT shows 671B as 67.1B; please ignore that.)

Image by Author (Screenshot of output from ChatGPT)

It is clear that the storage and hardware requirements are far higher for the DeepSeek-R1-671B model than for QwQ-32B, as the rough estimate below illustrates.
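
A quick back-of-the-envelope calculation (my own, not from the chart) makes the gap concrete: weight storage scales roughly with parameter count times bytes per parameter, so at FP16 (2 bytes per parameter) the difference is dramatic. This ignores KV cache, activations, and the fact that DeepSeek-R1 is a mixture-of-experts model that activates only a fraction of its parameters per token.

# Rough weight-storage estimate at FP16 (2 bytes per parameter); real deployments
# also need memory for the KV cache, activations, and runtime overhead.
def fp16_weight_size_gb(num_params: float) -> float:
    return num_params * 2 / 1e9  # bytes -> GB (decimal)

print(f"QwQ-32B:          ~{fp16_weight_size_gb(32e9):.0f} GB")    # ~64 GB
print(f"DeepSeek-R1-671B: ~{fp16_weight_size_gb(671e9):.0f} GB")   # ~1342 GB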

Based on the evaluation, QwQ-32B performs very well on most real-world reasoning tasks despite its much smaller size.

You can also try using the QwQ reasoning model in Qwen Chat at https://chat.qwen.ai/

NOTE: Some of the information in this post is referenced from this video: https://youtu.be/W85kbOduL8c?si=058s4_cmslrhRAxk

Happy Learning!

. . .