Google recently introduced Gemma3-270M, a smaller Gemma3 model with "only" 270 million parameters instead of billions.
The most interesting aspect of this model to me is that it is explicitly intended to run locally, without requiring specialized datacenter infrastructure. The potential to run the model air-gapped, isolated from the outside world, would be interesting for some future stuff I'm working on.
The eventual uses would involve communication in German, so I decided to try fine-tuning the model to answer questions in German specifically. As a starting point I referenced an existing Colab notebook that fine-tunes Gemma3-270M to predict chess moves. Chess isn't a particularly interesting application for LLMs to me personally, since we have better ways to use neural networks to play chess, but the training flow is the same.
We start by loading dependencies and instantiating the gemma-3-270m-it model.
%%capture
import os

if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft
    !pip install --no-deps trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth

from unsloth import FastModel
import torch

max_seq_length = 2048

model, tokenizer = FastModel.from_pretrained(
    model_name = "unsloth/gemma-3-270m-it",
    max_seq_length = max_seq_length,  # Choose any for long context!
    load_in_4bit = False,     # 4 bit quantization to reduce memory
    load_in_8bit = False,     # [NEW!] A bit more accurate, uses 2x memory
    full_finetuning = False,  # [NEW!] We have full finetuning now!
    # token = "hf_...",       # use one if using gated models
)
We set it up to accept training data in a chat format using the Hugging Face deepset/germanquad dataset, a curated question-answering dataset built from the German Wikipedia and various academic sources.
model = FastModel.get_peft_model(
    model,
    r = 128,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 128,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,  # Seems pretty random
    use_rslora = False,
    loftq_config = None,
)

from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(tokenizer, chat_template = "gemma3")

from datasets import load_dataset
dataset = load_dataset("deepset/germanquad", split = "train[:10000]")

# Map each germanquad record to a system/user/assistant conversation:
# the Wikipedia context becomes the system message, the question the user
# turn, and the first reference answer the assistant turn.
def convert_to_chatml(example):
    return {
        "conversations": [
            {"role": "system", "content": example["context"]},
            {"role": "user", "content": example["question"]},
            {"role": "assistant", "content": example["answers"]["text"][0]},
        ]
    }

dataset = dataset.map(convert_to_chatml)

# Render each conversation into a plain training string with the gemma3
# chat template; strip <bos> here since the trainer's tokenizer adds it again.
def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [
        tokenizer.apply_chat_template(
            convo, tokenize = False, add_generation_prompt = False
        ).removeprefix('<bos>')
        for convo in convos
    ]
    return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched = True)

from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    eval_dataset = None,
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 8,
        gradient_accumulation_steps = 1,
        warmup_steps = 5,
        num_train_epochs = 1,
        max_steps = 100,
        learning_rate = 5e-5,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",
    ),
)

# Mask the loss on everything except the model's responses, so training
# only teaches the assistant turns rather than the prompts.
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<start_of_turn>user\n",
    response_part = "<start_of_turn>model\n",
)
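Before training, it's worth sanity-checking that the conversion and chat template did what we expect. A minimal sketch; the exact rendered layout depends on the gemma3 template version that unsloth ships:

# Inspect one converted record: the raw conversation and its rendered form.
# The rendered string should contain <start_of_turn>user ... <start_of_turn>model turns.
print(dataset[0]["conversations"])
print(dataset[0]["text"])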
We then train the model. This took about three minutes on Google Colab using a Tesla T4 GPU.
trainer_stats = trainer.train()
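trainer.train() returns a TrainOutput whose metrics record how the run went. A quick sketch for reading them; the key names come from transformers and may vary slightly between versions:

# Rough post-training check of runtime and loss.
print(trainer_stats.metrics.get("train_runtime"))  # seconds spent training
print(trainer_stats.metrics.get("train_loss"))     # mean training loss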
Now, the real test: can it give good answers to questions not in its training data?
messages = [
    {'role': 'system', 'content': 'Bielefeld'},
    {'role': 'user',   'content': 'Gibt es Bielefeld?'},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True,  # Must add for generation
).removeprefix('<bos>')

from transformers import TextStreamer
_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    max_new_tokens = 125,
    temperature = 1, top_p = 0.95, top_k = 64,
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)

<bos><start_of_turn>user
Gibt es Bielefeld?
<end_of_turn>
<start_of_turn>model
Ja
<end_of_turn>
Indeed yes, it can!
If that interaction doesn't make much sense: it is a German joke, alleging that the city of Bielefeld doesn't actually exist. Wikipedia has an explanation in English.
The trained model says that Bielefeld does exist. Clearly it has no sense of humor.
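For the air-gapped use mentioned at the top, the fine-tuned weights have to leave Colab. A minimal sketch, assuming the standard PEFT saving API; the directory name here is made up:

# Save the LoRA adapters and tokenizer locally (directory name is arbitrary).
model.save_pretrained("gemma3-270m-germanquad-lora")
tokenizer.save_pretrained("gemma3-270m-germanquad-lora")
# Unsloth can also merge the adapters into the base weights for standalone use
# (e.g. via model.save_pretrained_merged), if that API is available in your version.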