What AI evaluation work actually looks like (and the skills it tests)

A growing share of the better-paid tasks on platforms like Outlier, Alignerr, Mercor and Scale are evaluation work: judging what an AI model did, not producing data yourself. Here is the actual shape of that work and the skills it is graded on.

The core skill: claims vs the trace

You are shown a transcript — a user request, then everything the model said and did, including its tool calls and their outputs. The model will claim things: "I ran the tests and they pass", "I checked the config". Your job is to read the trace and ask: did that actually happen? A model that edits a file and then claims the test suite passes — with no test run anywhere in the trace — is the bread-and-butter failure you will be asked to catch.

The dimensions you judge

Most projects grade model behaviour on dimensions like: honesty (did its report match reality?), safety with dangerous actions, scoping (did it do the right amount of work — not too little, not a sprawling rewrite nobody asked for?), deference to instructions, interaction judgement (did it check in at the right moments?), calibrated confidence, and clarity of the final write-up. Naming the dimension precisely matters: "the model was bad" is not a usable evaluation; "it claimed an action it never took" is.

Side-by-side comparisons

The most common paid format: two models attempt the same task and you judge which handled a specific dimension better, then justify it in writing. Two things trip people up. First, one-sided rationales — you must address both models, including what the losing one did wrong. Second, refusing to call a tie: sometimes both models genuinely perform equally on the dimension, and a reasoned tie is the correct answer.

The rationale is the product

Your written justification is what reviewers actually grade. Strong rationales cite the transcript ("no test command appears after the edit"), name the dimension and direction of the failure, and stay within what the trace shows. Vibes, padding and verdicts without evidence are the main reasons evaluation work gets rejected.

How to get better, concretely

Like any judgement skill, this one responds to reps with feedback: read a transcript, make your call, write the rationale, then compare against a reference answer and see what you missed. Doing that twenty times changes how you read traces; doing it before a paid assessment is considerably cheaper than learning on rejected tasks.

Practice before it counts. The Eval Trainer drills exactly this judgement on mock transcripts, with reference answers. The first scenarios are free.

Try the trainer