The Claude API: a practical guide for developers

May 26, 2026 · 8 min read

The Claude API is straightforward to start with and deep enough to build serious products on. This guide covers the four features that show up most in real projects: streaming, prompt caching, tool use, and extended thinking. Each section includes working code and the reasoning behind the choices.

Setup

Install the SDK and set your key:

npm install @anthropic-ai/sdk

import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
});

All examples below use this client.

1. Streaming

Non-streaming requests block until the model finishes. For anything longer than a sentence, that is several seconds of nothing — enough for users to think it is broken. Streaming sends tokens as they arrive.

const stream = anthropic.messages.stream({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  messages: [{ role: "user", content: "Explain RSA encryption simply." }],
});

for await (const event of stream) {
  if (
    event.type === "content_block_delta" &&
    event.delta.type === "text_delta"
  ) {
    process.stdout.write(event.delta.text);
  }
}

Claude's event stream emits several event types: message_start, content_block_start, content_block_delta, content_block_stop, message_delta, and message_stop, plus occasional ping events. Only content_block_delta events whose delta.type is "text_delta" carry response text. The rest carry metadata such as token counts and stop reasons.

If you are building a web endpoint and want to forward the stream to a browser, pipe it through a ReadableStream with TextEncoder. The client reads with getReader() and appends each chunk to state. The critical detail is calling stream.abort() in the cancel callback so the API request stops if the user navigates away — otherwise it runs to completion and burns tokens unnecessarily.
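The forwarding pattern described above can be sketched roughly as follows. This is a minimal sketch, not the SDK's own helper: `StreamEvent` and `AbortableEventStream` are simplified stand-in types, and `toTextReadableStream` is a hypothetical name. It assumes the source is async-iterable and exposes abort(), as the stream object in the previous example does; adapt the types when wiring in the real object.

```typescript
// Simplified stand-ins (assumptions) for the SDK's event and stream types.
type StreamEvent = {
  type: string;
  delta?: { type: string; text?: string };
};

interface AbortableEventStream extends AsyncIterable<StreamEvent> {
  abort(): void;
}

// Hypothetical helper: wraps a model stream in a web ReadableStream of UTF-8 bytes.
function toTextReadableStream(
  source: AbortableEventStream,
): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();
  let cancelled = false;

  return new ReadableStream<Uint8Array>({
    async start(controller) {
      try {
        for await (const event of source) {
          if (cancelled) return; // reader is gone, stop forwarding
          if (
            event.type === "content_block_delta" &&
            event.delta?.type === "text_delta" &&
            event.delta.text
          ) {
            controller.enqueue(encoder.encode(event.delta.text));
          }
        }
        if (!cancelled) controller.close();
      } catch (err) {
        // Surface upstream failures to the reader unless it already left.
        if (!cancelled) controller.error(err);
      }
    },
    cancel() {
      // The reader went away (tab closed, navigation): stop the upstream request.
      cancelled = true;
      source.abort();
    },
  });
}
```

In a route handler you would return something like `new Response(toTextReadableStream(stream))`. Note that this sketch pushes chunks as fast as they arrive and does not apply backpressure, which is usually acceptable for chat-sized responses.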

2. Prompt Caching

If your application sends the same large context repeatedly (a long system prompt, a reference document, conversation history), you are re-sending and re-processing those tokens on every request. Prompt caching lets Claude cache a prefix of the input and reuse it, cutting the cost of the cached input tokens by up to 90% and reducing latency noticeably on cache hits.

Mark the content you want cached with cache_control:

import * as fs from "node:fs";

const response = await anthropic.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: fs.readFileSync("./large-reference-doc.txt", "utf-8"),
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [{ role: "user", content: "Summarise section 3." }],
});

The default cache TTL is 5 minutes and resets on each hit, so it stays warm as long as requests keep coming. For workloads where requests cluster but with longer gaps (a knowledge-base assistant that gets used a few times an hour, for example), pass cache_control: { type: "ephemeral", ttl: "1h" } to extend the cache lifetime to one hour at a higher write cost.

The response includes cache_creation_input_tokens and cache_read_input_tokens in usage — log these to verify caching is working. If reads are zero where you expect hits, the most common cause is a non-deterministic prefix: a timestamp injected into the system prompt, a user ID interpolated before the cached block, or model name drift. Anything that makes the cached prefix vary across requests busts the cache.
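One way to keep the prefix deterministic is to put the static content first, mark it as the cache breakpoint, and append anything that changes per request after it. The following is a minimal sketch of that ordering; `SystemBlock` is a simplified stand-in type and `buildSystemBlocks` is a hypothetical helper, not part of the SDK.

```typescript
// Simplified stand-in (assumption) for a text block in the system array.
type SystemBlock = {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
};

// Hypothetical helper: static, cacheable content first; volatile content last.
function buildSystemBlocks(
  referenceDoc: string,
  userId: string,
  now: Date,
): SystemBlock[] {
  return [
    // Identical on every request, so this prefix can be served from cache.
    { type: "text", text: referenceDoc, cache_control: { type: "ephemeral" } },
    // Changes per request; it sits after the breakpoint, so it does not
    // alter the cached prefix.
    {
      type: "text",
      text: `Current user: ${userId}. Current time: ${now.toISOString()}.`,
    },
  ];
}
```

Passing the result as `system` keeps the first block identical across users and timestamps, which is what the cache lookup depends on.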

Caching is worth adding any time your system prompt exceeds roughly 1,000 tokens (shorter prefixes may fall below the model's minimum cacheable length) or you attach documents to every request. For a customer support bot with a large knowledge base, it is not optional: it is the difference between a viable cost structure and an unaffordable one.

3. Tool Use

Tool use (function calling) lets Claude decide to call a function you define, rather than answering in text. This is how you build agents that can look things up, run calculations, or take actions in external systems.

Define tools in the request:

// Hoisted into a constant so the follow-up request can reuse the same definitions.
const tools: Anthropic.Tool[] = [
  {
    name: "get_weather",
    description: "Get the current weather for a city.",
    input_schema: {
      type: "object",
      properties: {
        city: { type: "string", description: "City name" },
        unit: { type: "string", enum: ["celsius", "fahrenheit"] },
      },
      required: ["city"],
    },
  },
];

const response = await anthropic.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  tools,
  messages: [{ role: "user", content: "What is the weather in Karachi?" }],
});

When Claude decides to call a tool, stop_reason is "tool_use" and content contains a tool_use block:

if (response.stop_reason === "tool_use") {
  const toolCall = response.content.find((b) => b.type === "tool_use");
  if (toolCall?.type === "tool_use") {
    const { city, unit } = toolCall.input as { city: string; unit?: string };
    const weatherData = await fetchWeather(city, unit); // your function

    // Send the result back
    const followUp = await anthropic.messages.create({
      model: "claude-sonnet-4-6",
      max_tokens: 1024,
      tools, // same tool definitions as the first request
      messages: [
        { role: "user", content: "What is the weather in Karachi?" },
        { role: "assistant", content: response.content },
        {
          role: "user",
          content: [
            {
              type: "tool_result",
              tool_use_id: toolCall.id,
              content: JSON.stringify(weatherData),
            },
          ],
        },
      ],
    });
  }
}

The conversation structure is: user message → Claude's tool call (in the assistant turn) → tool result (in a user turn) → Claude's final response. Keeping track of this correctly is the main source of bugs when first implementing tool use.

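For multi-tool agents, loop while stop_reason === "tool_use": execute every tool_use block in the response, append all the results in a single user turn, and call the API again. Stop when Claude returns any other stop reason, normally "end_turn".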
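A loop of this shape might look like the sketch below. It is written against simplified stand-in types rather than the SDK's own, and `runToolLoop`, `callModel`, and `handlers` are hypothetical names: `callModel` is assumed to wrap a call such as anthropic.messages.create with your model, tools, and the messages passed in. The `maxTurns` cap is an added safety guard, not something the API requires.

```typescript
// Simplified stand-ins (assumptions) for the SDK's message and block types.
type ToolUseBlock = { type: "tool_use"; id: string; name: string; input: unknown };
type TextBlock = { type: "text"; text: string };
type ContentBlock = ToolUseBlock | TextBlock;
type ToolResultBlock = {
  type: "tool_result";
  tool_use_id: string;
  content: string;
  is_error?: boolean;
};
type Message = {
  role: "user" | "assistant";
  content: string | Array<ContentBlock | ToolResultBlock>;
};
type ModelResponse = { stop_reason: string | null; content: ContentBlock[] };
type ToolHandler = (input: unknown) => Promise<unknown>;

async function runToolLoop(
  callModel: (messages: Message[]) => Promise<ModelResponse>,
  handlers: Record<string, ToolHandler>,
  messages: Message[],
  maxTurns = 10,
): Promise<ModelResponse> {
  const history = [...messages];

  for (let turn = 0; turn < maxTurns; turn++) {
    const response = await callModel(history);
    if (response.stop_reason !== "tool_use") return response;

    // The assistant turn containing the tool calls must be echoed back as-is.
    history.push({ role: "assistant", content: response.content });

    // Run every requested tool and collect the results into one user turn.
    const results: ToolResultBlock[] = [];
    for (const block of response.content) {
      if (block.type !== "tool_use") continue;
      try {
        const handler = handlers[block.name];
        if (!handler) throw new Error(`Unknown tool: ${block.name}`);
        const output = await handler(block.input);
        results.push({
          type: "tool_result",
          tool_use_id: block.id,
          content: JSON.stringify(output),
        });
      } catch (err) {
        // Report the failure to the model instead of crashing the loop.
        results.push({
          type: "tool_result",
          tool_use_id: block.id,
          content: String(err),
          is_error: true,
        });
      }
    }
    history.push({ role: "user", content: results });
  }

  throw new Error(`Tool loop did not finish within ${maxTurns} turns`);
}
```

Returning tool failures as tool_result blocks with is_error set lets Claude decide whether to retry, pick another tool, or explain the problem to the user.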

4. Extended Thinking

For problems that benefit from working through intermediate steps (complex reasoning, math, planning), extended thinking gives Claude a scratchpad to think before answering. The reasoning comes back in separate thinking blocks ahead of the text blocks, so you can log it or hide it and show users only the final answer. The quality of that answer improves noticeably on hard problems.

const response = await anthropic.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 16000,
  thinking: {
    type: "enabled",
    budget_tokens: 10000,
  },
  messages: [
    {
      role: "user",
      content: "Design a database schema for a multi-tenant SaaS app with usage-based billing.",
    },
  ],
});

for (const block of response.content) {
  if (block.type === "thinking") {
    console.log("Thinking:", block.thinking); // internal reasoning
  }
  if (block.type === "text") {
    console.log("Answer:", block.text);
  }
}

budget_tokens controls how long Claude can think. It must be smaller than max_tokens, because thinking tokens count toward the output limit and are billed as output tokens. Higher budgets produce better answers on genuinely hard problems but cost more. For most questions, extended thinking is overkill: use it when the output quality difference is measurable, not by default.

Choosing the right model

Anthropic's current lineup:

| Model | Best for |
|---|---|
| claude-opus-4-7 | Hard reasoning, complex agents, extended thinking |
| claude-sonnet-4-6 | Most production workloads; best cost/quality balance |
| claude-haiku-4-5 | High-volume, low-latency tasks: classification, simple extraction |

Start with Sonnet. Move to Opus only if Sonnet's output quality is measurably insufficient. Move to Haiku for tasks where you need sub-second responses at scale.

What to watch in production

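Four numbers to track: input_tokens, output_tokens, cache_creation_input_tokens, and cache_read_input_tokens. Together they determine your cost, since cache writes and cache reads are billed at different rates from regular input. Cache reads also tell you whether your caching setup is actually working: log the share of input tokens served from cache (cache reads divided by total input tokens) to your dashboard and alert if it drops.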
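As a rough sketch, the cache ratio can be computed from the usage object on each response. `Usage` here is a simplified stand-in type and `cacheHitRatio` is a hypothetical helper. It assumes input_tokens counts only the uncached part of the prompt, so total input is the sum of all three input fields.

```typescript
// Simplified stand-in (assumption) for the usage object on a response.
type Usage = {
  input_tokens: number;
  output_tokens: number;
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
};

// Share of all input tokens that were served from cache, between 0 and 1.
function cacheHitRatio(usage: Usage): number {
  const read = usage.cache_read_input_tokens ?? 0;
  const written = usage.cache_creation_input_tokens ?? 0;
  const totalInput = usage.input_tokens + read + written;
  return totalInput === 0 ? 0 : read / totalInput;
}
```

A first request that only writes the cache scores 0; once the cache is warm, the ratio should climb toward the fraction of your prompt that sits before the breakpoint.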

Rate limits are per-minute on both requests and tokens, and they apply per organization, not per API key. For high-volume apps, implement a queue with exponential backoff on 429 responses rather than failing immediately. The response includes anthropic-ratelimit-* headers (requests-remaining, tokens-remaining, requests-reset, tokens-reset) — back off proactively as you approach the limit rather than waiting for the 429.
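A retry wrapper with exponential backoff could look something like this. It is a minimal sketch under a few assumptions: `backoffDelayMs` and `withRateLimitRetry` are hypothetical names, the thrown error is assumed to carry a numeric `status` property (as the SDK's API errors do), and the delays use full jitter so that parallel workers do not retry in lockstep. The SDK also retries some failures on its own via its maxRetries client option, so check how the two layers combine before stacking them.

```typescript
// Delay before retry number `attempt` (0-based): exponential, capped, fully jittered.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 30_000): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.random() * ceiling;
}

// Retries `fn` on HTTP 429 only; any other error is rethrown immediately.
async function withRateLimitRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 5,
  // Injectable so tests (or a queue) can control the waiting.
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const status = (err as { status?: number } | null)?.status;
      const isLastAttempt = attempt + 1 >= maxAttempts;
      if (status !== 429 || isLastAttempt) throw err;
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

Usage would be along the lines of `await withRateLimitRetry(() => anthropic.messages.create(params))`.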

For batched workloads where latency does not matter (overnight data processing, bulk summarisation), use the Message Batches API instead. Batch requests cost 50% less and have separate rate limits, so they do not compete with your real-time traffic.
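Preparing a batch is mostly a matter of giving each request a custom_id you can match results against later. The sketch below builds the request list for a bulk summarisation job; `BatchRequest` is a simplified stand-in type, `buildSummaryBatch` and the `summary-` ID prefix are hypothetical, and the submission call in the trailing comment assumes the client from Setup.

```typescript
// Simplified stand-in (assumption) for one entry in a batch submission.
type BatchRequest = {
  custom_id: string;
  params: {
    model: string;
    max_tokens: number;
    messages: { role: "user"; content: string }[];
  };
};

// Hypothetical helper: one summarisation request per document.
function buildSummaryBatch(
  docs: { id: string; text: string }[],
  model = "claude-sonnet-4-6",
): BatchRequest[] {
  return docs.map(
    (doc): BatchRequest => ({
      // Must be unique within the batch; used to match results to inputs.
      custom_id: `summary-${doc.id}`,
      params: {
        model,
        max_tokens: 1024,
        messages: [
          {
            role: "user",
            content: `Summarise the following document:\n\n${doc.text}`,
          },
        ],
      },
    }),
  );
}

// Submitting (not executed here):
// const batch = await anthropic.messages.batches.create({
//   requests: buildSummaryBatch(docs),
// });
// Then poll anthropic.messages.batches.retrieve(batch.id) until
// processing_status is "ended" before reading the results.
```

Results are not guaranteed to come back in submission order, which is why the custom_id matters.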

Farhan Shafi

Full-Stack Developer