Step 3 of 6

URL rewrites: kill the upstream host.

Without rewriting, Google and Open Graph crawlers index your origin hosts. The fix is small but mandatory: rewrite every absolute URL in the response that points at an origin back to the canonical host.

What to rewrite

How to rewrite it

Two common approaches. For small payloads, a regex pass is fine. For large HTML, use Cloudflare's HTMLRewriter — it streams and is far more memory-efficient.

Quick & dirty: regex replace

function rewriteBody(html, fromHost, toHost) {
  const re = new RegExp(`https?://${fromHost}`, 'g');
  return html.replace(re, `https://${toHost}`);
}

The right way: streaming HTMLRewriter

function rewriteResponse(upstream, fromHost, toHost) {
  return new HTMLRewriter()
    .on('link[rel="canonical"], link[rel="alternate"]', {
      element(el) {
        const href = el.getAttribute('href');
        if (href) el.setAttribute('href', href.replace(fromHost, toHost));
      },
    })
    .on('meta[property^="og:"]', {
      element(el) {
        const v = el.getAttribute('content');
        if (v && v.includes(fromHost)) el.setAttribute('content', v.replace(fromHost, toHost));
      },
    })
    .transform(upstream);
}

Try it: see the rewrite in action

Edit the config, then enter a request URL to see what the Worker would do. Nothing is sent over the network — this is a faithful reimplementation of the same rules in the browser.

Path rules (first match wins)

Toggle "Rewrite canonical / og:url / hreflang" off and re-run. The simulator shows the upstream host leaking through. That's what a crawler would index.

Don't forget the Link header

Some frameworks set Link: <…>; rel="canonical" as a response header. Easy to miss because it isn't in the body.

const link = upstream.headers.get('link');
if (link && link.includes(fromHost)) {
  upstream.headers.set('link', link.replaceAll(fromHost, toHost));
}