URL rewrites: kill the upstream host.
Without rewriting, Google and Open Graph crawlers index your origin hosts. The fix is small but mandatory: rewrite every absolute URL in the response that points at an origin back to the canonical host.
What to rewrite
<link rel="canonical" href="…"><link rel="alternate" hreflang="…" href="…"><meta property="og:url" content="…">and otherog:*URLs- JSON-LD blocks (
"url": "https://origin…") - The
LinkHTTP response header (when present)
How to rewrite it
Two common approaches. For small payloads, a regex pass is fine. For large HTML, use
Cloudflare's HTMLRewriter — it streams and is far more memory-efficient.
Quick & dirty: regex replace
function rewriteBody(html, fromHost, toHost) {
const re = new RegExp(`https?://${fromHost}`, 'g');
return html.replace(re, `https://${toHost}`);
} The right way: streaming HTMLRewriter
function rewriteResponse(upstream, fromHost, toHost) {
return new HTMLRewriter()
.on('link[rel="canonical"], link[rel="alternate"]', {
element(el) {
const href = el.getAttribute('href');
if (href) el.setAttribute('href', href.replace(fromHost, toHost));
},
})
.on('meta[property^="og:"]', {
element(el) {
const v = el.getAttribute('content');
if (v && v.includes(fromHost)) el.setAttribute('content', v.replace(fromHost, toHost));
},
})
.transform(upstream);
} Try it: see the rewrite in action
Edit the config, then enter a request URL to see what the Worker would do. Nothing is sent over the network — this is a faithful reimplementation of the same rules in the browser.
Path rules (first match wins)
Toggle "Rewrite canonical / og:url / hreflang" off and re-run. The simulator
shows the upstream host leaking through. That's what a crawler would index.
Don't forget the Link header
Some frameworks set Link: <…>; rel="canonical" as a response header. Easy to miss
because it isn't in the body.
const link = upstream.headers.get('link');
if (link && link.includes(fromHost)) {
upstream.headers.set('link', link.replaceAll(fromHost, toHost));
}