Comparing Local vs Cloud LLMs for Content Filtering

I have been building a home network content filter that uses AI to classify web pages in real time. The core research question is simple: can a local model running on your own hardware match a cloud API like OpenAI or Anthropic for detecting inappropriate content? And if not, how close does it get, and at what cost difference?

To answer this I built a system that evaluates every URL with multiple providers simultaneously and tracks accuracy, latency, and cost per classification. Here is what I have found so far.

The Setup

The filter sits on the network as a transparent proxy using mitmproxy. When someone visits a website, the proxy captures the HTML response and queues it for background processing. A Python worker picks up the URL, extracts the page content, and sends it to every active LLM provider for classification.

The full request lifecycle looks like this:

Browser ──► mitmproxy ──► Valkey cache check
                           │
              ┌────────────┴────────────┐
              │                         │
         Cache Hit                 Cache Miss
         (blocked)              Pass response through
              │                 Queue URL + HTML
              ▼                         │
         403 Block              Python Worker dequeues
          Page                          │
                          ┌─────────────┴──────────────┐
                          │                            │
                    Static HTML                   JS-heavy site
                    markdownify (~1ms)          crawl4ai + Chromium
                          │                            │
                          └─────────┬──────────────────┘
                                    │
                          Each active LLM config
                          classifies content
                                    │
                     ┌──────────────┼──────────────┐
                     │              │              │
                  Safe        NeedsReview      Unsafe
                  Cache it    Cache it      Auto-block rule
                                            Cache "blocked"
                                                  │
                              Publish url_analysis_completed
                                                  │
                              .NET API SSE ──► Angular reload

Each provider independently classifies the content as Safe, Unsafe, or NeedsReview. The results are stored with token counts, response latency, and calculated cost. A human review interface lets me mark classifications as correct or incorrect to build ground truth data for accuracy measurement.

Content Extraction Is the Hard Part

Before you can classify a page you need to extract its content. This sounds trivial but modern websites make it surprisingly difficult.

For static HTML pages, I use markdownify to convert the HTML to readable text. This takes about a millisecond per page and works well for blogs, news sites, and documentation.

For JavaScript-heavy sites (SPAs, dynamically loaded content), markdownify returns almost nothing because the content is rendered client-side. For these I use crawl4ai, which runs a headless Chromium browser with stealth mode to render the page, wait for content to load, and extract the visible text. This takes anywhere from 4 to 60 seconds depending on the site, but it handles anti-bot protections and Cloudflare challenges that would block a simple HTTP fetch.

The worker detects which approach to use automatically. If the raw HTML contains more than 200 characters of visible text, it uses the fast markdownify path. Otherwise it falls back to crawl4ai.

async def extract_content(url: str, raw_html: str) -> str:
    visible_text = extract_visible_text(raw_html)

    if len(visible_text) > 200:
        return markdownify(raw_html)

    # JS-heavy site, need browser rendering
    return await crawl4ai_extract(url)

When extraction fails (timeouts, anti-bot blocks, empty responses), the worker can retry with a configurable count. After max retries it marks the URL as NeedsReview so a human can look at it instead of silently dropping the classification.

Getting this extraction pipeline right mattered more than tuning the LLM prompts. A perfect classifier fed incomplete or garbled input will make bad decisions. Clean input is the foundation.

Multi-Provider Evaluation

Every URL in the system gets classified by all active LLM configurations. Right now I am running three: Ollama with a local quantized model, OpenAI GPT, and Anthropic Claude. Each provider gets the same extracted content and the same classification prompt.

The prompt asks the model to analyze the URL, domain, and extracted content, then classify it into one of three categories. The prompt includes specific definitions for content types that should be blocked (explicit material, hate speech, gambling, violence, illegal activities) and content that needs human review (ambiguous educational content, borderline social media, satirical content). This gives the models a consistent framework to work from.

Running all providers on every URL gives direct comparison data. Instead of testing one model for a week and then switching to another, I can see how they handle the exact same content side by side. The system stores every evaluation independently so I can compare provider A against provider B for the same URL at the same point in time.

The Proxy Layer

The mitmproxy addon is deliberately lightweight. It only intercepts GET requests for text/html content and skips everything else (images, CSS, JS, fonts, media files). This keeps the noise down and avoids wasting LLM tokens on non-content resources.

For each intercepted request, the addon captures the URL, the base64-encoded HTML response, and metadata like the source IP (for device-aware filtering), TLS fingerprint, and SNI. The bypass detection worker on the .NET side uses this metadata to identify VPN tunnels, DNS-over-HTTPS traffic, DNS-over-TLS, and proxy-avoidance attempts.

The proxy also checks Valkey for a cached classification before passing the response through. If a URL was previously classified as blocked, the user gets a 403 block page immediately. This makes repeat visits instant regardless of how long the initial classification took.

Observability

Every component in the system emits telemetry via OpenTelemetry. Traces propagate across the Python worker, .NET API, and mitmproxy with correlation IDs so I can follow a single URL from interception through extraction and classification.

Custom metrics track the operational health of the pipeline:

awf.queue.processed counts URLs dequeued
awf.extraction.latency measures how long extraction takes by method (markdownify vs crawl4ai)
awf.analysis.latency measures LLM inference time by provider
awf.llm.cost_usd tracks cumulative API spend
awf.llm.token_usage records token counts per evaluation

SigNoz collects all of this and provides dashboards for monitoring. When an extraction starts taking longer than expected or a provider starts returning errors, I can see it in the traces and drill down to the root cause.

What the Numbers Show

The results are still early (the system has been running for a few weeks) but some patterns are clear.

On obvious content, local models are competitive. Explicit adult sites, gambling portals, and known malware domains get correctly classified by the local Ollama model at rates close to the cloud APIs. These pages tend to have unambiguous content that even a smaller model can identify.

On ambiguous content, cloud APIs pull ahead. Educational health content, news articles covering sensitive topics, satire, and borderline social media pages are where the local model struggles. It tends to over-classify borderline content as unsafe. The cloud models show more nuance in distinguishing context.

Agreement rate is a useful signal. When all three providers agree on a classification, they are almost always correct. Disagreement is a strong indicator that the content is ambiguous and might benefit from human review. The research dashboard tracks agreement rate across all evaluated URLs.

Cost scales linearly for cloud APIs, is fixed for local. The local model costs nothing per classification (just electricity and hardware wear). Cloud APIs charge per token, so cost grows with traffic. At current prices, the crossover point where local becomes cheaper is surprisingly low for a home network with a few active users.

Human Review for Ground Truth

Automated accuracy numbers are meaningless without a ground truth baseline. The system includes a human review page where I can look at each classification and mark it correct or incorrect. From this data it calculates accuracy, precision, recall, and F1 scores per provider.

This was important to build early. Without it I would be comparing providers against each other instead of against a known-correct answer. Two models agreeing does not mean they are both right.

The review interface also surfaces disagreements (cases where one provider said Safe and another said Unsafe) which tend to be the most interesting and instructive cases. A confusion matrix on the research dashboard shows how each provider's predictions map to the actual ground truth, broken down by Safe, Unsafe, and NeedsReview.

Cost Tracking

Each LLM provider has a cost model configured with input and output token prices. Every classification records the token count, and the system calculates the cost automatically. The research dashboard shows cumulative spend per provider and cost per classification.

For a home network processing maybe 50 to 100 unique URLs per day, the cloud API costs are manageable (a few cents per day at most). But the gap widens with traffic. A school or small office filtering thousands of URLs per day would see meaningful savings from a local model, assuming the accuracy tradeoff is acceptable.

What I Would Do Differently

If I were starting over, I would invest more time upfront in the content extraction pipeline. I underestimated how much modern web architecture (SPAs, lazy loading, anti-bot measures) would complicate getting clean text out of pages. The AI classification part is comparatively straightforward once you have good input.

I would also add per-category accuracy tracking earlier. Knowing that a model has 90% overall accuracy is less useful than knowing it has 99% accuracy on gambling sites but only 70% on borderline social media content. Category-level metrics make it clear where each model excels and where it needs help.

The system is still running and collecting data. The longer it runs, the more confident the accuracy benchmarks become. That is the advantage of building this as a research platform rather than just a filter. Every URL analyzed makes the comparison data more robust.