Threat Intel Enrichment with STIX/TAXII and Python
Threat intelligence is only useful if you can operationalize it. STIX provides a standard data model for indicators, and TAXII provides an API to move those indicators between systems. In a lab, you can build a small enrichment pipeline that pulls indicators from a TAXII feed and matches them against your logs.
The goal of this post is to show a minimal, practical workflow using Python. We will fetch indicator objects from a TAXII server, store them locally, and match them against DNS logs from Zeek.
STIX and TAXII in plain terms
STIX is a standardized JSON data model for cyber threat intelligence. A STIX indicator object might hold a file hash, IP address, domain, or URL pattern. TAXII is the transport API that lets you query and download STIX bundles from a collection.
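To make that concrete, here is a sketch of a single indicator built with the stix2 library (installed in the next section). The name and domain are placeholders, and the exact required fields differ slightly between STIX 2.0 and 2.1.
from datetime import datetime, timezone
import stix2

# A minimal STIX 2.1 indicator for a placeholder malicious domain.
# The library fills in id, created, and modified automatically.
indicator = stix2.Indicator(
    name="Example malicious domain",
    pattern="[domain-name:value = 'malicious.example.com']",
    pattern_type="stix",
    valid_from=datetime.now(timezone.utc),
)

print(indicator.serialize(pretty=True))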
A minimal pipeline looks like this:
- Connect to a TAXII collection.
- Pull recent indicators.
- Extract observable values and types.
- Match against local telemetry.
Python setup
Install the basic clients. The taxii2-client library handles the API, and stix2 parses bundles.
pip install taxii2-client stix2
Fetch indicators
The script below connects to a TAXII server, pulls recent indicators, and stores them in a local JSON file. Replace the discovery URL with your preferred feed.
from taxii2client.v20 import Server
from datetime import datetime, timedelta, timezone
import json

DISCOVERY_URL = "https://example-taxii-server.com/taxii/"

# Discover the server's API roots and take the first collection.
server = Server(DISCOVERY_URL)
api_root = server.api_roots[0]
collection = api_root.collections[0]

# Only pull objects added in the last seven days.
since = (datetime.now(timezone.utc) - timedelta(days=7)).isoformat().replace("+00:00", "Z")
objects = collection.get_objects(added_after=since).get("objects", [])

# Keep only indicator objects and persist them for the next stage.
indicators = [obj for obj in objects if obj.get("type") == "indicator"]

with open("indicators.json", "w") as f:
    json.dump(indicators, f)

print(f"Fetched {len(indicators)} indicators")
Parse indicator patterns
STIX patterns can be complex, but you can start with simple extraction for IPs, domains, and URLs. A basic parser can be built with regex for common patterns, or you can use a STIX pattern parser if you need full fidelity.
Here is a simple extraction approach:
import re
import json

# Matches simple single-comparison STIX patterns like:
#   [domain-name:value = 'evil.example.com']
pattern_rx = re.compile(r"\[(?P<object>[^:]+):(?P<field>[^ ]+) = '(?P<value>[^']+)'\]")

with open("indicators.json") as f:
    indicators = json.load(f)

observables = []
for ind in indicators:
    m = pattern_rx.search(ind.get("pattern", ""))
    if not m:
        # Skip compound or unusual patterns this simple regex cannot handle.
        continue
    observables.append({
        "type": m.group("object"),
        "field": m.group("field"),
        "value": m.group("value"),
    })

with open("observables.json", "w") as f:
    json.dump(observables, f)
Match against logs
Assume you have Zeek DNS logs in JSON. You can load the observables and compare them against query names.
import json

with open("observables.json") as f:
    obs = json.load(f)

# Build a set of domain indicators for fast membership checks.
bad_domains = {o["value"] for o in obs if o["type"] == "domain-name"}

hits = []
# Zeek can write dns.log as JSON, one record per line.
with open("/opt/zeek/logs/current/dns.log") as f:
    for line in f:
        rec = json.loads(line)
        q = rec.get("query")
        if q and q in bad_domains:
            hits.append({"query": q, "src": rec.get("id.orig_h"), "ts": rec.get("ts")})

print("Hits:", len(hits))
In a SIEM, you can push these indicators into a lookup table and join them at query time, which is more scalable.
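As a rough sketch, assuming your SIEM can ingest a CSV lookup (the file name and column names here are arbitrary), you could export the observables like this:
import csv
import json

# Convert observables.json into a CSV lookup a SIEM can join against at query time.
with open("observables.json") as f:
    obs = json.load(f)

with open("ti_lookup.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["type", "field", "value"])
    writer.writeheader()
    writer.writerows(obs)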
Indicator lifecycle and expiry
Threat intel is perishable. Many indicators go stale quickly, especially IP addresses and dynamic domains. Store a first_seen and last_seen timestamp for each indicator and expire them after a fixed window unless the feed refreshes them. This prevents your enrichment from being dominated by outdated data.
If your feed provides confidence or TTL values, use them to drive expiry. A low confidence indicator should not have a long life in your system. This simple policy reduces false positives and keeps your enrichment focused.
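Here is a minimal sketch of that expiry policy, assuming a hypothetical local store.json that maps each indicator value to first_seen and last_seen timestamps written with datetime.isoformat(), and a fixed 14-day window:
import json
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=14)  # expire indicators not refreshed within this window
now = datetime.now(timezone.utc)

# store.json maps indicator value -> {"first_seen": ..., "last_seen": ...},
# with timestamps written as timezone-aware ISO 8601 strings.
with open("store.json") as f:
    store = json.load(f)

fresh = {
    value: meta
    for value, meta in store.items()
    if now - datetime.fromisoformat(meta["last_seen"]) <= MAX_AGE
}

with open("store.json", "w") as f:
    json.dump(fresh, f)

print(f"Kept {len(fresh)} of {len(store)} indicators")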
Matching strategies and pitfalls
Exact matching is easy but brittle. For domains, consider normalizing to lowercase and stripping trailing dots. For URLs, consider parsing the hostname and path separately, because indicators may only specify a domain or a path prefix.
Be careful with wildcards. A pattern like *.example.com can match legitimate traffic if the domain is shared. In a lab, test wildcard behavior by generating both benign and malicious-looking subdomains and verify that you do not create excessive false positives.
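A sketch of that normalization plus cautious wildcard handling; the helper names are mine, and the suffix logic is something you would tune against your own traffic:
def normalize_domain(domain: str) -> str:
    """Lowercase and strip the trailing dot that DNS logs often include."""
    return domain.strip().lower().rstrip(".")


def matches_indicator(query: str, indicator: str) -> bool:
    """Exact match, or suffix match for wildcard indicators like *.example.com."""
    q = normalize_domain(query)
    ind = normalize_domain(indicator)
    if ind.startswith("*."):
        # Suffix match: shared domains (CDNs, dynamic DNS) can make this noisy.
        return q == ind[2:] or q.endswith("." + ind[2:])
    return q == ind


print(matches_indicator("Evil.Example.COM.", "*.example.com"))    # True
print(matches_indicator("example.com.evil.net", "*.example.com"))  # False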
Scoring and prioritization
Not all matches are equal. Combine indicator confidence with local context. For example, a match on a rare domain from a high-value host should score higher than a match from a lab VM. You can implement a simple scoring system: base score from the indicator, plus points for host criticality, plus points for unusual ports or user agents.
Even a rough score helps analysts triage. It is better to see five high-confidence matches than fifty low-value hits.
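One way to sketch such a score; the weights and the context fields (host_criticality, port) are illustrative, not a standard:
def score_match(indicator: dict, context: dict) -> int:
    """Rough triage score: feed confidence plus local context."""
    score = indicator.get("confidence", 50)        # base score from the feed, if provided
    if context.get("host_criticality") == "high":  # e.g. domain controller, finance server
        score += 25
    if context.get("port") not in (53, 80, 443):   # unusual destination port
        score += 10
    return score


# Example: a medium-confidence indicator hitting a high-value host on an odd port.
print(score_match({"confidence": 60}, {"host_criticality": "high", "port": 8081}))  # 95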
Caching and rate limits
Many TAXII servers enforce rate limits. Cache responses locally and only request deltas. This reduces load on the feed and keeps your pipeline reliable. If you poll every hour, store the last seen timestamp and use added_after so you only fetch new objects.
In a lab, a daily pull is usually enough. This also helps you reason about which indicators are fresh and which are stale.
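A small sketch of that delta polling, reusing the same taxii2-client objects as the fetch script and a hypothetical last_poll.json state file:
import json
from datetime import datetime, timezone
from pathlib import Path
from taxii2client.v20 import Server

STATE_FILE = Path("last_poll.json")

# Reconnect as in the fetch script.
server = Server("https://example-taxii-server.com/taxii/")
collection = server.api_roots[0].collections[0]

# On the very first run, fall back to a fixed start date.
if STATE_FILE.exists():
    added_after = json.loads(STATE_FILE.read_text())["added_after"]
else:
    added_after = "2024-01-01T00:00:00Z"

objects = collection.get_objects(added_after=added_after).get("objects", [])

# Record this poll time so the next run only asks for objects added since then.
now = datetime.now(timezone.utc).isoformat().replace("+00:00", "Z")
STATE_FILE.write_text(json.dumps({"added_after": now}))

print(f"Pulled {len(objects)} new objects")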
Feed access and OPSEC
Some feeds require authentication or API tokens. Treat these as secrets and store them outside of source control. If you run your lab in containers, use environment variables or a local secrets file with restricted permissions.
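For example, assuming the feed uses basic auth, you can keep the credentials in environment variables (the variable names here are arbitrary) and pass them to the client:
import os
from taxii2client.v20 import Server

# Read credentials from the environment instead of hardcoding them in the script.
# Export them first, e.g. in your shell profile or container config:
#   export TAXII_USER=labuser
#   export TAXII_PASS=change-me
user = os.environ["TAXII_USER"]
password = os.environ["TAXII_PASS"]

server = Server("https://example-taxii-server.com/taxii/", user=user, password=password)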
Also remember that pulling from some feeds can signal your interest. If you are studying specific threats, consider using a mirror or a local cache to avoid repeated queries that might appear unusual.
Lab validation
For a lab demo, create a fake indicator that matches a domain you control and then generate a DNS query. Confirm the pipeline flags it. This helps you validate the end-to-end flow without relying on real threat intel.
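A sketch of that test: append a fake indicator (the id and domain below are made up) to indicators.json in the same shape the fetch script stores, rerun the parse and match scripts, then resolve the domain from a lab host.
import json

# A minimal fake indicator in the shape the pipeline expects.
# Real STIX indicators carry more required fields (id, created, modified, valid_from).
fake = {
    "type": "indicator",
    "id": "indicator--11111111-1111-4111-8111-111111111111",
    "pattern": "[domain-name:value = 'ti-test.lab.example']",
}

with open("indicators.json") as f:
    indicators = json.load(f)

indicators.append(fake)

with open("indicators.json", "w") as f:
    json.dump(indicators, f)

# Then rerun the parse and match scripts and, from a lab host, run:
#   dig ti-test.lab.example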
Lab checklist
Use this to validate your enrichment pipeline:
- Confirm TAXII access works and only new indicators are fetched.
- Normalize indicator values and verify case handling for domains.
- Run a test lookup against Zeek logs and confirm a match.
- Expire old indicators and verify they no longer match.
Operational tips
- Store indicators with timestamps and TTLs so you can expire them.
- Tag the source feed and confidence for each indicator.
- Avoid blocking on every indicator. Use them for enrichment and triage first.
- Keep a small allowlist for known false positives like CDN domains.
Takeaways
STIX and TAXII are not just enterprise toys. With a few Python scripts, you can build an enrichment pipeline that adds real context to your lab telemetry. Start simple, focus on indicators that match your log sources, and scale only when the data quality is consistent.