
Log Clustering and Triage with LMStudio, Python, and SQLite

When logs are noisy, clustering is more useful than reading line by line. A small clustering pipeline can group similar messages, let you triage the rare clusters, and then use a local LLM to summarize each cluster. LMStudio makes it easy to keep the model local, and SQLite gives you a lightweight store that works well in a home lab.

This post builds a simple pipeline: parse logs, cluster with TF-IDF + KMeans, store clusters in SQLite, and use LMStudio to summarize each cluster for quick review.

Parse logs into structured rows

Start with a log file and normalize it. Even a simple parse that splits timestamps and messages is enough for clustering.

import re
from pathlib import Path

def parse_syslog(line):
    # Classic syslog prefix: "Mon DD HH:MM:SS host message"
    m = re.match(r"^([A-Z][a-z]{2} +\d+ \d+:\d+:\d+) (\S+) (.+)$", line)
    if not m:
        return None
    return {"ts": m.group(1), "host": m.group(2), "msg": m.group(3)}

rows = []
# errors="ignore" keeps a single bad byte from aborting the whole parse
for line in Path("/var/log/syslog").read_text(errors="ignore").splitlines():
    rec = parse_syslog(line)
    if rec:
        rows.append(rec)

You can replace this with Zeek or Suricata logs by extracting the message field you want to cluster.

Cluster messages with TF-IDF

Use scikit-learn to vectorize the messages and cluster them. KMeans is fine for a lab; you can swap it for HDBSCAN if you want automatic cluster counts.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

texts = [r["msg"] for r in rows]
# Word unigrams and bigrams capture most of the structure in log messages
vec = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X = vec.fit_transform(texts)

k = 12  # a starting point; tune after inspecting the clusters
model = KMeans(n_clusters=k, random_state=42, n_init=10)
labels = model.fit_predict(X)

for r, label in zip(rows, labels):
    r["cluster"] = int(label)

Store in SQLite

SQLite is perfect for small labs and makes it easy to query clusters over time.

import sqlite3

conn = sqlite3.connect("logclusters.db")
conn.execute("CREATE TABLE IF NOT EXISTS logs (ts TEXT, host TEXT, msg TEXT, cluster INT)")
# The :named placeholders map directly to the dict keys in each row
conn.executemany("INSERT INTO logs VALUES (:ts, :host, :msg, :cluster)", rows)
conn.commit()

Now you can query clusters that are rare or newly appearing.

SELECT cluster, COUNT(*) as cnt
FROM logs
GROUP BY cluster
ORDER BY cnt ASC
LIMIT 5;
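
That surfaces the rarest clusters. Catching newly appearing clusters takes a sortable timestamp; the syslog prefix alone does not sort correctly as text, so one option is to add an ISO-8601 ingestion column. A sketch, assuming a hypothetical ingested_at column in the logs table:

-- ingested_at is an assumed ISO-8601 column, not part of the schema above
SELECT cluster, MIN(ingested_at) AS first_seen, COUNT(*) AS cnt
FROM logs
GROUP BY cluster
HAVING MIN(ingested_at) >= datetime('now', '-1 day')
ORDER BY first_seen DESC;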

Summarize clusters with LMStudio

Take a sample of messages from a cluster and feed them to LMStudio for summarization. Keep the prompt strict and structured.

import requests

# LMStudio's local server exposes an OpenAI-compatible chat completions API
API_URL = "http://127.0.0.1:1234/v1/chat/completions"

prompt = """You are a SOC analyst. Summarize this cluster in 5 bullets.
Include: likely cause, affected services, and whether it looks suspicious.
Return JSON with fields: summary, suspicion, recommended_action."""

def summarize_cluster(cluster_id):
    # Cap the sample so the prompt stays within the model's context window
    sample = [r["msg"] for r in rows if r["cluster"] == cluster_id][:80]
    messages = [
        {"role": "system", "content": prompt},
        {"role": "user", "content": "\n".join(sample)},
    ]
    payload = {"model": "local-model", "messages": messages, "temperature": 0.2, "max_tokens": 400}
    resp = requests.post(API_URL, json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

Store these summaries in a separate SQLite table so you can review them daily.
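
One possible shape for that table, reusing conn and summarize_cluster from above; the schema is a sketch, not a fixed design:

# Assumed schema: one summary row per cluster per run
conn.execute("CREATE TABLE IF NOT EXISTS summaries (cluster INT, created_at TEXT, summary TEXT)")
for cluster_id in sorted(set(labels)):
    conn.execute(
        "INSERT INTO summaries VALUES (?, datetime('now'), ?)",
        (int(cluster_id), summarize_cluster(int(cluster_id))),
    )
conn.commit()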

Feature engineering tips

Clustering quality depends on the features. For logs, remove high-cardinality tokens like UUIDs, PIDs, and timestamps. These tokens create artificial differences between messages that are actually the same event. You can pre-process with regex to replace them with placeholders like <UUID> or <PID>.
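
A minimal normalizer along those lines; the patterns below are assumptions to tune against your own log formats:

import re

# Replace high-cardinality tokens with stable placeholders before TF-IDF.
# These patterns are examples; extend them for your own logs.
PATTERNS = [
    (re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", re.I), "<UUID>"),
    (re.compile(r"\[\d+\]"), "[<PID>]"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]

def normalize(msg):
    for pattern, placeholder in PATTERNS:
        msg = pattern.sub(placeholder, msg)
    return msg

texts = [normalize(r["msg"]) for r in rows]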

Also consider splitting by log source before clustering. System logs, authentication logs, and application logs have different vocabularies. Clustering them together reduces quality. Separate them, cluster within each group, and then compare results.

Cluster drift and retraining

Logs change over time. Software updates introduce new message formats, which can break old clusters. Re-run clustering regularly and compare cluster centroids across days. If a cluster suddenly changes, that is either an update or a new behavior worth investigating.

In a lab, you can schedule clustering nightly and store the model with a timestamp. That gives you a small audit trail and lets you reproduce a cluster when you need to investigate later.
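
A sketch of that snapshot, assuming the joblib package is available; the file naming is just one convention:

from datetime import datetime, timezone
import joblib

# Persist the vectorizer and model together so old clusters stay reproducible
stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
joblib.dump({"vectorizer": vec, "kmeans": model}, f"clusters-{stamp}.joblib")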

SQLite indexing for speed

If you store a lot of logs, add indexes on cluster, ts, and host. This makes it fast to query rare clusters or filter by time window.

CREATE INDEX IF NOT EXISTS idx_logs_cluster ON logs(cluster);
CREATE INDEX IF NOT EXISTS idx_logs_ts ON logs(ts);
CREATE INDEX IF NOT EXISTS idx_logs_host ON logs(host);

SQLite is fast enough for small labs, but indexes make it feel instant.

Embeddings for semantic clustering

TF-IDF works well for structured logs, but it struggles with paraphrased or semi-structured messages. If you want more semantic grouping, generate embeddings for each message and cluster those vectors. You can run a small embedding model locally, such as a sentence transformer.

The pipeline is similar: compute embeddings, store them, then cluster with HDBSCAN or KMeans. This is heavier on CPU, but it tends to group related log messages even if the wording changes.
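
A sketch of that variant, assuming the sentence-transformers and hdbscan packages are installed; all-MiniLM-L6-v2 is one common small encoder, not a requirement:

from sentence_transformers import SentenceTransformer
import hdbscan

# Encode each message into a dense vector; small models run fine on CPU
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(texts)

# HDBSCAN chooses the number of clusters itself; label -1 means noise
clusterer = hdbscan.HDBSCAN(min_cluster_size=10)
labels = clusterer.fit_predict(embeddings)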

Evaluating cluster quality

Clustering is unsupervised, so you need a sanity check. Sample a few messages from each cluster and verify that they are actually similar. If a cluster looks mixed, increase k or adjust the vectorization parameters.

In a lab, keep a small set of labeled events and see if they land in the same cluster. This gives you a rough precision check without building a full training dataset.
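
A rough version of that check; the labeled messages here are hypothetical stand-ins for events you already know from your own lab:

# Messages that describe the same known event should share a cluster id
labeled_events = {
    "ssh brute force": [
        "Failed password for root from 10.0.0.5 port 52211 ssh2",
        "Failed password for admin from 10.0.0.9 port 40112 ssh2",
    ],
}

msg_to_cluster = {r["msg"]: r["cluster"] for r in rows}
for event, msgs in labeled_events.items():
    found = {msg_to_cluster[m] for m in msgs if m in msg_to_cluster}
    print(event, "->", found)  # a single cluster id is a good sign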

Prompt injection and trust boundaries

If you summarize clusters with an LLM, treat the logs as untrusted input. A crafted log entry could include instructions that try to override your prompt. Keep the system prompt strict, avoid tool execution, and strip or escape user-controlled fields when possible.
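
One light-touch way to do that stripping before messages reach the prompt; what to remove is a judgment call, and this set is deliberately conservative:

import re

def sanitize(msg, max_len=500):
    # Bound each line's contribution to the prompt
    msg = msg[:max_len]
    # Drop control characters that could hide content from review
    return re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", "", msg)

# Inside summarize_cluster, before joining the sample:
# sample = [sanitize(m) for m in sample]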

For sensitive workflows, consider generating summaries locally and never sending them to an external service. LMStudio makes this easy, and it keeps your lab data private.

Practical triage workflow

  1. Run clustering on the last 24 hours of logs.
  2. Sort clusters by size and rarity.
  3. Summarize the smallest clusters first.
  4. Investigate any cluster that mentions auth failures, unusual ports, or binary execution.

This workflow turns a giant log file into a short list of items you can review in minutes.
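
A small driver along these lines ties the steps together; the limit of five is arbitrary:

# Summarize the five rarest clusters from the last run
rare = conn.execute(
    "SELECT cluster, COUNT(*) AS cnt FROM logs GROUP BY cluster "
    "ORDER BY cnt ASC LIMIT 5"
).fetchall()

for cluster_id, cnt in rare:
    print(f"cluster {cluster_id} ({cnt} messages)")
    print(summarize_cluster(cluster_id))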

Lab checklist

Use this checklist to keep the pipeline healthy:

  • Strip high-cardinality tokens before vectorization.
  • Rebuild clusters after major software updates.
  • Store cluster summaries alongside raw samples.
  • Review the smallest clusters daily for new anomalies.

Takeaways

Clustering plus local LLM summaries is a practical way to triage logs in a home lab. The clustering step removes repetition, and the LLM gives you a human-readable synopsis. Keep the model local, keep the prompts strict, and use SQLite as a lightweight backbone for your analysis.
