An anomaly detector for Zeek logs. It supports both classic Zeek TSV logs and line-delimited Zeek JSON logs, can process a single log or a directory of logs, and applies different anomaly detection strategies depending on the log type.
For a full architecture and methodology walkthrough, open docs/tool-explainer.html in a browser. That page explains the end-to-end data flow, feature engineering, models, directory scoring, baseline training, outputs, and design rationale in one place.
This is no longer a conn.log-only PCA script. The current implementation does all of the following:
- Reads `conn.log`, `http.log`, `files.log`, `ssh.log`, `weird.log`, `notice.log`, `known_services.log`, `known_hosts.log`, `software.log`, `arp.log`, `stats.log`, `capture_loss.log`, `packet_filter.log`, and any other Zeek logs that match the supported schemas.
- Auto-detects input format as Zeek TSV or Zeek JSON.
- Processes a full directory of `.log` files with one command.
- Builds shared context across logs using `uid` and `fuid`.
- Chooses a detector per log type instead of forcing the same model on every schema.
- Computes a directory-level maliciousness score at the end when running on a directory.
Different Zeek logs represent different kinds of evidence:
- `conn.log` is flow-oriented and benefits from multivariate anomaly detection.
- `http.log` and `files.log` contain application and content metadata that are useful for spotting scans, abuse, and unusual transfers.
- `ssh.log` is sparse but still useful when you model client/server banner rarity and cross-log context.
- `weird.log` and `notice.log` are already event-like and are better handled with rarity and prioritization than with a generic PCA model.
- `known_hosts.log`, `known_services.log`, and `software.log` are inventory/state logs. They are useful for novelty detection, not for classic flow outlier detection.
- `stats.log` and `capture_loss.log` are time-series/system telemetry and should be scored as deviations over time, not as independent flow records.
Using one global model across all of them produces poor results and unstable behavior. The current implementation uses the structure of each Zeek log type instead.
The tool accepts either:
- A single file with `-f`
- A directory with `-d`
For every file, it:
- Detects Zeek TSV or line-delimited JSON automatically
- Loads the log into a Pandas dataframe
- Keeps the original Zeek fields for display
- Builds detector-specific numeric features separately from the raw log fields
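The format auto-detection step can be sketched as follows. This is a minimal illustration of the idea (the function name and exact logic are assumptions, not the tool's actual API): Zeek TSV logs begin with `#`-prefixed header lines, while Zeek JSON logs carry one JSON object per line.

```python
import json

def sniff_zeek_format(first_line: str) -> str:
    """Guess whether a Zeek log is TSV or line-delimited JSON from its first line."""
    line = first_line.strip()
    if line.startswith("#"):      # Zeek TSV logs open with '#separator', '#fields', ...
        return "tsv"
    try:
        if isinstance(json.loads(line), dict):   # one JSON object per line
            return "json"
    except json.JSONDecodeError:
        pass
    return "tsv"                  # default to TSV for anything else
```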
The detector does not score each log in isolation only. It first loads all logs in the directory and builds shared context.
`uid` is the main Zeek transaction identifier used to tie together related activity across logs such as:
- `conn.log`
- `http.log`
- `files.log`
- `ssh.log`
- `weird.log`
The tool aggregates per-uid context such as:
- Number of related records in each log type
- Related connection bytes, packets, duration, and state rarity from `conn.log`
- Related HTTP count, body sizes, and status rarity from `http.log`
- Related file count, total bytes, and MIME rarity from `files.log`
- Related weird-event count and weird-name rarity from `weird.log`
- Related SSH counts and auth attempts from `ssh.log`
These aggregated values are then injected back into the per-log feature vectors.
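A sketch of this injection pattern with pandas, using toy frames whose column names follow Zeek's schemas (the tool's actual aggregation code is not shown here):

```python
import pandas as pd

# Toy log fragments; column names follow Zeek's conn.log / weird.log schemas.
conn = pd.DataFrame({
    "uid": ["C1", "C2"],
    "orig_bytes": [100, 5_000_000],
    "duration": [0.5, 120.0],
})
weird = pd.DataFrame({"uid": ["C2", "C2"],
                      "name": ["bad_TCP_checksum", "bad_TCP_checksum"]})

# Aggregate per-uid context from weird.log ...
weird_ctx = weird.groupby("uid").size().rename("weird_count").reset_index()

# ... and inject it back into the conn.log feature frame.
features = conn.merge(weird_ctx, on="uid", how="left").fillna({"weird_count": 0})
```

Rows whose `uid` never appears in `weird.log` simply get a zero count, so the enrichment never drops records.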
This matters because a record that looks only mildly unusual in one log can become much more suspicious if:
- The same `uid` also triggered weird events
- The same `uid` downloaded a rare file
- The same `uid` had an abnormal HTTP status pattern
- The same `uid` is tied to a high-byte or unusual `conn.log` flow
`fuid` is the Zeek file identifier used to tie file activity across logs. The current implementation uses it mainly to enrich `http.log` with file context from `files.log`, including:
- Linked file count
- Linked file total bytes
- Linked MIME rarity
- Linked file source rarity
This is useful when an HTTP request is suspicious because of what it delivered, not just because of the request metadata itself.
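The `fuid` linkage can be sketched as an explode-and-merge over the HTTP response file IDs. This covers a simplified subset of the listed features; the column names are assumed from Zeek's schemas, and the tool's actual code may differ:

```python
import pandas as pd

http = pd.DataFrame({
    "uid": ["C1", "C2"],
    "resp_fuids": [["F1"], []],    # file IDs linked to each HTTP response
})
files = pd.DataFrame({
    "fuid": ["F1"],
    "total_bytes": [900_000],
    "mime_type": ["application/x-dosexec"],
})

# Explode the fuid lists so each HTTP row can join its linked files,
# then aggregate the linked-file context back to one row per uid.
linked = (
    http.explode("resp_fuids")
        .merge(files, left_on="resp_fuids", right_on="fuid", how="left")
        .groupby("uid")
        .agg(linked_file_count=("fuid", "count"),
             linked_file_bytes=("total_bytes", "sum"))
)
```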
The implementation uses three detector families:
- `IsolationForest` for rich multivariate logs
- Rarity scoring for event/inventory logs
- Time-series deviation scoring for telemetry logs
If `scikit-learn` is not installed, the `IsolationForest` path falls back to a standardized distance score instead of crashing.
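One plausible shape for such a fallback (an assumption about what "standardized distance" means here, not the script's exact code) is Euclidean distance from the column means in z-score space:

```python
import numpy as np

def standardized_distance(X: np.ndarray) -> np.ndarray:
    """Fallback score: Euclidean distance from the mean in z-score space."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0        # guard constant columns against division by zero
    z = (X - mu) / sigma
    return np.sqrt((z ** 2).sum(axis=1))

X = np.array([[1.0, 10.0], [1.1, 11.0], [0.9, 9.0], [50.0, 500.0]])
scores = standardized_distance(X)  # the fourth row dominates the score
```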
When you run the tool on a Zeek directory with -d, it does not stop at printing per-file anomalies. After all files are processed, it computes a directory-level maliciousness score intended to help distinguish:
- A mostly normal Zeek directory that still contains a few odd records
- A malicious or attack-heavy directory where anomalies are broader, more correlated, and concentrated in the attack-relevant logs
This score is printed at the end as a separate Directory Summary.
Raw scores are not directly comparable:
- Different log types use different detectors
- Different detectors produce different score scales
- File size and feature spread affect score magnitude
- A benign directory can still have a few high local anomalies
So the directory score does not use raw totals directly.
The directory score combines the anomaly summary with a behavior profile learned from the whole directory.
The main components are:
`weighted_top`
- For each log, the tool computes the mean percentile rank of the top anomalous rows inside that log
- Those values are weighted by log importance
- Attack-relevant logs such as `conn`, `http`, `files`, `tls`, `weird`, and `notice` have higher weight than inventory logs such as `known_hosts`
`uid_correlation`
- Counts anomalous `uid` values that appear in two or more log types
- Gives extra weight to anomalous `uid` values seen in three or more log types
This is one of the most important signals, because coordinated activity across logs is more indicative of real malicious behavior than isolated anomalies.
`anomaly_fraction`
- Measures how much of each log is being flagged anomalous
- Uses a weighted, normalized anomaly fraction across files
This helps distinguish “a couple of odd rows” from “a large portion of relevant activity looks strange”.
`weird_notice`
- Adds weight when `weird.log` or `notice.log` also contain anomalous rows
This matters because these logs already represent unusual or alert-like behavior and often reinforce attack evidence.
`fuid_overlap`
- Adds weight when anomalous `http.log` transactions are linked through `fuid` to anomalous `files.log` records
This is useful for suspicious content delivery, payload transfer, and file-backed HTTP anomalies.
`behavior_score`
- Builds a source-level behavior profile from `conn.log`
- Measures broad scan-like activity such as:
  - high destination-port fanout
  - large per-destination port sweeps
  - high failed-connection fraction
  - short, zero-payload, service-missing connection patterns
- Keeps the strongest behavioral outliers and prints them in the Directory Summary
This is the part that lets the tool detect broad campaigns like simple Nmap scans even when they do not generate much uid overlap in higher-level logs.
The current implementation uses this weighted combination:
```
core_score = 100 * (
    0.35 * weighted_top +
    0.25 * uid_correlation +
    0.20 * anomaly_fraction +
    0.15 * weird_notice +
    0.05 * fuid_overlap
)
directory_score = min(100, core_score + 45 * behavior_score)
```
The final result is shown on a 0-100 scale and labeled as:
`LOW`, `MEDIUM`, or `HIGH`
These labels are intended for triage, not as a calibrated probability of compromise.
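The weighting above translates directly into a small function. The `LOW`/`MEDIUM`/`HIGH` cutoffs are not reproduced here because their values are not stated; only the documented combination formula is shown:

```python
def directory_score(weighted_top, uid_correlation, anomaly_fraction,
                    weird_notice, fuid_overlap, behavior_score):
    """Combine normalized [0, 1] components into the 0-100 directory score."""
    core_score = 100 * (0.35 * weighted_top +
                        0.25 * uid_correlation +
                        0.20 * anomaly_fraction +
                        0.15 * weird_notice +
                        0.05 * fuid_overlap)
    # behavior_score can push the result up, capped at 100.
    return min(100.0, core_score + 45 * behavior_score)
```

Note that `behavior_score` alone can contribute at most 45 points, which matches its role as a booster for scan-like campaigns rather than the primary evidence.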
You can train thresholds from known-normal Zeek directories by passing one or more `--normal-dir` (`-N`) values during a directory run.
Example with one normal directory:
```
python3 zeek-anomaly-detector.py \
  -d /path/to/suspect/zeek \
  -N /path/to/known-normal/zeek
```

Example with multiple normal directories:
```
python3 zeek-anomaly-detector.py \
  -d /path/to/suspect/zeek \
  -N /path/to/normal1 \
  -N /path/to/normal2 \
  -N /path/to/normal3
```

If you only want a single final line for the directory score, use `--summary-line`:
```
python3 zeek-anomaly-detector.py \
  -d /path/to/suspect/zeek \
  --summary-line
```

With a normal baseline, the same one-line output also includes the baseline verdict:
```
python3 zeek-anomaly-detector.py \
  -d /path/to/suspect/zeek \
  -N /path/to/normal1 \
  -N /path/to/normal2 \
  --summary-line
```

The best approach is not to learn a hard threshold from a single raw anomaly score. Normal Zeek directories vary naturally because of:
- Different traffic volumes
- Different protocol mix
- Different scanning and discovery noise
- Different host inventories
- Different capture durations
So the tool learns thresholds from the directory-summary components instead of per-row raw scores.
It computes the normal baseline on:
- `score`
- `weighted_top`
- `weighted_fraction`
- `uid_corr_score`
- `weird_notice_bonus`
- `fuid_bonus`
- cross-log overlap counts
When multiple normal directories are provided, the threshold for each metric is learned with robust statistics:
- median
- MAD-based upper bound
When only one or two normal directories are provided, the tool falls back to a conservative margin above the observed normal values.
This is not as strong as training on many normal directories, but it is still better than using one global fixed threshold.
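The robust-threshold idea can be sketched as median plus a scaled-MAD margin. The multiplier `k` and the example scores are illustrative assumptions, not the tool's actual values:

```python
import numpy as np

def mad_upper_bound(values, k: float = 3.0) -> float:
    """Robust upper threshold: median + k * scaled MAD.

    The 1.4826 factor makes the MAD comparable to a standard deviation
    under normally distributed data; k is an illustrative margin.
    """
    v = np.asarray(values, dtype=float)
    med = np.median(v)
    mad = np.median(np.abs(v - med))
    return float(med + k * 1.4826 * mad)

# Directory scores observed on known-normal Zeek directories:
threshold = mad_upper_bound([12.0, 15.0, 14.0, 13.0, 16.0])
```

Because median and MAD ignore extreme values, one unusually noisy normal directory does not inflate the learned threshold the way a mean/standard-deviation bound would.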
When `--normal-dir` is used, the final output includes a Baseline Comparison section that says whether the current directory is:
- `WITHIN NORMAL BASELINE`
- `SUSPICIOUS VS BASELINE`
- `ABOVE NORMAL BASELINE`
It also prints which summary metrics exceeded the learned normal thresholds.
When `--summary-line` is used, the normal terminal output is suppressed and replaced by one final tab-separated line with:
- Input path
- Final directory score, colorized in terminals that support ANSI colors
- Severity, colorized in terminals that support ANSI colors
- Baseline verdict, if `--normal-dir` was used, also colorized in ANSI-capable terminals
- Number of normal directories used for the baseline, if any
Technique: IsolationForest
Why:
- `conn.log` is the closest thing to classic flow anomaly detection.
- Attacks often appear as unusual combinations of bytes, packets, ports, service, connection state, and duration.
- Multivariate detection is more appropriate than per-feature thresholding.
Main features include:
- Destination port
- Duration
- Total bytes
- Total packets
- Originator/responder byte ratio
- Originator/responder packet ratio
- Bytes per second
- Bytes per packet
- Port rarity
- Service rarity
- Connection-state rarity
- History rarity
- Destination-host popularity
- Related `uid` context from HTTP, files, SSH, and weird logs
This is the best log for finding scan activity, strange connection fan-out, failed probes, weird size ratios, or traffic that does not match the rest of the environment.
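Several of the listed features are simple derivations over raw `conn.log` columns. The sketch below shows one plausible encoding (rarity as one minus a value's relative frequency); the tool's exact feature construction may differ:

```python
import numpy as np
import pandas as pd

# Toy conn.log rows; column names follow Zeek's conn.log schema.
conn = pd.DataFrame({
    "id.resp_p": [443, 443, 31337],
    "duration": [2.0, 1.5, 0.01],
    "orig_bytes": [1_000, 800, 0],
    "resp_bytes": [9_000, 7_000, 0],
    "conn_state": ["SF", "SF", "S0"],
})

feats = pd.DataFrame(index=conn.index)
feats["total_bytes"] = conn["orig_bytes"] + conn["resp_bytes"]
feats["bytes_per_sec"] = feats["total_bytes"] / conn["duration"].replace(0, np.nan)
# Rarity: one minus the value's relative frequency within this log.
feats["port_rarity"] = 1 - conn["id.resp_p"].map(
    conn["id.resp_p"].value_counts(normalize=True))
feats["state_rarity"] = 1 - conn["conn_state"].map(
    conn["conn_state"].value_counts(normalize=True))
```

The rare port `31337` and the failed `S0` state both end up with higher rarity values than the common `443`/`SF` rows, which is exactly the signal the multivariate detector consumes.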
Technique: IsolationForest
Why:
- Malicious HTTP behavior is usually a combination of method, URI, status, body sizes, host rarity, and user-agent weirdness.
- Single-value thresholds are weak here.
- Cross-log correlation matters because the delivered file can be more suspicious than the HTTP line itself.
Main features include:
- Destination port
- Transaction depth
- Request and response body length
- Status code
- URI length
- Host length
- User-agent length
- Method rarity
- Status rarity
- Host rarity
- URI rarity
- User-agent rarity
- Count of linked response and originator file IDs
- Linked file counts, linked file bytes, and linked file MIME rarity through `fuid`
- Related `uid` connection and weird-event context
This helps surface scanning, unusual methods, suspicious paths, odd user agents, and HTTP transactions associated with rare or suspicious files.
Technique: DNS-specific hybrid score
Why:
- DNS abuse often shows up as lexical anomalies, response-pattern anomalies, or repeated bursts of algorithmic-looking domains from the same source host.
- DGA traffic is rarely visible from a single field only. It is usually a combination of domain randomness, TLD choice, no-answer behavior, and repeated source-side querying patterns.
- Generic outlier detection tends to over-rank benign mDNS and reverse-lookup traffic, so the DNS detector uses a custom score instead.
Main features include:
- Destination port
- Query length
- Label count
- First-label length
- Query entropy
- Unique-character ratio
- Vowel ratio
- Consonant ratio
- Digit ratio
- Query rarity
- TLD rarity
- Query-type rarity
- Response-code rarity
- Answer count
- TTL count
- No-answer flag
- Rejected flag
- `dga_like` lexical heuristic
- `dga_pattern_count` for repeated DGA-like patterns
- `src_dga_like_count` for repeated DGA-like queries from the same source host
- `is_mdns`
- `is_local_tld`
- `is_reverse_lookup`
- `is_service_discovery`
- Related conn/weird context by `uid`
The DNS detector explicitly tries to capture DGA-like behavior. It does not rely on a signature list. Instead, it uses lexical and repetition features such as:
- Long first labels
- High character entropy
- High unique-character ratio
- Low vowel ratio or noticeable digit presence
- Repeated queries of similarly structured random-looking domains from the same source host
This means domains such as:
- `kvcjsnsd.ru`
- `afajgvcnm.ru`
- `wtkfidatyhc.ru`
will not only look suspicious individually, but repeated appearances of the same DGA-like pattern from the same source host will increase the anomaly score further.
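A toy version of such a lexical heuristic is shown below. The thresholds and helper names are illustrative assumptions, not the tool's actual `dga_like` cutoffs:

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Character-level Shannon entropy in bits."""
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in Counter(s).values())

def dga_like(domain: str) -> bool:
    """Toy lexical check: long, high-entropy, vowel-poor first label.
    Thresholds are illustrative, not the tool's actual cutoffs."""
    first = domain.split(".")[0]
    if len(first) < 7:
        return False
    vowel_ratio = sum(ch in "aeiou" for ch in first) / len(first)
    return shannon_entropy(first) > 2.5 and vowel_ratio < 0.35

suspicious = [d for d in ["kvcjsnsd.ru", "afajgvcnm.ru", "google.com"]
              if dga_like(d)]
```

Even this crude check separates the random-looking labels above from ordinary dictionary-word domains, and the per-source repetition counters then amplify hosts that emit many such names.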
The detector also explicitly downweights benign local-resolution traffic such as:
- mDNS on port `5353`
- `.local` names
- `in-addr.arpa`
- `ip6.arpa`
- service-discovery names such as `_googlecast._tcp.local`
That is intentional, so DGA-like domains rank above local multicast noise.
Technique: IsolationForest
Why:
- File transfers are often suspicious because of size, MIME type, source, timeout behavior, or mismatch with related activity.
- File metadata is rich enough for multivariate outlier detection.
Main features include:
- Destination port
- Depth
- Duration
- Seen bytes
- Total bytes
- Missing bytes
- Overflow bytes
- `local_orig`
- `is_orig`
- `timedout`
- MIME rarity
- Source rarity
- Analyzer count
- Byte gap between seen and total
- Related HTTP, conn, and weird context by `uid`
This is useful for surfacing rare files, unusual transfer sizes, suspicious extracted content, and file transfers linked to strange HTTP sessions.
Technique: IsolationForest
Why:
- SSH logs are relatively sparse, but still useful for detecting unusual client banners, server banners, auth behavior, and correlation with suspicious connection context.
Main features include:
- Destination port
- Auth attempts
- Client string length
- Server string length
- Client rarity
- Server rarity
- Related connection, weird, and HTTP/file context by `uid`
This helps highlight scans, banner anomalies, and behavior linked to other suspicious events.
Technique: IsolationForest
Why:
- TLS metadata is usually best handled as multivariate fingerprint-style anomaly detection.
Main features include:
- Destination port
- TLS version
- Cipher count
- Server-name length
- JA3 rarity
- JA3S rarity
- SNI rarity
- Related connection and weird context by `uid`
Note: if your `tls.log` does not contain JA3, JA3S, or SNI-like fields, the detector will use whatever TLS metadata exists. If there is no `tls.log` in the directory, nothing special happens.
Technique: rarity scoring
Why:
- `weird.log` already records unusual protocol or parser behavior.
- The right question is not "is this vector an outlier?" but "how rare and how correlated is this weird event?"
Main features include:
- Destination port
- Notice flag
- Weird-name rarity
- Source-module rarity
- Peer rarity
- Related conn/http/files/ssh context by `uid`
This is useful for surfacing weird events that are both rare and tied to suspicious sessions.
Technique: rarity scoring
Why:
- `notice.log` is already a higher-level detection stream.
- It should be prioritized, not modeled like raw traffic.
Main features include:
- `n`
- `suppress_for`
- Notice-type rarity
- Source rarity
- Message length
This helps rank notices rather than replace Zeek’s own detection logic.
Technique: rarity scoring
Why:
- This log is inventory-like.
- It is useful for novelty detection such as unusual service/port exposure.
Main features include:
- Port number
- Service rarity
- Host rarity
- Transport rarity
This can surface unusual service exposure or drift in observed services.
Technique: rarity scoring
Why:
- This is host inventory, not flow telemetry.
- The meaningful signal is host novelty and timing irregularity.
Main features include:
- Host rarity
- Time-gap deviation between observations
This is useful for new host discovery, churn, or unusual host appearance timing.
Technique: rarity scoring
Why:
- This log describes discovered software and versions, which is mostly inventory.
- Rare software/version combinations are often more useful than geometric outlier detection.
Main features include:
- Host port
- Major/minor version
- Software-type rarity
- Product-name rarity
- Additional-version rarity
- Unparsed version length
This helps surface unusual software/version fingerprints.
Technique: rarity scoring
Why:
- ARP activity is short, structured, and often better handled with novelty-style scoring.
- Suspicion often comes from unusual request/reply patterns or MAC/IP rarity.
Main features include:
- Operation rarity
- Source-MAC rarity
- Destination-MAC rarity
- Broadcast-request flag
- Originator-IP rarity
- Responder-IP rarity
This is useful for flagging strange ARP activity, especially in lab or small networks.
Technique: time-series deviation scoring
Why:
- `stats.log` is telemetry about Zeek itself and overall traffic processing.
- These are time-evolving counters and gauges, not flow records.
- Raw counters by themselves are not enough. The more meaningful signal is in workload ratios, queue pressure, protocol mix, file-extraction intensity, and growth rates.
Main features now include operational ratios and rates such as:
- Memory
- Events queued
- Active connections
- Active files
- Active DNS requests
- Total reassembly size
- Bytes per packet
- Events per packet
- Queue-to-processed ratio
- Active-to-total connection ratio
- TCP, UDP, and ICMP share
- Files per connection
- Active files per connection
- DNS requests per UDP connection
- Active DNS pressure
- Reassembly per TCP connection
- Timer pressure
- Memory per packet
- Packet, byte, event, queue, connection, file, and DNS growth rates
- Memory delta
- Queue delta
- Connection-mix delta
The score is still time-series based, but it now operates on these derived operational features. That makes stats.log anomalies more meaningful in Zeek terms: queue buildup, workload-shape changes, abnormal protocol mix shifts, unusual file or DNS intensity, and abrupt processing-pressure changes.
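A few of these derived features can be sketched with pandas, using `pct_change` for the interval-to-interval growth rates. The column names are assumptions based on Zeek's `stats.log` schema, and the values are toy data:

```python
import pandas as pd

# Toy stats.log counters, one row per reporting interval.
stats = pd.DataFrame({
    "pkts_proc": [1_000, 1_200, 1_300, 9_000],
    "events_proc": [500, 600, 640, 700],
    "events_queued": [500, 610, 640, 4_000],
})

derived = pd.DataFrame(index=stats.index)
derived["events_per_packet"] = stats["events_proc"] / stats["pkts_proc"]
derived["queue_to_processed"] = stats["events_queued"] / stats["events_proc"]
# Growth rates between consecutive stats intervals.
derived["pkt_growth"] = stats["pkts_proc"].pct_change().fillna(0)
derived["queue_growth"] = stats["events_queued"].pct_change().fillna(0)
```

In this toy run, the last interval stands out on both the queue-pressure ratio and the growth rates even though each raw counter, taken alone, might look unremarkable.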
Technique: time-series deviation scoring
Why:
- Packet loss and capture gaps are time-dependent monitoring signals.
Main features include:
- `ts_delta`
- `gaps`
- `acks`
- `percent_lost`
Technique: rarity scoring
Why:
- This is configuration/state metadata, not traffic flow data.
Main features include:
- `init`
- `success`
- Filter rarity
- Node rarity
`loaded_scripts.log` is ignored completely.
Why:
- It reflects Zeek runtime configuration, not network behavior.
- In practice it tends to add noise to directory summaries and plots without helping attack detection.
- If you keep it in a Zeek directory, it is skipped before loading, so it does not affect anomalies, JSON output, plots, or the final directory score.
Default output is intentionally minimal:
- Only anomaly blocks are printed
- One block per log file that produced anomalies
- Each block is labeled with the file name
- Every printed anomaly row includes a numeric `score`
- In directory mode, a final `Directory Summary` is printed at the end
Verbose and debug output add:
- Detector name
- Used feature columns
- Feature samples in debug mode
Important: every detector produces a numeric score, but the meaning depends on the detector family:
- `IsolationForest`: higher score means the row is more isolated from the rest of that log's feature distribution
- Rarity scoring: higher score means the row contains rarer values or combinations in that log
- Time-series scoring: higher score means the row deviates more strongly from the time-series level and/or change pattern
These are ranking scores inside each log type, not calibrated probabilities, and they should not be compared numerically across different Zeek logs. A score from `conn.log` should not be compared directly to a score from `http.log`.
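For the rarity family, one common choice is negative log frequency of a value within its own log. This is a sketch of the idea, not the tool's exact formula:

```python
import numpy as np
import pandas as pd

def rarity_score(series: pd.Series) -> pd.Series:
    """Score each row by how rare its value is within this log
    (negative log of the value's relative frequency)."""
    freq = series.map(series.value_counts(normalize=True))
    return -np.log(freq)

# Nine common weird names and one rare one:
weird_names = pd.Series(["bad_TCP_checksum"] * 9 + ["unknown_protocol_99"])
scores = rarity_score(weird_names)
```

The single rare event gets a much larger score than the repeated common one, but both numbers only rank rows inside this one log; they say nothing relative to scores from another log type.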
At the end of a directory run, the tool prints:
- A severity label: `LOW`, `MEDIUM`, or `HIGH`
- A directory maliciousness score on a `0-100` scale
- The normalized component values used to build the score
- The number of anomalous `uid` values shared across multiple logs
- The number of anomalous HTTP/file `fuid` overlaps
- The top contributing logs and their weighted contribution
This final block is the best place to compare one Zeek directory against another. It is more reliable than summing raw row scores because it includes normalization and cross-log correlation.
Clone the repository:
```
git clone --recurse-submodules --remote-submodules https://github.com/stratosphereips/zeek_anomaly_detector
cd zeek_anomaly_detector
```

Install the dependencies:

```
pip install -r requirements.txt
pip install scikit-learn
```

Notes:
- `pandas` and `numpy` are required.
- `scikit-learn` is strongly recommended because `IsolationForest` is used for the richer multivariate logs.
- If `scikit-learn` is missing, the script falls back to a simpler distance-based score for those logs.
If you use Docker, make sure the image includes scikit-learn in addition to the Python dependencies.
Example:
```
docker run --rm -it \
  -v /full/path/to/logs:/logs \
  stratosphereips/zeek_anomaly_detector:latest \
  python3 zeek-anomaly-detector.py -d /logs
```

Run on one Zeek log:
```
python3 zeek-anomaly-detector.py -f dataset/001-zeek-scenario-malicious/conn.log
```

Show the top 20 anomalies:
```
python3 zeek-anomaly-detector.py -f dataset/001-zeek-scenario-malicious/conn.log -a 20
```

Run on a whole Zeek directory and score each log independently:
```
python3 zeek-anomaly-detector.py -d /path/to/zeek/logs
```

This is the recommended mode when you have multiple Zeek logs from the same capture, because the tool can build `uid` and `fuid` context across files before scoring.
Show detector names and feature columns:
```
python3 zeek-anomaly-detector.py -d /path/to/zeek/logs -v 1
```

Show feature samples too:
```
python3 zeek-anomaly-detector.py -d /path/to/zeek/logs -e 1
```

Dump enriched per-log dataframes to CSV:
```
python3 zeek-anomaly-detector.py -d /path/to/zeek/logs -D output_csvs
```

For a single output file:
```
python3 zeek-anomaly-detector.py -f dataset/001-zeek-scenario-malicious/conn.log -D conn.csv
```

Write a machine-readable summary of the run:
```
python3 zeek-anomaly-detector.py -d /path/to/zeek/logs -J summary.json
```

The JSON export includes:
- `input_path`
- `directory_summary`
- `files`
Each file entry contains:
- Log name
- Total rows
- Number and fraction of anomalous rows
- Top anomaly score statistics
- Detector method
- Feature columns used
- Related anomalous `uid` and `fuid` values
- Top anomalous rows as JSON records
This is the recommended output if you want to compare many Zeek directories programmatically or feed the results into another analysis stage.
Write a multi-page PDF with flow-by-flow score plots for each log plus a final summary page:
```
python3 zeek-anomaly-detector.py -d /path/to/zeek/logs -P scores.pdf
```

The PDF contains:
- One summary page with the final directory score and the main score components
- One combined flow-by-flow page across all log files
- One score plot per Zeek log file
If you also use `-N` or `--normal-dir`, the summary page overlays:
- Blue bars for the suspect directory
- A green line for the learned normal median of each directory-summary metric
- A red dashed line for the learned normal threshold of each metric
Each per-file plot shows:
- A blue line for the score of every flow or row, in file order
- Red markers for the rows flagged as anomalous
- An orange dashed cutoff line for the last displayed anomaly score
The combined page shows:
- All rows from all files on one shared timeline
- Within-file normalized score percentiles on the y-axis, so different log types can be compared fairly
- File boundaries and labels on the x-axis
- Red markers for anomalous rows across the whole run
This is useful when you want to see whether anomalies are isolated spikes, repeated bursts, or broad campaigns across a file.
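The within-file percentile normalization behind the shared y-axis can be reproduced with pandas' `rank(pct=True)` (a sketch of the normalization idea; the tool's actual plotting code may differ):

```python
import pandas as pd

conn_scores = pd.Series([0.1, 0.2, 0.9])    # IsolationForest-style scale
weird_scores = pd.Series([3.0, 50.0, 4.0])  # rarity-style scale

# rank(pct=True) maps each file's scores to within-file percentiles,
# so logs with very different score scales share one comparable y-axis.
conn_pct = conn_scores.rank(pct=True)
weird_pct = weird_scores.rank(pct=True)
```

After this step, the top row of each file sits at 1.0 regardless of whether its raw score was 0.9 or 50.0, which is what makes the combined page fair across detector families.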
If your goal is to find malicious flows or attack activity first, focus on:
- `conn.log`
- `http.log`
- `files.log`
- `ssh.log`
- `tls.log` if available
- `weird.log`
- `notice.log`
These logs are still processed, but the interpretation is different:
- `known_hosts.log`
- `known_services.log`
- `software.log`
- `packet_filter.log`
- `stats.log`
- `capture_loss.log`
They are useful for novelty, drift, and operating-context anomalies, not just for direct malicious-flow detection.
Do not assume every anomaly means the same thing:
- In `conn.log`, an anomaly usually means a strange flow pattern.
- In `http.log`, it often means a strange application transaction or a request tied to unusual content.
- In `files.log`, it often means suspicious content transfer behavior.
- In `weird.log` or `notice.log`, it usually means high-priority events or rare protocol/parser observations.
- In inventory logs, it usually means novelty or environmental drift.
- Scores are per-log rankings, not globally calibrated risk scores.
- This is unsupervised detection. It surfaces unusual behavior, not guaranteed malicious behavior.
- Inventory logs can produce valid novelty detections that are operationally interesting but not necessarily attacks.
- The current implementation relies on the fields present in each Zeek log. Sparse logs naturally produce simpler detectors.
Open an issue or pull request and we will review it.
This project was created by Sebastian Garcia and Veronica Valeros at the Stratosphere Research Laboratory, AIC, FEE, Czech Technical University in Prague.