A unified framework for detecting point and collective anomalies in operating system logs via collaborative transformers
Key Summary
- CoLog is a new AI system that reads computer logs like a story and spots both single strange events (point anomalies) and strange patterns over time (collective anomalies).
- It treats log anomaly detection like sentiment analysis: normal logs read like positive feelings, and abnormal logs read like negative feelings.
- Instead of looking at just words or just order, CoLog looks at both the meaning of each log line (semantic modality) and the order in which lines appear (sequence modality) at the same time.
- A special collaborative transformer lets these two views talk to each other so the model learns deeper, smarter patterns.
- A new attention trick called multi-head impressed attention helps one view guide the other, like two friends helping each other focus.
- A modality adaptation layer cleans and aligns the two views so they fit together in the same space, reducing noise and clashes.
- A balancing layer learns how much to trust each view before making a final decision, which reduces false alarms.
- Across seven benchmark datasets, CoLog reached about 99.6% precision, recall, and F1 score, beating strong existing methods.
- It detects point and collective anomalies in one unified model, so you don't need two separate systems.
- This matters for cybersecurity, system health, and keeping services running smoothly without surprises.
Why This Research Matters
Logs are the first place engineers look when something breaks or someone attacks a system. CoLog helps find problems faster and more accurately by understanding both what a log says and when it appears. With fewer false alarms and fewer misses, teams can focus on real issues, saving time and reducing downtime. A unified approach means simpler operations: one system to catch both sudden spikes and slow-burning patterns. This boosts reliability for apps you use every day, from streaming shows to online classes. In critical settings like hospitals or power grids, earlier detection can protect safety and service continuity.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you’re the hall monitor of a big school. Every time something happens—doors open, bells ring, lights flicker—you write it down in a notebook. That notebook is like a computer’s log file.
🥬 The Concept (System Logs): System logs are the computer’s diary of events. How it works: (1) the system writes each event with time and details, (2) many programs add their own notes, (3) the log grows fast, and (4) people or AI read it to spot problems. Why it matters: Without logs, we can’t know what went wrong or when. 🍞 Anchor: When your video game crashes and later works after an update, the log is where engineers look to find the bug.
🍞 Hook: You know how a smoke alarm notices unusual smoke even if it doesn’t know the exact cause?
🥬 The Concept (Anomaly): An anomaly is something that doesn’t fit the usual pattern. How it works: (1) learn what “normal” looks like, (2) compare new events to normal, (3) flag big differences as anomalies. Why it matters: Missing anomalies can mean missed attacks or failures. 🍞 Anchor: If your school lunch menu lists pizza every Friday and suddenly it’s octopus stew, that’s an anomaly.
🍞 Hook: Spotting one weird thing is easy, but sometimes a bunch of small clues together tell the real story.
🥬 The Concept (Point vs. Collective Anomalies): A point anomaly is one odd event; a collective anomaly is a pattern of events that together are odd. How it works: (1) point: check each event, (2) collective: check sequences/windows for unusual combos, (3) label what’s abnormal. Why it matters: Only watching single points can miss slow-burning issues; only watching patterns can miss urgent spikes. 🍞 Anchor: A single firecracker pop (point) vs. a steady drumbeat getting faster and louder (collective).
The world before: For years, people analyzed logs in two main ways. First, with rules: “If this exact message appears, raise an alarm.” That worked until logs changed wording or new problems appeared. Second, with classic machine learning: extract features by hand, then classify. But logs are huge and change fast, so handcrafting features falls behind. Deep learning helped by learning from sequences of log events like reading sentences, but most models picked only one way to look at logs (just the sequence or just the words), missing valuable clues.
The problem: Logs actually carry multiple kinds of information (modalities). Two big ones are (1) semantic modality: the meaning of each log line’s text, and (2) sequence modality: the order and context of events around it. Unimodal methods ignore one view. Many multimodal methods combine views too early or too late, or run separate models that never really talk. That leads to high-dimensional noise, feature clashes, and missing cross-view interactions.
Failed attempts: Early fusion (just glue features together) makes inputs bulky and noisy. Late fusion (combine final scores) misses the rich interplay between modalities (the “why did this line matter now?”). Separate models per modality increase complexity and duplicate work. Some time-series methods for collective anomalies use fixed windows or post-hoc scores, risking false alarms or missed patterns when real sequences vary.
The gap: We needed a single framework where modalities learn together and guide each other, with a way to (a) share information, (b) adapt representations into a common space, and (c) balance their contributions. We also needed one system to detect both point and collective anomalies, not two separate pipelines.
Real stakes: In daily life, this means fewer outages for your favorite apps, quicker detection of hacks, safer smart homes, and smoother streaming during big events. In data centers, an early catch can prevent a cascade of failures. In hospitals or factories, it can keep critical systems running and safe.
02 Core Idea
🍞 Hook: You know how two detectives working together spot clues one detective alone would miss?
🥬 The Concept (Collaborative Transformers): A collaborative transformer is an AI that lets different views of the same data (like words and order) teach each other while they learn. How it works: (1) encode each view with attention, (2) let one view guide the other via impressed attention, (3) clean and align both views in a shared space, (4) balance their influence before deciding. Why it matters: Without collaboration, you miss cross-view hints—like why a harmless word becomes dangerous in a certain order. 🍞 Anchor: A detective reading messages (semantics) while a partner tracks timelines (sequence) will solve the case faster together.
The “Aha!” moment in one sentence: Treat log anomaly detection like multimodal sentiment analysis and let a collaborative transformer make the semantic and sequence views teach each other, adapt to each other, and vote together—so one model catches both point and collective anomalies.
Three analogies for the same idea:
- Orchestra: The semantic modality is the melody (what is said), the sequence modality is the rhythm (when it’s said). The conductor (collaborative transformer) keeps them in sync, adjusts volumes (balancing), and cleans noise (adaptation), creating a harmonious performance (accurate detection).
- Two-eyes vision: One eye sees text meaning; the other eye sees context over time. Depth (true understanding) appears only when both eyes work together with precise alignment (MAL) and focus (impressed attention).
- Cooking team: One chef tastes ingredients (semantics), the other times the cooking (sequence). The head chef (balancer) decides how much to trust each chef’s advice for the perfect dish (final decision).
🍞 Hook: When reading, you naturally focus more on the important words and less on the fillers.
🥬 The Concept (Attention Mechanism): Attention lets AI weigh parts of input by importance. How it works: (1) compare items to what matters, (2) give high scores to helpful pieces, (3) mix them weighted by scores. Why it matters: Without attention, every word or event seems equally important, causing confusion. 🍞 Anchor: In “What’s the capital of France?”, attention boosts “capital” and “France,” leading to “Paris.”
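To make the weighting idea concrete, here is a minimal NumPy sketch of scaled dot-product attention, the primitive that CoLog's modules build on. The toy shapes and data are ours, not the paper's.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal attention: score items, softmax the scores, mix the values.

    Q: (n_queries, d), K: (n_keys, d), V: (n_keys, d_v).
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # how relevant is each key to each query
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V  # weighted mix of values

# Toy check: 3 log events embedded in 4 dims, attending over themselves.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
out = scaled_dot_product_attention(X, X, X)
print(out.shape)  # (3, 4)
```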
🍞 Hook: Imagine you’re rebuilding a paragraph by looking at all its words at once instead of one by one.
🥬 The Concept (Transformer): A transformer is an AI that uses attention to process sequences in parallel. How it works: (1) embed tokens, (2) compute attention across positions, (3) stack layers to learn complex patterns. Why it matters: Without transformers, long-range relationships are hard and slow to learn. 🍞 Anchor: It can connect “login” at the start with “denied” much later to flag a problem.
🍞 Hook: Sometimes you need one friend to guide another’s focus.
🥬 The Concept (Multi-Head Impressed Attention): Impressed attention lets one modality’s keys/values guide the other modality’s queries. How it works: (1) semantic asks questions (queries), (2) sequence supplies hints (keys/values), (3) multi-heads explore different relations in parallel. Why it matters: Without it, the views don’t truly interact, and subtle cross-view cues get lost. 🍞 Anchor: A timeline expert pointing out when certain words matter—like “failed” right after “admin login.”
🍞 Hook: Different languages need a translator to avoid misunderstandings.
🥬 The Concept (Modality Adaptation Layer, MAL): MAL re-expresses each modality in a higher-dimensional shared space and uses soft attention to clean noise and align information. How it works: (1) lift features to a shared high-dim space, (2) learn per-position weights, (3) recombine into a cleaner, compatible representation, (4) add residuals and normalize. Why it matters: Without MAL, mixing modalities can be messy and inconsistent. 🍞 Anchor: Turning messy notes from two classmates into a neat, common study sheet.
Before vs. After:
- Before: Separate or loosely fused models; fragile to noise, missed cross-modal clues, separate pipelines for point vs. collective anomalies.
- After: A single, end-to-end system where modalities collaborate, adapt, and get balanced—handling both anomaly types in one go.
Why it works (intuition):
- Cross-modal guidance (impressed attention) teaches each view when the other says “hey, this matters right here.”
- Adaptation (MAL) avoids incompatibility, letting apples and oranges become fruit juice in the same pitcher.
- Balancing learns how much to trust each view per case—essential when one view is noisy or less informative.
- Unified training improves data efficiency and consistency.
Building blocks:
- Attention and transformer encoders for each modality.
- Multi-Head Impressed Attention for cross-modal focus.
- Modality Adaptation Layer for cleaning and aligning.
- Balancing Layer for weighting modalities before fusion.
- Classifier for final point and collective decisions.
🍞 Hook: Sometimes one viewpoint should speak louder; sometimes the other.
🥬 The Concept (Balancing Layer): The balancing layer learns weights for each modality before fusing them. How it works: (1) project to a shared latent space, (2) compute soft attention over modalities, (3) weight and sum, (4) classify. Why it matters: Without balancing, a weaker modality can drown out a stronger one, causing mistakes. 🍞 Anchor: In a debate, the moderator gives speaking time based on who has the best evidence for that question.
03 Methodology
At a high level: Raw Logs → Preprocess & Parse → Build Semantic and Sequence Modalities → Class Balance (Tomek link) → Collaborative Transformer Blocks (with Multi-Head Impressed Attention) → Modality Adaptation Layer → Balancing Layer (latent fusion) → Classifier → Anomaly labels (point or point+collective).
Step 1: Preprocessing & Parsing 🍞 Hook: Turning messy notes into neat flashcards makes studying easier.
🥬 The Concept (Log Parsing): Log parsing turns unstructured log text into structured fields and messages. How it works: (1) split lines into parts like time, level, message, (2) normalize text (lowercase, tokenize), (3) keep the message for meaning. Why it matters: Without parsing, the model wastes effort on irrelevant bits and gets confused. 🍞 Anchor: Pulling only the sentence from a chat message and ignoring the timestamp.
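A minimal sketch of the parsing step, assuming a simple timestamp/level/source/message layout; real datasets such as BGL or Hadoop each need their own pattern, and the field names here are illustrative.

```python
import re

# Hypothetical line format; real log sources differ.
LINE = "2024-05-01 12:00:03 ERROR kernel: data TLB error interrupt"
PATTERN = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>[A-Z]+)\s+(?P<source>\S+?):\s+(?P<message>.*)"
)

def parse_line(line: str) -> dict:
    """Split a raw line into structured fields; keep the message for meaning."""
    m = PATTERN.match(line)
    if m is None:
        return {"message": line.strip().lower()}  # fall back to raw text
    fields = m.groupdict()
    fields["message"] = fields["message"].lower()  # normalize for embedding
    return fields

print(parse_line(LINE)["message"])  # 'data tlb error interrupt'
```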
🍞 Hook: Two popular helpers: one uses patterns; one uses learned entities.
🥬 The Concept (Drain & NER Log Parser): Drain uses a fixed-depth parse tree and token similarity to group messages into templates quickly; a NER log parser (with BiLSTM) tags named entities like timestamps/hosts automatically. How it works: (1) Drain groups messages by structure, (2) NER tags fields learned from many logs, (3) both output cleaner messages. Why it matters: Without good parsers, you mix apples (messages) with pebbles (IDs), adding noise. 🍞 Anchor: Drain is like sorting library books by spine pattern; NER is like a librarian who recognizes names and dates in any book.
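For the Drain side, a hedged sketch using the open-source drain3 package, one maintained implementation of the Drain algorithm; the paper's exact parser configuration may differ.

```python
# pip install drain3  -- an open-source implementation of Drain.
from drain3 import TemplateMiner

miner = TemplateMiner()
messages = [
    "Connection from 10.0.0.1 closed",
    "Connection from 10.0.0.9 closed",
    "FATAL data TLB error interrupt",
]
for msg in messages:
    result = miner.add_log_message(msg)
    # Tokens that vary within a cluster (here, the IPs) collapse into
    # wildcards, leaving a cleaner template for the semantic modality.
    print(result["template_mined"])
```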
Step 2: Build Modalities 🍞 Hook: Looking at a photo and also at the order you swipe through photos tells a richer story.
🥬 The Concept (Semantic Modality): The semantic modality captures the meaning of each log line. How it works: (1) tokenize words, (2) embed words into vectors (e.g., 300-d), (3) optionally use sentence embeddings like SBERT (e.g., 384-d), (4) pad/truncate to fixed length. Why it matters: Without meaning, the model can’t tell “warning” from “ok.” 🍞 Anchor: Turning “kernel fatal error” into a vector that says “this sounds bad.”
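A sketch of building the semantic modality with the sentence-transformers library. all-MiniLM-L6-v2 is one common model that outputs 384-d vectors, matching the dimension mentioned above, though the paper's exact embedding model may differ.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-d sentence vectors
messages = [
    "instruction cache parity error corrected",
    "fatal data tlb error interrupt",
]
semantic = model.encode(messages)  # one vector per log line
print(semantic.shape)  # (2, 384)
```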
🍞 Hook: Events make more sense with their neighbors.
🥬 The Concept (Sequence Modality: background/context windows): The sequence modality stacks the semantic vectors of nearby events as background (before) or context (before+after). How it works: (1) pick window W, (2) collect W previous (and maybe W next) event embeddings, (3) stack into a matrix, (4) feed to the model. Why it matters: Without neighbors, you miss if a normal-looking line happens at a very wrong time. 🍞 Anchor: A single “restart” might be fine during maintenance but scary during peak traffic.
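A minimal NumPy sketch of the windowing step; the function name, zero-padding at the boundaries, and layout are our assumptions.

```python
import numpy as np

def build_sequence_modality(semantic, w, bidirectional=False):
    """Stack neighbor embeddings around each event.

    semantic: (n_events, d) matrix of per-line embeddings.
    Returns (n_events, window, d), where window = w (background: previous
    events only) or 2*w (context: previous and following events).
    """
    n, d = semantic.shape
    padded = np.pad(semantic, ((w, w), (0, 0)))  # zero-pad the boundaries
    windows = []
    for i in range(n):
        before = padded[i:i + w]                 # w previous events
        after = padded[i + w + 1:i + 2 * w + 1]  # w following events
        windows.append(np.vstack([before, after]) if bidirectional else before)
    return np.stack(windows)

seq = build_sequence_modality(np.random.randn(10, 384), w=3)
print(seq.shape)  # (10, 3, 384)
```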
Step 3: Handle Class Imbalance 🍞 Hook: If your class has 25 soccer fans and only 2 chess fans, the vote always skews soccer unless you fix it.
🥬 The Concept (Tomek Link): Tomek link under-sampling removes borderline majority samples that confuse the boundary. How it works: (1) find nearest neighbor pairs from different classes, (2) drop majority items in those pairs, (3) repeat until neighbors are cleaner. Why it matters: Without balancing, the model can ignore rare but crucial anomalies. 🍞 Anchor: Clearing out noisy lookalikes so the chess fans’ votes matter.
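Tomek-link under-sampling is available off the shelf in the imbalanced-learn library; a toy sketch with synthetic data.

```python
# pip install imbalanced-learn
import numpy as np
from imblearn.under_sampling import TomekLinks

rng = np.random.default_rng(0)
# Toy imbalanced set: 200 "normal" vectors, 20 "anomalous" ones.
X = np.vstack([rng.normal(0.0, 1.0, (200, 8)), rng.normal(0.5, 1.0, (20, 8))])
y = np.array([0] * 200 + [1] * 20)

# Drops majority-class points that form Tomek links (cross-class nearest
# neighbors), cleaning the class boundary without touching the rare class.
X_res, y_res = TomekLinks().fit_resample(X, y)
print(len(y), "->", len(y_res))
```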
Step 4: Collaborative Transformer Encoding 🍞 Hook: Two readers exchange notes while reading the same story.
🥬 The Concept (Collaborative Transformer with Multi-Head Impressed Attention): Each modality is encoded with attention while being guided by the other modality’s signals. How it works: (1) compute queries from the current modality, (2) take keys/values from the other modality, (3) apply multi-head attention, (4) add residuals and layer norms, (5) pass through MLP. Why it matters: Without cross-guidance, subtle relationships like “this word is only scary after those three events” can get missed. 🍞 Anchor: The timeline expert whispers, “Pay attention to that word now,” and the reader focuses.
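A compact PyTorch sketch of one such block, where queries come from the current modality and keys/values from the other. The module name, sizes, and layer layout are our assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ImpressedAttentionBlock(nn.Module):
    """Sketch of a collaborative block: the current modality asks (queries),
    the other modality answers (keys/values)."""

    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model))

    def forward(self, current, other):
        # Queries from `current`, keys/values from `other`: the other view
        # tells this view where to look.
        guided, _ = self.attn(query=current, key=other, value=other)
        x = self.norm1(current + guided)    # residual + layer norm
        return self.norm2(x + self.mlp(x))  # feed-forward + residual

sem = torch.randn(2, 16, 128)  # (batch, log lines, features)
seq = torch.randn(2, 16, 128)
sem_guided = ImpressedAttentionBlock()(sem, seq)  # sequence guides semantics
print(sem_guided.shape)  # torch.Size([2, 16, 128])
```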
Step 5: Clean and Align with MAL 🍞 Hook: Before mixing two paints, you strain each one so bits don’t clump.
🥬 The Concept (Modality Adaptation Layer): MAL lifts each modality into a higher-dimensional shared space and applies soft attention to denoise and align positions. How it works: (1) project to 2k dimensions, (2) learn per-node weights via softmax, (3) weighted sum to create a cleaner node representation, (4) residual + layer norm. Why it matters: Without MAL, fused features fight each other and add noise. 🍞 Anchor: Turning two messy notes into one clean, well-aligned study guide line-by-line.
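A hedged PyTorch sketch of the MAL idea as described above (lift, soft-attend, recombine, residual + norm); the exact projections and activations in the paper may differ.

```python
import torch
import torch.nn as nn

class ModalityAdaptationLayer(nn.Module):
    """Sketch of MAL: lift features into a wider shared space, softly
    reweight each position to suppress noise, then project back with a
    residual connection."""

    def __init__(self, d_model=128):
        super().__init__()
        self.lift = nn.Linear(d_model, 2 * d_model)  # shared high-dim space
        self.score = nn.Linear(2 * d_model, 1)       # per-position relevance
        self.proj = nn.Linear(2 * d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):  # x: (batch, positions, d_model)
        h = torch.tanh(self.lift(x))
        w = torch.softmax(self.score(h), dim=1)  # soft attention weights
        cleaned = self.proj(w * h)               # down-weight noisy positions
        return self.norm(x + cleaned)            # residual + layer norm

out = ModalityAdaptationLayer()(torch.randn(2, 16, 128))
print(out.shape)  # torch.Size([2, 16, 128])
```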
Step 6: Balance and Fuse 🍞 Hook: Sometimes the meaning matters more; sometimes the timing matters more.
🥬 The Concept (Balancing Layer & Latent Space): The balancing layer learns soft weights for each modality in a shared latent space before fusion. How it works: (1) project each modality output into a high-dim latent space, (2) compute attention-based weights per modality, (3) fuse by weighted sum, (4) normalize. Why it matters: Without balancing, a noisy view can dominate and trigger false alarms. 🍞 Anchor: In rain, you trust the weather radar (sequence) more than the sky color (semantics)—the balancer learns this.
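A minimal sketch of the balancing step; the latent size and the scoring layer are assumptions for illustration.

```python
import torch
import torch.nn as nn

class BalancingLayer(nn.Module):
    """Sketch: learn a soft weight per modality in a shared latent space,
    then fuse by weighted sum."""

    def __init__(self, d_model=128, d_latent=256):
        super().__init__()
        self.to_latent = nn.Linear(d_model, d_latent)
        self.score = nn.Linear(d_latent, 1)
        self.norm = nn.LayerNorm(d_latent)

    def forward(self, modalities):  # list of (batch, d_model) vectors
        latents = torch.stack([self.to_latent(m) for m in modalities], dim=1)
        weights = torch.softmax(self.score(latents), dim=1)  # per-modality trust
        return self.norm((weights * latents).sum(dim=1))     # weighted fusion

sem_vec, seq_vec = torch.randn(2, 128), torch.randn(2, 128)
fused = BalancingLayer()([sem_vec, seq_vec])
print(fused.shape)  # torch.Size([2, 256])
```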
Step 7: Classify (Point or Point+Collective) 🍞 Hook: After gathering clues, the detective gives the verdict.
🥬 The Concept (Unified Classification): The final MLP predicts labels. For point-only: normal vs. anomaly. For unified: four classes—(a) event only abnormal, (b) both event and background/context abnormal, (c) background/context only abnormal, (d) all normal. How it works: (1) take fused vector, (2) MLP + activation, (3) output probabilities, (4) choose class. Why it matters: Without a unified setup, you’d need two systems, causing inconsistency and extra cost. 🍞 Anchor: One referee who can call fouls on a single player or on the whole team play.
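A sketch of the unified four-way head; the class names below are our shorthand for the four cases listed above, and the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

CLASSES = [
    "event_abnormal",              # (a) point anomaly only
    "event_and_context_abnormal",  # (b) both
    "context_abnormal",            # (c) collective anomaly only
    "all_normal",                  # (d) nothing wrong
]

head = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, len(CLASSES)))
logits = head(torch.randn(2, 256))       # fused vectors from the balancer
pred = logits.argmax(dim=-1)
print([CLASSES[int(i)] for i in pred])
```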
Example with real-ish data: Suppose logs show many “INFO” messages, then suddenly “FATAL data TLB error interrupt.” The semantic modality flags “FATAL” and “error.” The sequence modality notices it follows unusual core dumps and memory issues. Impressed attention boosts the semantic focus because sequence says “this timing is bad.” MAL cleans and aligns both. The balancer gives more weight to sequence this time. The classifier marks this as anomaly, possibly both point and collective.
Secret sauce:
- Multi-Head Impressed Attention for true cross-modal teaching.
- MAL to harmonize modalities and reduce noise.
- Balancing layer to adaptively weight views per case.
- Concurrent encoding so the two views learn together, not one after the other.
04 Experiments & Results
The test: The researchers wanted to see if CoLog could (1) catch anomalies accurately, (2) handle both point and collective cases, and (3) generalize across many real log sources. They measured precision (how many alarms were correct), recall (how many true problems were found), F1 (the balance of both), and accuracy.
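These are the standard metrics; for reference, a tiny example of computing them with scikit-learn on toy labels.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 0, 1, 1, 0, 1]  # toy ground truth: 1 = anomaly
y_pred = [0, 0, 1, 0, 0, 1]  # toy model output
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f} "
      f"acc={accuracy_score(y_true, y_pred):.2f}")
```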
The data: CoLog was evaluated on multiple public operating-system log datasets, including supercomputers (BlueGene/L), big data platforms (Hadoop, Zookeeper, Spark), compromised Linux hosts (Honey5, Honey7), Windows logs, and other Linux sources (Casper, jhuisi, nssal). This covers millions of lines with varied formats and anomaly types.
The competition: CoLog was compared with strong deep-learning log detectors, including sequence-only, semantic-only, transformer-based, and prior multimodal fusion approaches (early/intermediate/late) that represent the state of the art.
The scoreboard (context included):
- Mean Precision ≈ 99.63%: Think of it as calling out almost only real fires, not false alarms—like getting an A+ for accuracy when others got closer to a B.
- Mean Recall ≈ 99.59%: CoLog found almost every real problem, missing very few—like searching a huge library and almost never overlooking the needed book.
- Mean F1 ≈ 99.61% across seven benchmarks: This is the balanced excellence—both careful and thorough.
Why these results matter: In log anomaly detection, false alarms waste team time; missed alarms risk outages or security incidents. Getting both near-perfect is rare and valuable. CoLog’s unified approach avoids the common pitfalls of multimodal fusion and connects dots that others miss.
Surprising findings:
- The balancing layer learned to trust different modalities for different datasets. For example, in some noisy text datasets, sequence carried more weight; in other cases with stable sequences but rich messages, semantics led. This dynamic weighting reduced false positives.
- The modality adaptation layer consistently improved results over simple concatenation or late fusion, suggesting that alignment-cleaning in a shared high-dimensional space is key.
- Unified point+collective detection avoided the awkwardness of running two pipelines. It reduced edge-case errors where a single weird event inside a clearly abnormal sequence was misclassified by separate systems.
Generalization and robustness: On datasets not used for training certain models (like Spark, Honey5, or Windows), CoLog still performed strongly, indicating that the collaborative learning of semantics+sequence captures reusable patterns. This is crucial because real-world logs evolve and vary across systems.
Efficiency notes: Transformers process sequences in parallel, and although CoLog includes extra modules (MAL, balancer), the overall runtime stayed practical for large logs, especially with GPU support. Early stopping and careful hyperparameter tuning helped keep training stable.
Bottom line: CoLog didn’t just inch past baselines; it delivered across-the-board gains with meaningful interpretability signals (attention weights) that help explain which words and which timeline positions mattered.
05 Discussion & Limitations
Limitations:
- Log quality dependency: If important events aren’t logged (e.g., logging is disabled, tampered, or missing), CoLog can’t detect them. It analyzes what’s written, not what isn’t.
- Supervised labeling: The strongest results come when labels (normal/anomalous) are available or can be reasonably inferred. Building labels in new domains can be costly.
- Domain drift and format shifts: Rapidly changing log formats or terminology can still challenge embeddings and parsers, though multimodal learning helps.
- Window choices: Very small or very large context windows can underfit or overfit collective patterns; tuning windowing is important.
- Compute and memory: Collaborative transformers with MAL and balancing benefit from GPU acceleration and sufficient RAM, especially on massive logs.
Required resources:
- A modern GPU (e.g., NVIDIA A100 or similar) for training at scale; CPU-only is possible but slower.
- A log parser (like Drain or an NER-based parser) and a sentence embedding model (e.g., SBERT) to build modalities.
- Storage for large log corpora and intermediate embeddings.
- Basic MLOps to handle hyperparameter tuning, early stopping, and versioning.
When NOT to use:
- When critical events are known to be absent from logs (e.g., severe logging gaps or log forgery threats not mitigated).
- In ultra-low-latency embedded settings without GPU/accelerator where even lightweight transformers are too heavy.
- When you cannot parse or embed text due to strict privacy policies and no secure alternative is allowed.
- For domains where anomalies are defined purely by external, non-log signals (e.g., physical sensors only, no textual traces).
Open questions:
- Can we extend CoLog to semi/unsupervised settings with strong performance, reducing labeling costs?
- How can we make CoLog adapt online to log drift with continual learning while avoiding catastrophic forgetting?
- Can we automatically learn variable-length windows for collective anomaly detection without fixed-size tradeoffs?
- How can we bolster robustness against log poisoning or adversarial tampering?
- Can we provide richer, human-friendly explanations (beyond attention maps) that link causes, effects, and timelines?
06 Conclusion & Future Work
Three-sentence summary: CoLog is a unified, multimodal AI that reads log text meaning and event order together to detect both single odd events and odd patterns over time. It uses collaborative transformers with multi-head impressed attention so each view guides the other, a modality adaptation layer to align and clean representations, and a balancing layer to weight them smartly. Across many real datasets, CoLog achieves near-perfect precision, recall, and F1 while offering interpretability and practicality.
Main achievement: Turning log anomaly detection into multimodal sentiment analysis inside a collaborative transformer—so semantics and sequence cross-teach, adapt, and balance—delivering one coherent system for both point and collective anomalies.
Future directions: Make labeling lighter via semi/unsupervised learning, add online adaptation to log drift, strengthen defenses against adversarial log manipulation, and advance explanations from attention weights to causal storylines that analysts can trust.
Why remember this: CoLog shows that the best way to understand logs is like understanding a conversation—what was said and when it was said, together. When these views truly collaborate, detection becomes sharper, more robust, and easier to trust.
Practical Applications
- Security Operations Centers (SOCs) to detect intrusions that show up as strange log messages or unusual event sequences.
- Cloud platform monitoring to catch service degradations early across many servers.
- Data pipeline health checks in systems like Hadoop and Spark to prevent job failures.
- Incident triage: rank alerts by confidence and highlight which words and timeline spots mattered.
- AIOps automation: trigger safe rollback or restart actions when collective anomalies form.
- IoT fleet monitoring where individual devices emit short logs but sequences across time reveal issues.
- Compliance and audit: document abnormal log patterns with interpretable attention evidence.
- Performance tuning: identify recurring slowdowns that only appear as collective patterns.
- Edge computing clusters: detect misconfigurations that produce subtle, context-dependent errors.
- DevOps release validation: spot risky anomalies right after deployment when logs shift.