WAY: Estimation of Vessel Destination in Worldwide AIS Trajectory

Intermediate
Jin Sob Kim, Hyun Joon Park, Wooseok Shin et al. · 12/15/2025
arXiv

Key Summary

  • Ships constantly broadcast AIS messages, but these messages are messy, unevenly spaced in time, and sometimes wrong.
  • This paper builds WAY, a deep learning model that predicts a ship’s destination port days or even weeks before arrival using worldwide AIS data.
  • The key trick is to reorganize each long trip into a nested sequence of grid areas that preserves fine details while reducing bias from irregular message timing.
  • WAY represents each trip as four channels: global location identity, local movement patterns, ship/port semantics, and time progress.
  • A Channel-Aggregative Sequential Processing (CASP) block learns what to focus on across channels and over time using attention.
  • A special training method called Gradient Dropout prevents very long trips from overpowering the learning signal.
  • On 5 years of global AIS data, WAY beats strong baselines (LSTM, GRU, Transformer, TrAISformer), reaching about 80% accuracy.
  • Gradient Dropout improves almost every model it is added to, not just WAY.
  • WAY can be extended to also estimate arrival time (ETA), cutting average error from 4.26 days (human-entered ETA) to ~3.0 days, though better labels are needed to go further.

Why This Research Matters

Global trade depends on ships arriving at the right ports at the right times, yet ports often get congested because arrivals are hard to predict early. By making accurate destination predictions days or weeks in advance, WAY helps ports prepare cranes, berths, and labor more efficiently. Shipping companies can optimize fuel use and routing, avoiding costly last-minute changes. Governments and coast guards can better monitor traffic patterns for safety and environmental protection. Consumers benefit too—fewer delays mean smoother supply chains and more reliable delivery of everyday goods. Over time, methods like WAY can help reduce emissions by cutting idle time and unnecessary detours at sea.

Detailed Explanation

01 Background & Problem Definition

🍞 Hook: You know how your phone’s map shows where you are as you move, but sometimes the GPS jumps around or freezes? Ships have a similar system, and it can be messy too.

🥬 The Concept: AIS (Automatic Identification System) is a global radio system where ships regularly send out their identity, position, speed, and other info. How it works: (1) Each ship transmits AIS messages; (2) Satellites and receivers collect them; (3) We get a giant timeline of where ships went. Why it matters: Without AIS, we can’t watch global marine traffic in near real time, plan port resources, or keep seas safe.

🍞 Anchor: Imagine 5 years of text messages from 5,000+ ships telling you, “I’m here, going this fast,” but sometimes late, misspelled, or missing. That’s AIS data.

🍞 Hook: Imagine trying to read a story where some pages have tons of lines and others have only a few. It’s hard to understand the plot consistently.

🥬 The Concept: Spatio-temporal bias in AIS means messages arrive at irregular times and places, so some areas are overrepresented while others are sparse. How it works: (1) Ships send messages more often in busy/coastal zones; (2) Open ocean has fewer messages; (3) Training models on this directly overfits crowded areas. Why it matters: If we don’t fix this, a model may do well near ports but fail far offshore.

🍞 Anchor: A model might think, “Everyone ends up in big ports,” simply because it saw more messages there, not because that’s always true.

🍞 Hook: Picture trying to predict where a road trip ends, but you’re only allowed to zoom in on one city at a time, ignoring the rest of the world.

🥬 The Concept: Region of Interest (ROI) constraints limit many past studies to small zones. How it works: (1) Choose a small map area; (2) Train and predict within it; (3) Ignore long, intercontinental trips. Why it matters: This fails for global routing, where ships cross oceans and multiple regions.

🍞 Anchor: It’s like guessing the final stop of a cross-country train while only looking at one station’s camera.

🍞 Hook: Think of replacing each word in a long sentence with a big square token—easy to handle, but you lose the exact letters that make the word special.

🥬 The Concept: Spatial grid tokenization turns positions into coarse grid IDs. How it works: (1) Split the ocean into big squares; (2) Map positions to square IDs (tokens); (3) Process sequences of tokens. Why it matters: This reduces timing bias but loses fine movement details and can explode parameters when grids get small.

🍞 Anchor: If two different paths pass through the same squares, the model may treat them as identical, even if speeds and turns were very different.

🍞 Hook: Imagine cutting a long movie into chapters (big blocks) and scenes (fine details) so you can understand both the big story and the small actions.

🥬 The Concept: Port-to-port annotation is the process of carving raw AIS streams into clean trips from a known departure port to a known destination port. How it works: (1) Match messy human-entered “destination” text to real ports with string similarity; (2) Verify docking using geometry and speed; (3) Remove illogical jumps with clustering; (4) Keep only clean, start-to-end journeys. Why it matters: Without clean trips, models learn from noise and contradictions, hurting predictions.

🍞 Anchor: After annotation, each trip is a tidy story: it starts at Port A, moves logically, and ends at Port B.

The World Before: Most methods focused on local areas, interpolated messages to even out time gaps, or tokenized the world into grids, often losing movement detail or struggling with long global routes.
The Problem: Predict the final port far ahead (days to weeks) using noisy, irregular, worldwide AIS.
Failed Attempts: Pure message-wise modeling overfits dense zones; pure grids remove useful details and require huge embedding tables.
The Gap: We need a representation that reduces timing bias without losing fine motion, and a model that fuses multiple kinds of ship information.
Real Stakes: Better destination and arrival predictions help fight port congestion, reduce delays and costs, and make ocean logistics smoother for the goods we all use.

02 Core Idea

The “Aha!” Moment (one sentence): Recast each long, messy AIS trip into a nested sequence of grid steps that still carries fine-grained motion and semantics, then teach a model (WAY) to fuse channel-wise information and long-range context with attention, while balancing training with Gradient Dropout.

Three Analogies:

  1. Book-and-chapters: The whole voyage is the book; grid steps are chapters; within each chapter are scenes (the fine movements). The model reads both chapters and scenes to guess the ending.
  2. Orchestra: Different instruments (channels: location identity, local motion, ship/port info, time) play together. Channel attention is the conductor who balances instruments, while self-attention remembers past melodies.
  3. Detective board: World map squares are the board; sticky notes hold detailed clues from each square; strings show timing. The model learns which notes matter most and how clues connect over time to catch the final destination.

Before vs After:

  • Before: Either message-wise models that choke on irregular timing or coarse grid tokens that hide details; heavy ROI limits.
  • After: A nested structure preserves small motions inside each grid step; four-channel representations capture space, movement, meaning, and time; attention fuses them globally; Gradient Dropout keeps learning fair across short and long trips.

Why It Works (intuition, no equations):

  • Nested sequences reduce timing bias (multiple raw messages per grid) while keeping movement details inside the grid.
  • Four channels separate what is different: where you are on Earth (spatial identity), how you moved locally (micro-patterns), who you are and where you left (semantics), and how far along you are in time (irregular time encoding).
  • Channel attention learns which channel to trust more at each step (e.g., motion early, port semantics later). Self-attention carries long-term hints without forgetting.
  • Gradient Dropout stops very long trips from flooding the training signal by randomly turning off some per-step updates for those longer samples.

Building Blocks (each with Sandwich):

🍞 Hook: Imagine stacking cups—big cups (grid steps) holding smaller cups (raw messages). You can carry both the big picture and the tiny splashes. 🥬 The Concept: Nested sequence structure organizes a trip into grid steps, each holding its local subsequence of messages. How it works: (1) Divide the world into uniform grids; (2) Group raw AIS points inside each grid step; (3) Keep both the grid ID and the fine movement snippets. Why it matters: You reduce time irregularity while preserving details. 🍞 Anchor: A transatlantic voyage becomes 100 grid steps, each with a handful of timestamped moves inside.

🍞 Hook: You know how coordinates on a globe are just numbers, but a compass rose feels like a pattern? 🥬 The Concept: Spatial Encoding turns latitude/longitude into a patterned vector that preserves spatial relationships. How it works: (1) Use sinusoidal functions to map (lat, lon) to high-dimensional waves; (2) Near places have similar patterns; (3) Far places differ cyclically (especially across longitudes). Why it matters: The model understands “where” without a huge lookup table. 🍞 Anchor: Two grids near Singapore get similar encodings; across the Pacific, they look very different.

🍞 Hook: Think of an odometer ticking forward—not evenly, but whenever the car moves. 🥬 The Concept: Time Encoding captures irregular time progress as smooth waves. How it works: (1) Convert elapsed days since trip start to sinusoidal features; (2) Cover short to very long timescales; (3) Add these to other channels so the model knows “how far along” it is. Why it matters: Without this, early and late steps look the same. 🍞 Anchor: Day 2 and Day 20 create different patterns, nudging the model to make bolder guesses later.

🍞 Hook: Watching a short clip from each chapter helps you remember a character’s behavior. 🥬 The Concept: Local pattern GRU summarizes the fine motion inside each grid step. How it works: (1) Feed the subsequence (speed, course, relative coords) into a GRU; (2) The last hidden state becomes the “micro-movement” summary for that step. Why it matters: Coarse grids alone miss sharp turns, slowdowns, or loitering. 🍞 Anchor: A ship that slows and turns near a strait stores a distinct local signature.

🍞 Hook: When four friends tell you a story—where, how, who, and when—you first decide whose input matters most, then stitch the story together in order. 🥬 The Concept: CASP (Channel-Aggregative Sequential Processing) fuses channels and passes information across time. How it works: (1) Multi-Head Channel Attention weights the four channels per step; (2) Masked Self-Attention (Transformer decoder-style) sends useful signals from past steps to the present; (3) A shared feed-forward network refines features. Why it matters: The model learns to trust different signals at different times and to remember long-range context. 🍞 Anchor: Early on, ship type might dominate; near coasts, local motion clues win.

🍞 Hook: A spotlight with many beams can highlight different clues at once. 🥬 The Concept: Multi-Head Channel Attention (MCA) lets multiple tiny “experts” compare channels from different perspectives. How it works: (1) Split features into several heads; (2) Each head scores channel importance; (3) Combine heads to get a robust per-step summary. Why it matters: One view can miss things; many views catch more patterns. 🍞 Anchor: One head keys on motion, another on departure port, another on global position.

🍞 Hook: When telling a story, you shouldn’t use hints from the future to guess the present. 🥬 The Concept: Masked Self-Attention (Transformer decoder) only lets the present look back, not forward. How it works: (1) Compute attention between current and past steps; (2) Block future steps; (3) Mix past info into the current feature. Why it matters: Prevents peeking ahead while capturing long-term dependencies better than plain recurrence. 🍞 Anchor: At step 20, the model can use steps 1–19, but not 21.

🍞 Hook: In a classroom with talkative students, a few long storytellers can drown out others. 🥬 The Concept: Gradient Dropout reduces training bias from very long trips. How it works: (1) Compute a sampling ratio that shrinks with trip length; (2) Randomly drop some per-step losses for long trips; (3) Balance updates across the batch. Why it matters: Keeps learning fair so short trips aren’t ignored. 🍞 Anchor: A 200-step crossing no longer overwhelms a 20-step coastal hop.

03 Methodology

High-level Overview: Input (annotated AIS trip) → Representation Layer (4 channels: spatial identity, local pattern, semantics, time) → Stacked CASP blocks (MCA → masked self-attention → shared feed-forward with residuals) → Classifier (destination port probabilities).

Step-by-step with Sandwich explanations and examples:

  1. Port-to-port Annotation (data preparation) 🍞 Hook: Imagine cleaning up a scribbly diary into neat chapters with clear starts and ends. 🥬 The Concept: Annotation extracts clean trips from noisy AIS logs. How it works: (1) From human-entered destinations, use string similarity (Damerau–Levenshtein) to propose port candidates; (2) Use geometry around port polygons and speed to confirm docking/undocking; (3) Use DBSCAN on edge features (time, distance, speed) to remove impossible jumps; (4) Keep only trips that start at one known port and end at another. Why it matters: Models trained on messy stories learn confusion, not patterns. 🍞 Anchor: A container ship’s 60-day voyage becomes a single, verified sequence: departed Busan, arrived Los Angeles.
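The string-matching step can be illustrated with a small sketch. The paper scores candidates with Damerau–Levenshtein distance; the snippet below substitutes Python's standard-library difflib ratio as a stand-in, and the PORTS list, the normalize helper, and the cutoff value are made up for illustration.

```python
from difflib import get_close_matches

# Toy port registry; a real pipeline uses a port database with polygons and IDs.
PORTS = ["BUSAN", "LOS ANGELES", "SINGAPORE", "ROTTERDAM", "SHANGHAI"]

def normalize(dest_field: str) -> str:
    """Uppercase and strip punctuation from a human-entered destination field."""
    return "".join(ch for ch in dest_field.upper() if ch.isalnum() or ch == " ").strip()

def match_destination(dest_field: str, cutoff: float = 0.75):
    """Propose the closest known port for a noisy destination string.

    The paper uses Damerau-Levenshtein similarity; difflib's ratio is only a
    quick stand-in for this sketch.
    """
    hits = get_close_matches(normalize(dest_field), PORTS, n=1, cutoff=cutoff)
    return hits[0] if hits else None

print(match_destination("LOS ANGELS"))    # -> LOS ANGELES (typo tolerated)
print(match_destination("PORT UNKNOWN"))  # -> None (no confident match)
```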

  2. Nested Sequence via Spatial Grids 🍞 Hook: Think of the ocean as a chessboard; each move you group all small wiggles inside that square. 🥬 The Concept: Group raw messages inside each 1×1-degree grid into a subsequence. How it works: (1) Sort AIS by time; (2) Assign points to grids; (3) Within each grid, sample a few messages (Poisson with λ≈5) to represent local behavior; (4) Convert positions to relative offsets from the grid center and compute time distances from the trip start. Why it matters: Reduces timing bias (many points in busy zones) but keeps fine movement clues. 🍞 Anchor: In a busy strait, 100 raw points compress into a small, representative micro-sequence.
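A minimal sketch of the grid grouping under simplifying assumptions: the helper names (to_grid_id, subsample, nest_trip), the point format, and the exact subsampling rule are illustrative, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def to_grid_id(lat, lon, cell=1.0):
    """Map a position to a 1x1-degree grid cell index (row, col)."""
    return (int(np.floor((lat + 90.0) / cell)), int(np.floor((lon + 180.0) / cell)))

def subsample(buffer, lam):
    """Keep up to Poisson(lam) points from one grid step, as offsets from the cell center."""
    k = min(len(buffer), max(1, rng.poisson(lam)))
    idx = sorted(rng.choice(len(buffer), size=k, replace=False))
    center_lat = np.floor(buffer[0]["lat"]) + 0.5
    center_lon = np.floor(buffer[0]["lon"]) + 0.5
    return [{"dlat": buffer[i]["lat"] - center_lat,
             "dlon": buffer[i]["lon"] - center_lon,
             "sog": buffer[i]["sog"], "cog": buffer[i]["cog"], "t": buffer[i]["t"]}
            for i in idx]

def nest_trip(points, lam=5):
    """Group time-ordered AIS points into grid steps, each holding a small subsequence.

    `points` is a list of dicts with keys: t (days since departure), lat, lon, sog, cog.
    """
    steps, current_id, buffer = [], None, []
    for p in sorted(points, key=lambda p: p["t"]):
        gid = to_grid_id(p["lat"], p["lon"])
        if gid != current_id and buffer:          # entered a new grid cell: flush the old one
            steps.append((current_id, subsample(buffer, lam)))
            buffer = []
        current_id = gid
        buffer.append(p)
    if buffer:
        steps.append((current_id, subsample(buffer, lam)))
    return steps

trip = [{"t": 0.0, "lat": 35.1, "lon": 129.0, "sog": 12.0, "cog": 90.0},
        {"t": 0.1, "lat": 35.1, "lon": 129.4, "sog": 12.5, "cog": 92.0},
        {"t": 0.3, "lat": 35.0, "lon": 130.2, "sog": 13.0, "cog": 95.0}]
print(nest_trip(trip))   # two grid steps: one around (35N, 129E), one around (35N, 130E)
```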

  3. Representation Layer (4 channels)

3a) Spatial Encoding (global identity) 🍞 Hook: A musical theme that changes predictably as you move around the globe. 🥬 The Concept: Map (lat, lon) to sinusoidal vectors that reflect real distances and periodicity. How it works: Combine sines/cosines at multiple frequencies so nearby grids look similar and far ones diverge; longitudes wrap cyclically. Why it matters: Captures “where on Earth” without a massive embedding table. 🍞 Anchor: Two adjacent Atlantic grids have close encodings; across the date line, the shift is smooth but distinct.
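A rough sketch of one way such an encoding could look. The exact frequencies and dimensionality used in the paper are not reproduced here; only the two properties described above (nearby similarity, cyclic longitude) are demonstrated.

```python
import numpy as np

def spatial_encoding(lat_deg: float, lon_deg: float, dim: int = 64) -> np.ndarray:
    """Sinusoidal encoding of latitude/longitude (illustrative recipe).

    Longitude enters through sin/cos of integer multiples of the angle so the
    encoding wraps smoothly across the date line; latitude is stretched to the
    same range and treated the same way. Nearby positions get similar vectors.
    """
    quarter = dim // 4
    k = np.arange(1, quarter + 1)        # integer frequencies 1..quarter
    lon = np.radians(lon_deg)            # cyclic over [-pi, pi)
    lat = np.radians(lat_deg) * 2.0      # stretch [-pi/2, pi/2] to [-pi, pi]
    return np.concatenate([np.sin(k * lon), np.cos(k * lon),
                           np.sin(k * lat), np.cos(k * lat)])

a = spatial_encoding(1.3, 103.8)    # near Singapore
b = spatial_encoding(1.3, 104.8)    # one grid cell to the east
c = spatial_encoding(33.7, -118.2)  # near Los Angeles
print(np.linalg.norm(a - b) < np.linalg.norm(a - c))  # True: neighbors encode similarly
```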

3b) Local Pattern GRU (micro-movements) 🍞 Hook: A short highlight reel per square. 🥬 The Concept: Summarize the sampled subsequence (relative lon/lat, speed, course, time deltas) via a compact GRU stack. How it works: Feed each point in order; keep the last hidden state as the local signature. Why it matters: Records turning, slowing, and loitering that grids alone miss. 🍞 Anchor: Approaching a harbor, the reel shows slow speed and frequent course changes.
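A minimal PyTorch sketch of this idea, with illustrative feature and hidden sizes rather than the paper's configuration:

```python
import torch
import torch.nn as nn

class LocalPatternGRU(nn.Module):
    """Summarize the sampled subsequence inside one grid step.

    Each point carries (relative lon, relative lat, speed, course, elapsed time);
    the GRU's final hidden state is that step's "micro-movement" signature.
    """
    def __init__(self, in_dim: int = 5, hidden: int = 64):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden, batch_first=True)

    def forward(self, sub: torch.Tensor) -> torch.Tensor:
        # sub: (batch, points_per_grid_step, in_dim)
        _, h_last = self.gru(sub)   # h_last: (num_layers, batch, hidden)
        return h_last[-1]           # (batch, hidden) local signature

step = torch.randn(1, 5, 5)                  # one grid step with 5 sampled points
print(LocalPatternGRU()(step).shape)         # torch.Size([1, 64])
```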

3c) Semantics (departure port, ship type) 🍞 Hook: Knowing both the traveler and their home base helps guess the destination. 🥬 The Concept: Learn embeddings for departure port and ship type. How it works: Lookup vectors from trainable tables; repeat them across steps; add timing via Time Encoding so their influence can evolve. Why it matters: Tankers, bulkers, and container ships follow different trade patterns; departures constrain likely arrivals. 🍞 Anchor: A tanker leaving Ras Tanura suggests likely refinery ports.
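A short sketch of the lookup-table idea in PyTorch; the vocabulary sizes, the indices, and the way the two embeddings are combined (summed here) are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: real vocabularies come from the port database and AIS ship-type codes.
NUM_PORTS, NUM_SHIP_TYPES, DIM = 3000, 32, 64

port_emb = nn.Embedding(NUM_PORTS, DIM)       # departure-port lookup table
type_emb = nn.Embedding(NUM_SHIP_TYPES, DIM)  # ship-type lookup table

departure = torch.tensor([1021])   # e.g., the index of Busan in the port vocabulary
ship_type = torch.tensor([7])      # e.g., a container-ship code

# Repeat the static semantic vector across all grid steps of the trip so the
# sequential model can combine it with the time encoding at every step.
steps = 120
semantics = (port_emb(departure) + type_emb(ship_type)).expand(steps, DIM)
print(semantics.shape)   # torch.Size([120, 64])
```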

3d) Time Encoding (irregular progression) 🍞 Hook: A progress bar that isn’t linear but still tells you how far along you are. 🥬 The Concept: Convert days since trip start into sinusoidal vectors covering short to long horizons. How it works: For each step’s last local timestamp, create multi-scale sines/cosines; add to other channels. Why it matters: Early steps need broader guesses; late steps should narrow sharply. 🍞 Anchor: Day 25 pushes the model toward likely endpoints for that route length.
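A possible multi-scale time encoding, sketched with NumPy; the period range and dimensionality are assumptions, not the paper's values.

```python
import numpy as np

def time_encoding(days_elapsed: float, dim: int = 64, max_period_days: float = 60.0) -> np.ndarray:
    """Multi-scale sinusoidal encoding of elapsed time since departure.

    Periods are spread geometrically from about one hour up to max_period_days,
    so both short coastal hops and multi-week crossings produce distinct patterns.
    """
    half = dim // 2
    periods = np.geomspace(1.0 / 24.0, max_period_days, half)   # in days
    phase = 2.0 * np.pi * days_elapsed / periods
    return np.concatenate([np.sin(phase), np.cos(phase)])

print(np.allclose(time_encoding(2.0), time_encoding(20.0)))  # False: day 2 and day 20 differ
```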

  4. CASP Block (per-layer processing)

4a) Multi-Head Channel Attention (MCA) 🍞 Hook: A mini-panel of experts voting which channel matters most right now. 🥬 The Concept: Per step, compare channels across multiple heads, then emphasize the most informative ones. How it works: Apply average/max pooling over channels, transform, get per-channel weights via sigmoid, reweight, pool to a single vector, project back. Why it matters: Early offshore, motion may dominate; near coasts, global location or semantics may grow. 🍞 Anchor: As a ship nears Europe, spatial identity gains weight.
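A single-head sketch of the pooling, scoring, and reweighting recipe described above; the real block is multi-head and its exact layer sizes are not given here, so treat this as one illustrative reading rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Single-head sketch of channel attention over the four per-step channels.

    Input: (batch, steps, channels=4, dim). Average- and max-pool over the feature
    dimension, score each channel, squash with a sigmoid, reweight, then average
    the channels into one per-step vector and project it back.
    """
    def __init__(self, dim: int = 64):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 1))
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C, D)
        pooled = torch.stack([x.mean(dim=-1), x.amax(dim=-1)], dim=-1)  # (B, T, C, 2)
        weights = torch.sigmoid(self.score(pooled))                     # (B, T, C, 1)
        fused = (weights * x).mean(dim=2)                               # (B, T, D)
        return self.proj(fused)

x = torch.randn(2, 100, 4, 64)        # 2 trips, 100 grid steps, 4 channels
print(ChannelAttention()(x).shape)    # torch.Size([2, 100, 64])
```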

4b) Masked Multi-Head Self-Attention (MSA) 🍞 Hook: You can consult your past notes but not future pages. 🥬 The Concept: Let the current step attend to all previous steps (but not future ones) to gather long-range context. How it works: Build Q/K/V per head, mask the future, softmax-weight past, mix values, and project. Why it matters: Remembers the whole voyage path efficiently, beyond what recurrence can hold. 🍞 Anchor: A transoceanic crossing pattern strongly suggests certain ports.
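Causal (masked) self-attention is standard Transformer-decoder machinery; the PyTorch snippet below shows the masking idea with illustrative sizes.

```python
import torch
import torch.nn as nn

T, D = 100, 64
x = torch.randn(2, T, D)   # (batch, grid steps, features)

# Causal mask: step t may attend to steps 1..t, never to future steps (True = blocked).
causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

msa = nn.MultiheadAttention(embed_dim=D, num_heads=4, batch_first=True)
out, attn = msa(x, x, x, attn_mask=causal_mask)
print(out.shape)           # torch.Size([2, 100, 64])
```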

4c) Shared Feed-Forward (SFF) and Residuals 🍞 Hook: A universal polisher that cleans every step the same way. 🥬 The Concept: Two linear layers with ReLU refine features at all steps/channels, with residual connections and layer norm. How it works: Apply the same small MLP everywhere; keep stability with residuals. Why it matters: Simple, stable feature shaping without overfitting. 🍞 Anchor: Every step gets the same finishing pass so the model stays consistent.

Design Twist: A specific channel (spatial identity) is repeatedly replaced by the MSA output across CASP layers, letting it accumulate fused context while others preserve per-step identities.

  5. Classifier and Training with Gradient Dropout 🍞 Hook: Don’t let marathoners drown out sprinters during practice. 🥬 The Concept: Predict the destination at every step and train with cross-entropy, but apply Gradient Dropout so extra-long trips don’t over-contribute. How it works: Compute a per-trip sampling rate inversely tied to (log) length; randomly drop some step losses for long trips. Why it matters: Fair gradients across diverse voyage lengths improve generalization. 🍞 Anchor: A 180-step trip contributes fewer step-updates than ten 18-step trips combined.
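One plausible way to implement the length-balancing idea is sketched below; the keep-probability schedule (inverse log trip length, with an assumed reference length ref_len) and the padding handling are assumptions, not the paper's exact formula.

```python
import torch

def gradient_dropout_loss(step_losses: torch.Tensor, lengths: torch.Tensor,
                          ref_len: float = 20.0) -> torch.Tensor:
    """Length-balanced many-to-many loss (a sketch of the Gradient Dropout idea).

    step_losses: padded (batch, max_steps) tensor of per-step cross-entropy values.
    lengths:     each trip's true number of grid steps.
    Each step of a trip is kept with probability ~ log(ref_len)/log(length),
    clipped to 1, so very long trips contribute roughly as many active
    step-losses as shorter ones.
    """
    B, T = step_losses.shape
    keep_prob = torch.clamp(
        torch.log(torch.tensor(ref_len)) / torch.log(lengths.float().clamp(min=2.0)),
        max=1.0)                                                   # (B,)
    keep = torch.rand(B, T) < keep_prob.unsqueeze(1)               # random per-step mask
    valid = torch.arange(T).unsqueeze(0) < lengths.unsqueeze(1)    # ignore padded steps
    mask = (keep & valid).float()
    return (step_losses * mask).sum() / mask.sum().clamp(min=1.0)

losses = torch.rand(3, 200)                # toy per-step losses
lengths = torch.tensor([18, 60, 200])      # short, medium, and very long trips
print(gradient_dropout_loss(losses, lengths))
```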

Concrete Example (toy): A container ship leaves Shanghai (departure=Shanghai; type=Container). Early steps: MCA favors semantics+time; the model predicts West Coast US broadly. Mid-Pacific: motion + spatial identity dominate; prediction narrows to LA/Long Beach. Near shore: local GRU signals slow-and-align; final guess locks to Los Angeles days before arrival.

04 Experiments & Results

The Test: Predict destination port at every grid step along each trip; report (1) overall accuracy; (2) accuracy by quartiles of trip progress (early to late); and (3) macro F1 to account for class imbalance. Why? Because we care both about early, long-range guesses and final certainty, and we want fairness across ports with different frequencies.
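For concreteness, here is how the three reported quantities could be computed with scikit-learn; the arrays below are made-up toy data, and the quartile binning is one straightforward interpretation of "accuracy by trip progress".

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# y_true/y_pred hold destination-port IDs for every grid step of every test trip;
# `progress` is each step's relative position within its trip (0 = start, 1 = end).
y_true = np.array([3, 3, 3, 7, 7, 7, 7, 7])
y_pred = np.array([7, 3, 3, 3, 7, 7, 7, 7])
progress = np.array([0.1, 0.5, 0.9, 0.1, 0.3, 0.6, 0.8, 1.0])

print("overall accuracy:", accuracy_score(y_true, y_pred))
print("macro F1:", f1_score(y_true, y_pred, average="macro"))

# Accuracy by quartile of trip progress (early predictions are the hard part).
for q in range(4):
    lo, hi = q / 4, (q + 1) / 4
    in_bin = (progress >= lo) & ((progress <= hi) if q == 3 else (progress < hi))
    print(f"Q{q+1} accuracy:", accuracy_score(y_true[in_bin], y_pred[in_bin]))
```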

The Competition: LSTM, LSTM+Attention, GRU, GRU+Attention (classic sequence models), Transformer-decoder (strong modern baseline), and TrAISformer (Transformer-based with tokenized AIS features). All baselines use global spatial grids; some rely heavily on big embedding tables.

The Scoreboard (with context):

  • WAY: 79.45% accuracy overall; with Gradient Dropout (GD): 80.44%—like jumping from a solid B to a strong A-minus.
  • Best baseline (TrAISformer multi-resolution): 64.60%—a big gap (≈+15 to +16 percentage points).
  • Early-quarter performance: WAY ≈71–72% vs baselines ~41–50%—WAY is much better at early predictions, days/weeks ahead.
  • Macro F1: WAY ~49.45% (52.01% with GD) vs best baselines ~33–37%—WAY remains more balanced across ports.

Surprising/Notable Findings:

  • Gradient Dropout helps almost every model (e.g., TrAISformer gains roughly 1.5% in accuracy and about 3% in macro F1), confirming the length-bias problem in many-to-many training.
  • Attention-only sequence processing (Transformer-decoder) outperforms recurrent models on long global sequences, but still trails WAY by a wide margin, showing the power of the four-channel representation and channel attention.
  • Ablations (feature importance): Removing local movement patterns hurts most (accuracy drops to ~64.7%), confirming that fine-grained motion is crucial; removing departure or ship type also reduces performance.
  • Aggregation methods: Simple concatenation or cross-attention underperform; Multi-Head Channel Attention gives the best fusion of channels.
  • Model capacity: WAY-base (≈2.0M params) beats larger baselines (≈4.6–5.0M) and even tiny WAY variants (0.24–0.52M) already outperform many baselines—strong parameter efficiency.

ETA Multitask Extension: Adding an ETA regression head (WAY-Mul) cuts arrival time error from human-entered ~4.26 days to ~3.0 days on average, while keeping destination accuracy near single-task levels. However, ETA labels are noisy (waiting times, congestion), so there’s room for improvement.

Takeaway: WAY’s nested representation plus CASP and GD delivers large, consistent gains, especially for early, long-range destination predictions.

05 Discussion & Limitations

Limitations:

  • ETA labels: Human-entered or noisy arrival times (waiting, congestion) limit how well a model can learn ETA; destination is strong, ETA still needs cleaner labels.
  • Data quality: AIS can include missing values, delays, and human typos; though the annotation/refinement pipeline helps, garbage-in still risks garbage-out.
  • Grid size trade-off: 1×1-degree grids balance coverage and detail; smaller grids increase compute, larger grids lose nuance. Adaptive grids could help.
  • New/rare ports: Cold-start destinations with few examples are harder; semantic embeddings help but don’t fully fix scarcity.
  • External factors: Weather, currents, regulations, and port operations aren’t directly modeled; these can alter routes/times.

Required Resources:

  • Data: Multi-year global AIS feeds plus a reliable port database with polygons/IDs.
  • Compute: A modern GPU for training (attention layers), memory for long sequences; efficient batching for variable-length trips.
  • Engineering: Annotation/refinement pipeline (string similarity, geometric checks, DBSCAN) and careful training with GD.

When NOT to Use:

  • Tiny local studies with dense, clean, fixed-interval AIS may not need this complexity; a simpler local model could suffice.
  • Real-time sub-minute micro-maneuvering (e.g., collision avoidance) where fine-grained physics filters/Kalman methods shine.
  • Scenarios without reliable departure or ship type metadata—semantics matter here.

Open Questions:

  • Can adaptive or learned grids improve both detail and efficiency?
  • How much do weather, currents, and traffic conditions improve destination/ETA if integrated?
  • Can self-supervised pretraining on global AIS help rare ports/routes?
  • Better ETA labels: Can we separate sailing time from waiting time with new annotations?
  • Active learning: How to prioritize labeling/cleaning trips that would boost accuracy the most?

06 Conclusion & Future Work

Three-Sentence Summary: This paper introduces WAY, a model that predicts ship destinations globally by reorganizing AIS data into a nested sequence that keeps both big-picture grid steps and fine local movements. WAY fuses four channels—spatial identity, local motion, semantics, and time—using channel attention and masked self-attention, and it balances training with Gradient Dropout. On five years of worldwide AIS, WAY outperforms strong baselines, works earlier in voyages, and even reduces ETA error when extended to multitask learning.

Main Achievement: Showing that a nested sequence representation plus channel-aware attention drastically improves long-range, global destination prediction from noisy, irregular AIS—achieving around 80% accuracy and strong early-stage performance.

Future Directions: Add external signals (weather, currents, port queues); design adaptive grids; pretrain on massive unlabeled AIS; build better ETA labels that separate sailing vs. waiting time; improve cold-start ports via meta-learning.

Why Remember This: WAY demonstrates that carefully structuring messy real-world data and fusing the right signals at the right times can unlock big accuracy gains—turning a chaotic stream of ship messages into reliable, early port destination forecasts that help reduce congestion, save fuel, and keep global trade flowing.

Practical Applications

  • Early port resource planning: Predict arriving ships and schedule berths, pilots, and cranes in advance.
  • Congestion mitigation: Re-route or stagger arrivals when a port is forecast to be overloaded.
  • Fuel optimization: Adjust speeds and paths earlier based on likely destinations to save fuel.
  • Customs and security pre-clearance: Start paperwork earlier for likely arrivals to reduce dwell time.
  • Supply chain planning: Alert shippers/receivers of likely arrival ports to prep trucking and warehousing.
  • Maritime domain awareness: Monitor unusual destination patterns for safety or compliance.
  • Dynamic ETA updates: Combine WAY’s destination with improved ETA models for rolling arrival forecasts.
  • Fleet management: Allocate ships and crews based on predicted flows between common port pairs.
  • Insurance and risk assessment: Anticipate traffic density near sensitive areas and adjust coverage.
  • Disruption response: During storms or strikes, forecast destination shifts to coordinate contingency plans.
Tags: AIS trajectory · vessel destination prediction · nested sequence · spatial encoding · time encoding · channel attention · transformer decoder · masked self-attention · gradient dropout · port-to-port annotation · DBSCAN refinement · global maritime surveillance · spatio-temporal bias · multichannel representation