
Transfer Learning Theory

Key Points

  • Transfer learning theory studies when and why a model trained on a source distribution will work on a different target distribution.
  • Domain adaptation bounds decompose target error into source error, a distribution-divergence term, and an irreducible joint-error term.
  • The HΔH-divergence measures how differently two domains make hypotheses in a class disagree; it can be estimated via a domain classifier (proxy A-distance).
  • Under covariate shift (same conditional p(y|x), different p(x)), importance weighting makes source risk match target risk in expectation.
  • Large divergence or large joint-optimum error λ makes transfer hard even if source error is small.
  • Practical estimation uses holdout sets, regularized classifiers, and weight clipping to control variance and overfitting.
  • Feature learning that reduces domain divergence while keeping label prediction accurate is the core of many successful methods.
  • C++ implementations can compute proxy A-distance and perform importance-weighted logistic regression to realize the theory.

Prerequisites

  • →Probability and Random Variables — Understanding distributions, expectations, and conditional probabilities is essential for defining risks and shifts like p(y|x) and p(x).
  • →Statistical Learning Theory Basics — Concepts like hypothesis classes, loss, and generalization bounds underpin transfer learning theory.
  • →Logistic Regression — Used as a simple classifier for both domain discrimination and label prediction in the implementations.
  • →Gradient Descent and Regularization — Required to train logistic models efficiently and stably on finite data.
  • →Density Ratio Estimation — Core to importance weighting under covariate shift; understanding odds-ratio tricks helps implement robust estimators.
  • →Concentration Inequalities — Needed to interpret finite-sample deviations in divergence and risk estimates.
  • →Representation Learning — Many adaptation methods reduce divergence via learned features; knowing this aids practical application.

Detailed Explanation


01 Overview

Transfer learning theory provides mathematical tools to understand when knowledge from a source domain (data distribution) can help on a different target domain. In supervised learning, we usually assume train and test data come from the same distribution; transfer learning breaks this assumption. Domain adaptation—one key setting—assumes labels are scarce or absent in the target domain. The central question becomes: how does the target error of a hypothesis relate to quantities we can measure on the source domain and between domains? Classic results show that target risk can be bounded by source risk plus a measure of domain discrepancy and a term capturing how well the hypothesis class can do on both domains simultaneously. These components guide algorithm design: reduce source error, shrink divergence, and choose a hypothesis class with small joint error. In practice, we estimate divergence by training a classifier to discriminate source from target (proxy A-distance), and we reduce effective divergence by learning domain-invariant features or by reweighting source samples to look like target (importance weighting under covariate shift). The theory also warns us: if domains are too different or labels change across domains, transfer may fail regardless of how well you perform on the source. This perspective yields principled diagnostics and algorithms that align with empirical success in modern deep transfer learning.

02 Intuition & Analogies

Imagine you learned to ride a bike in your quiet neighborhood (source). Now you try biking in a busy city (target). Your skills (model) mostly transfer, but some habits fail because the environment differs: more traffic, different rules. Two things matter: (1) how good your biking is in your neighborhood (source error), and (2) how different the city is from your neighborhood (domain divergence). If the city’s rules for balance and pedaling (labeling function) are the same but the traffic density (input distribution) is different, you can adapt by practicing more in conditions that mimic the city—like riding during rush hour (importance weighting). But if the city expects you to ride on the left side (label semantics changed), your old habits may hurt—the best you can do using your current skills is still bad (the joint-optimum error λ is large). Now think of a model that predicts whether an email is spam. If you train on emails from Company A but deploy at Company B, the words people use differ (p(x) changes), even if what counts as spam (p(y|x)) is the same. If you reweight Company A’s emails to emphasize those that look like Company B’s emails, your training will better reflect deployment. To quantify how different the companies’ email styles are, train a domain classifier to tell which company an email came from. If it easily separates them, the domains are far apart; if it struggles, they are similar. The theory says your target error is bounded by source error plus a term that grows with this separability and a term reflecting whether any model in your class can do well on both companies. This sandwich view—performance, difference, and feasibility—provides a map for transfer.

03 Formal Definition

Let \(S\) and \(T\) be distributions over \(X \times Y\) with input marginals \(S_X\) and \(T_X\). For a hypothesis \(h: X \to \{0,1\}\) and 0–1 loss, define the source and target risks as \(\epsilon_S(h) = \Pr_{(x,y)\sim S}[h(x) \neq y]\) and \(\epsilon_T(h) = \Pr_{(x,y)\sim T}[h(x) \neq y]\). For a hypothesis class \(H\), the symmetric-difference hypothesis class is \(H\Delta H = \{x \mapsto h(x) \oplus h'(x) : h, h' \in H\}\). The \(H\Delta H\)-divergence between \(S_X\) and \(T_X\) is \(d_{H\Delta H}(S_X, T_X) = 2 \sup_{g \in H\Delta H} \lvert \Pr_{x\sim S_X}[g(x)=1] - \Pr_{x\sim T_X}[g(x)=1] \rvert\). A central bound (Ben-David et al.) states that for any \(h \in H\), \(\epsilon_T(h) \leq \epsilon_S(h) + \frac{1}{2} d_{H\Delta H}(S_X, T_X) + \lambda\), where \(\lambda = \min_{h \in H} [\epsilon_S(h) + \epsilon_T(h)]\) is the joint-optimum error of \(H\). The divergence term captures how differently hypotheses in \(H\) can disagree across the two input distributions, and \(\lambda\) captures label/conditional shift unattainable by \(H\). Under covariate shift, we assume \(p_S(y \mid x) = p_T(y \mid x)\) while \(p_S(x) \neq p_T(x)\). Then for any loss \(\ell\), the target risk equals a weighted source risk: \(E_T[\ell(h(x),y)] = E_S[w(x)\,\ell(h(x),y)]\) with density ratio \(w(x) = p_T(x)/p_S(x)\). These definitions enable empirical procedures: estimate divergence via a domain classifier (proxy A-distance) and estimate density ratios to perform importance-weighted ERM.

04 When to Use

  • Unlabeled or sparsely labeled target domain: When you have labeled source data but little to no labeled target data, domain adaptation bounds clarify what is and is not possible and guide choices like feature alignment or reweighting.
  • Covariate shift: If the labeling mechanism is stable (same p(y|x)) but the input distribution changes (e.g., training on daytime images, testing at dusk), importance weighting is appropriate and unbiased in expectation.
  • Model selection under shift: Use proxy A-distance to compare candidate feature extractors; prefer representations that reduce divergence without harming source accuracy.
  • Risk diagnostics: If your source error is low but target performance is poor, compute a divergence estimate. A high divergence suggests data/feature mismatch; a low divergence with poor target performance suggests large λ (e.g., label shift or insufficient hypothesis class).
  • Limited compute: When deep adversarial alignment is impractical, light-weight methods like logistic-regression-based proxy A-distance and density ratio weighting can produce strong baselines.
  • Safety-critical deployment: Bounds can motivate collecting a minimal set of labeled target points to estimate λ or to validate assumptions before deployment.

⚠️Common Mistakes

  • Confusing types of shift: Applying covariate-shift importance weighting when label shift (p(y) changes, p(x|y) stable) or concept drift (p(y|x) changes) holds leads to biased training. Diagnose the type of shift first.
  • Overfitting the domain classifier: A too-powerful, unregularized classifier on small data can overestimate divergence. Use holdout validation, regularization, and report uncertainty.
  • Ignoring the joint-optimum term λ: A small divergence does not guarantee good transfer if no hypothesis in your class fits both domains. Consider enriching the hypothesis class or collecting a few target labels.
  • Unstable density-ratio estimates: Ratios can explode in regions where p_S(x) is tiny. Use weight clipping, regularization, and monitor effective sample size to control variance.
  • Data leakage: Accidentally using target labels during feature selection or divergence estimation biases results. Keep target labels isolated unless explicitly allowed.
  • Miscalibrated priors in ratio estimation: When estimating density ratios via a domain classifier, forgetting to account for class priors (imbalance of source vs target samples) biases weights. Balance or correct for priors.
  • Treating proxy A-distance as absolute truth: It is an estimator tied to a hypothesis class and finite samples; report confidence intervals and use it comparatively across methods, not as a sole decision rule.

Key Formulas

Source and Target Risk

\(\epsilon_S(h) = \Pr_{(x,y)\sim S}[h(x) \neq y], \qquad \epsilon_T(h) = \Pr_{(x,y)\sim T}[h(x) \neq y]\)

Explanation: These define the expected 0–1 loss (misclassification rates) of hypothesis h on the source and target distributions. They are the central quantities transfer learning theory aims to relate.

HΔH-Divergence

\(d_{H\Delta H}(S_X, T_X) = 2 \sup_{h,h' \in H} \left\lvert \Pr_{x\sim S_X}[h(x) \neq h'(x)] - \Pr_{x\sim T_X}[h(x) \neq h'(x)] \right\rvert\)

Explanation: This measures how much hypotheses in H can disagree differently across the two domains. Larger values indicate domains are easier to tell apart using functions from H.

Ben-David Adaptation Bound

\(\epsilon_T(h) \leq \epsilon_S(h) + \frac{1}{2} d_{H\Delta H}(S_X, T_X) + \lambda, \qquad \lambda = \min_{h \in H}\left[\epsilon_S(h) + \epsilon_T(h)\right]\)

Explanation: The target error is at most the source error plus half the HΔH-divergence and the best joint error achievable by the hypothesis class. It explains why reducing divergence and choosing a good class are both crucial.

Proxy A-Distance (PAD)

\(\hat d_A = 2(1 - 2\hat\epsilon)\)

Explanation: Given the error \(\hat ϵ\) of a domain classifier on a holdout set, the PAD estimates the HΔH-divergence up to constants. Low domain-classifier accuracy implies small divergence (domains look similar).

Importance Weighting Identity

\(E_T[\ell(h(x), y)] = E_S\!\left[\frac{p_T(x)}{p_S(x)}\, \ell(h(x), y)\right]\)

Explanation: Under covariate shift (same conditional p(y|x)), the target risk equals a weighted source risk. This justifies training with weights equal to the density ratio.

Density Ratio via Domain Classifier

\(w(x) = \frac{p_T(x)}{p_S(x)} \approx \frac{\hat p(z=1 \mid x)}{\hat p(z=0 \mid x)} \cdot \frac{\pi_0}{\pi_1}\)

Explanation: If a logistic classifier predicts domain label z with priors \(π0​,π1​\) for source/target, then the ratio of predicted odds, adjusted by priors, approximates the density ratio needed for importance weighting.

Effective Sample Size

\(n_{\mathrm{eff}} = \frac{\left(\sum_{i=1}^{n} w_i\right)^2}{\sum_{i=1}^{n} w_i^2}\)

Explanation: This quantifies how many samples effectively contribute when using weights. A small value indicates high variance due to a few large weights; clipping or regularization may be needed.

Finite-Sample Correction (Informal)

\(\hat d_{H\Delta H} \leq \hat d_A + O\!\left(\sqrt{\frac{\log(1/\delta)}{n}}\right)\)

Explanation: Empirical divergence estimates deviate from the population value by a sample-complexity term. With more unlabeled samples, the estimate concentrates around the true divergence.

Rademacher Complexity

\(R_n(H) = E_{\sigma, X}\!\left[\sup_{h \in H} \frac{1}{n} \sum_{i=1}^{n} \sigma_i\, h(x_i)\right]\)

Explanation: This capacity measure often appears in sharper domain adaptation bounds. Smaller Rademacher complexity implies better generalization from empirical to population quantities.

Time Complexity Notation Example

\(O(n \log n)\)

Explanation: Big-O characterizes asymptotic runtime. For instance, O(n log n) means time grows roughly proportional to n times the logarithm of n.

Complexity Analysis

Let ns​, nt​ be the number of source and target samples and d the feature dimension. Training a logistic regression domain classifier with full-batch gradient descent for E epochs costs O(E (ns​ + nt​) d) time, as each epoch computes one pass of gradient accumulation over all samples and updates d parameters. Memory is O((ns​ + nt​) d) to store the data plus O(d) for model parameters. Computing the proxy A-distance then adds only O(ns​ + nt​) time to evaluate the classifier on a holdout split. For importance-weighted ERM, density ratios estimated via the same domain classifier require O(E (ns​ + nt​) d). Training a weighted logistic regression label classifier on source data requires O(E' ns​ d). If you recompute weights each epoch, the total remains linear in the number of samples and features. With weight clipping and precomputed ratios, per-epoch cost is unchanged from unweighted training. The memory footprint is O(ns​ d + d) plus O(ns​) for storing weights. Variance considerations affect effective sample size rather than asymptotic complexity: extreme weights can make optimization noisy and slow to converge. Practically, using mini-batches maintains the same per-step cost O(b d) with batch size b, while often improving wall-clock performance. Regularization adds negligible overhead beyond O(d) per update. Overall, the presented procedures scale linearly with data size and dimension, making them tractable for medium-scale problems in C++ without external dependencies.

Code Examples

Estimating Proxy A-Distance (PAD) via a Logistic Domain Classifier
#include <bits/stdc++.h>
using namespace std;

struct LogisticRegression {
    vector<double> w; // weights
    double b;         // bias
    int d;            // feature dimension

    LogisticRegression(int d_) : w(d_, 0.0), b(0.0), d(d_) {
        // small random init for better symmetry breaking
        mt19937 rng(42);
        normal_distribution<double> nd(0.0, 0.01);
        for (int j = 0; j < d; ++j) w[j] = nd(rng);
        b = nd(rng);
    }

    static inline double sigmoid(double z) {
        if (z >= 0) {
            double ez = exp(-z);
            return 1.0 / (1.0 + ez);
        } else {
            double ez = exp(z);
            return ez / (1.0 + ez);
        }
    }

    double predict_proba(const vector<double>& x) const {
        double z = b;
        for (int j = 0; j < d; ++j) z += w[j] * x[j];
        return sigmoid(z);
    }

    int predict(const vector<double>& x, double thresh = 0.5) const {
        return predict_proba(x) >= thresh ? 1 : 0;
    }

    void fit(const vector<vector<double>>& X, const vector<int>& y,
             const vector<double>& weights, int epochs = 200, double lr = 0.1, double l2 = 1e-4) {
        int n = (int)X.size();
        double wsum = 0.0;
        for (double wi : weights) wsum += wi;
        if (wsum == 0) wsum = 1.0;
        for (int ep = 0; ep < epochs; ++ep) {
            vector<double> gw(d, 0.0);
            double gb = 0.0;
            for (int i = 0; i < n; ++i) {
                double p = predict_proba(X[i]);
                double err = (p - (double)y[i]) * weights[i]; // gradient of log-loss times weight
                for (int j = 0; j < d; ++j) gw[j] += err * X[i][j];
                gb += err;
            }
            // normalize by total weight to keep step sizes stable
            for (int j = 0; j < d; ++j) {
                gw[j] = gw[j] / wsum + l2 * w[j];
                w[j] -= lr * gw[j];
            }
            gb = gb / wsum;
            b -= lr * gb;
        }
    }
};

// Generate 2D Gaussian points with given mean
vector<vector<double>> gaussian_cloud(int n, const vector<double>& mu, double sigma = 1.0, unsigned seed = 123) {
    mt19937 rng(seed);
    normal_distribution<double> ndx(mu[0], sigma);
    normal_distribution<double> ndy(mu[1], sigma);
    vector<vector<double>> X(n, vector<double>(2));
    for (int i = 0; i < n; ++i) {
        X[i][0] = ndx(rng);
        X[i][1] = ndy(rng);
    }
    return X;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int nS = 2000, nT = 2000; // unlabeled source/target samples
    // Source around (0,0), target shifted along the x-axis
    auto XS = gaussian_cloud(nS, {0.0, 0.0}, 1.0, 7);
    auto XT = gaussian_cloud(nT, {1.5, 0.0}, 1.0, 13);

    // Build domain dataset: z=0 for source, z=1 for target
    vector<vector<double>> X;
    vector<int> z;
    X.reserve(nS + nT); z.reserve(nS + nT);
    for (auto& x : XS) { X.push_back(x); z.push_back(0); }
    for (auto& x : XT) { X.push_back(x); z.push_back(1); }

    // Shuffle and split into train/test for an unbiased error estimate
    vector<int> idx(X.size());
    iota(idx.begin(), idx.end(), 0);
    mt19937 rng(42);
    shuffle(idx.begin(), idx.end(), rng);

    int n = (int)X.size();
    int nTrain = (int)(0.7 * n);
    vector<vector<double>> Xtr, Xte;
    vector<int> ztr, zte;
    Xtr.reserve(nTrain); ztr.reserve(nTrain);
    Xte.reserve(n - nTrain); zte.reserve(n - nTrain);
    for (int i = 0; i < n; ++i) {
        if (i < nTrain) { Xtr.push_back(X[idx[i]]); ztr.push_back(z[idx[i]]); }
        else            { Xte.push_back(X[idx[i]]); zte.push_back(z[idx[i]]); }
    }

    // Train logistic regression as a domain classifier with uniform weights
    int d = 2;
    LogisticRegression clf(d);
    vector<double> uni_w(Xtr.size(), 1.0);
    clf.fit(Xtr, ztr, uni_w, /*epochs=*/400, /*lr=*/0.2, /*l2=*/1e-4);

    // Evaluate error on the test split
    int correct = 0;
    for (size_t i = 0; i < Xte.size(); ++i) {
        correct += (clf.predict(Xte[i]) == zte[i]);
    }
    double acc = (double)correct / (double)Xte.size();
    double err = 1.0 - acc;

    // Proxy A-distance: d_A = 2(1 - 2*error). Clamp to [0, 2].
    double dA = 2.0 * (1.0 - 2.0 * err);
    dA = max(0.0, min(2.0, dA));

    cout << fixed << setprecision(4);
    cout << "Domain classifier accuracy: " << acc << "\n";
    cout << "Proxy A-distance estimate d_A: " << dA << " (0=same, 2=very different)\n";

    return 0;
}

We synthesize 2D Gaussian source and target inputs with a mean shift and train a logistic regression to discriminate domains. The proxy A-distance is computed from the domain classifier’s holdout error: low error (easy discrimination) yields a large PAD, indicating substantial distributional difference. This implements the HΔH-based idea with a simple, estimable proxy.

Time: O(E (n_s + n_t) d) for E epochs of full-batch gradient descent; prediction is O((n_s + n_t) d).
Space: O((n_s + n_t) d) to store data and O(d) for parameters.

Importance-Weighted Logistic Regression under Covariate Shift
#include <bits/stdc++.h>
using namespace std;

struct LogisticRegression {
    vector<double> w;
    double b;
    int d;

    LogisticRegression(int d_) : w(d_, 0.0), b(0.0), d(d_) {
        mt19937 rng(1);
        normal_distribution<double> nd(0.0, 0.01);
        for (int j = 0; j < d; ++j) w[j] = nd(rng);
        b = nd(rng);
    }

    static inline double sigmoid(double z) {
        if (z >= 0) { double ez = exp(-z); return 1.0 / (1.0 + ez); }
        double ez = exp(z);
        return ez / (1.0 + ez);
    }

    double predict_proba(const vector<double>& x) const {
        double z = b;
        for (int j = 0; j < d; ++j) z += w[j] * x[j];
        return sigmoid(z);
    }

    int predict(const vector<double>& x, double thr = 0.5) const {
        return predict_proba(x) >= thr ? 1 : 0;
    }

    void fit(const vector<vector<double>>& X, const vector<int>& y, const vector<double>& weights,
             int epochs = 300, double lr = 0.1, double l2 = 1e-4) {
        int n = (int)X.size();
        double wsum = 0.0;
        for (double wi : weights) wsum += wi;
        if (wsum == 0) wsum = 1.0;
        for (int ep = 0; ep < epochs; ++ep) {
            vector<double> gw(d, 0.0);
            double gb = 0.0;
            for (int i = 0; i < n; ++i) {
                double p = predict_proba(X[i]);
                double err = (p - (double)y[i]) * weights[i];
                for (int j = 0; j < d; ++j) gw[j] += err * X[i][j];
                gb += err;
            }
            for (int j = 0; j < d; ++j) { gw[j] = gw[j] / wsum + l2 * w[j]; w[j] -= lr * gw[j]; }
            gb = gb / wsum;
            b -= lr * gb;
        }
    }
};

// Generate Gaussian X and labels with shared p(y|x) = sigmoid(wtrue . x)
struct Data { vector<vector<double>> X; vector<int> y; };

Data make_domain(int n, const vector<double>& mu, const vector<double>& wtrue, double sigma = 1.0, unsigned seed = 7) {
    int d = (int)mu.size();
    mt19937 rng(seed);
    vector<normal_distribution<double>> nd;
    nd.reserve(d);
    for (int j = 0; j < d; ++j) nd.emplace_back(mu[j], sigma);
    uniform_real_distribution<double> ur(0.0, 1.0);
    Data D;
    D.X.resize(n, vector<double>(d));
    D.y.resize(n);
    for (int i = 0; i < n; ++i) {
        double z = 0.0;
        for (int j = 0; j < d; ++j) { D.X[i][j] = nd[j](rng); z += wtrue[j] * D.X[i][j]; }
        double py1 = 1.0 / (1.0 + exp(-z));
        D.y[i] = (ur(rng) < py1) ? 1 : 0;
    }
    return D;
}

// Compute density ratio via domain-classifier odds, correcting for domain priors
vector<double> estimate_density_ratio(const vector<vector<double>>& XS, const vector<vector<double>>& XT) {
    int d = (int)XS[0].size();
    // Build domain dataset: z=0 for source, z=1 for target
    vector<vector<double>> X;
    vector<int> z;
    X.reserve(XS.size() + XT.size()); z.reserve(XS.size() + XT.size());
    for (auto& x : XS) { X.push_back(x); z.push_back(0); }
    for (auto& x : XT) { X.push_back(x); z.push_back(1); }
    // Shuffle and hold out 20% (the holdout could be used to validate the classifier)
    vector<int> idx(X.size());
    iota(idx.begin(), idx.end(), 0);
    mt19937 rng(123);
    shuffle(idx.begin(), idx.end(), rng);
    int nTrain = (int)(0.8 * idx.size());
    vector<vector<double>> Xtr;
    vector<int> ztr;
    for (int i = 0; i < nTrain; ++i) { Xtr.push_back(X[idx[i]]); ztr.push_back(z[idx[i]]); }
    // Train logistic domain classifier
    LogisticRegression dom(d);
    vector<double> wtr(Xtr.size(), 1.0);
    dom.fit(Xtr, ztr, wtr, 400, 0.2, 1e-4);
    // Ratios on source points: w(x) = [p(z=1|x)/p(z=0|x)] * (pi0/pi1).
    // The pool is imbalanced (more source than target samples), so the
    // prior correction pi0/pi1 is required for unbiased weights.
    double pi1 = (double)XT.size() / (double)(XS.size() + XT.size());
    double pi0 = 1.0 - pi1;
    vector<double> ratios;
    ratios.reserve(XS.size());
    for (auto& x : XS) {
        double p1 = dom.predict_proba(x);
        double p0 = 1.0 - p1;
        double r = (p0 > 1e-8) ? (p1 / p0) * (pi0 / pi1) : 1e8;
        ratios.push_back(r);
    }
    return ratios;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int nS = 4000, nT = 2000, d = 3;
    vector<double> wtrue = {1.0, -0.5, 0.8};
    // Covariate shift: different input means, same conditional p(y|x)
    auto S = make_domain(nS, /*muS=*/{0.0, 0.0, 0.0}, wtrue, 1.0, 5);
    auto T = make_domain(nT, /*muT=*/{1.0, -0.5, 0.5}, wtrue, 1.0, 11);

    // Baseline: train unweighted logistic regression on source
    LogisticRegression clf_unw(d);
    vector<double> ones(S.X.size(), 1.0);
    clf_unw.fit(S.X, S.y, ones, 400, 0.2, 1e-4);

    // Estimate density ratios w(x) ~ p_T(x)/p_S(x) via domain-classifier odds
    vector<double> ratios = estimate_density_ratio(S.X, T.X);
    // Clip ratios to control variance
    double clip_max = 10.0;
    for (double& r : ratios) {
        if (!isfinite(r)) r = clip_max;
        r = min(r, clip_max);
    }

    // Train importance-weighted logistic regression on source
    LogisticRegression clf_w(d);
    clf_w.fit(S.X, S.y, ratios, 400, 0.2, 1e-4);

    // Evaluate on target labels (available here for demonstration)
    auto eval = [&](const LogisticRegression& m) {
        int correct = 0;
        for (size_t i = 0; i < T.X.size(); ++i) correct += (m.predict(T.X[i]) == T.y[i]);
        return (double)correct / (double)T.X.size();
    };

    double acc_unw = eval(clf_unw);
    double acc_w = eval(clf_w);

    // Report effective sample size of the weighted training set
    double sumw = 0, sumw2 = 0;
    for (double r : ratios) { sumw += r; sumw2 += r * r; }
    double n_eff = (sumw * sumw) / (sumw2 + 1e-12);

    cout << fixed << setprecision(4);
    cout << "Target accuracy (unweighted): " << acc_unw << "\n";
    cout << "Target accuracy (importance-weighted): " << acc_w << "\n";
    cout << "Effective sample size of weights: " << n_eff << " / " << S.X.size() << "\n";

    return 0;
}

We simulate covariate shift: both domains share the same conditional p(y|x) but have different input means. A logistic domain classifier estimates density ratios via its predicted odds, corrected for the source/target priors of the pooled data and clipped to reduce variance. Training logistic regression on source with these weights approximates minimizing target risk. We report target accuracy and effective sample size to diagnose stability.

Time: O(E_d (n_s + n_t) d) to fit the domain classifier for ratios plus O(E_h n_s d) to fit the weighted classifier.
Space: O((n_s + n_t) d) for data, O(d) for each model, and O(n_s) for stored weights.
Tags: transfer learning · domain adaptation · HΔH-divergence · proxy A-distance · importance weighting · covariate shift · density ratio · joint error lambda · Rademacher complexity · logistic regression · distribution shift · generalization bound · weight clipping · effective sample size · domain classifier