
Transfer Learning Theory

Key Points

  • Transfer learning theory studies when and why a model trained on a source distribution will work on a different target distribution.
  • Domain adaptation bounds decompose target error into source error, a distribution-divergence term, and an irreducible joint-error term.
  • The HΔH-divergence measures how differently two domains make hypotheses in a class disagree; it can be estimated via a domain classifier (proxy A-distance).
  • Under covariate shift (same conditional p(y|x), different p(x)), importance weighting makes source risk match target risk in expectation.
  • Large divergence or large joint-optimum error λ makes transfer hard even if source error is small.
  • Practical estimation uses holdout sets, regularized classifiers, and weight clipping to control variance and overfitting.
  • Feature learning that reduces domain divergence while keeping label prediction accurate is the core of many successful methods.
  • C++ implementations can compute proxy A-distance and perform importance-weighted logistic regression to realize the theory.

Prerequisites

  • →Probability and Random Variables — Understanding distributions, expectations, and conditional probabilities is essential for defining risks and shifts like p(y|x) and p(x).
  • →Statistical Learning Theory Basics — Concepts like hypothesis classes, loss, and generalization bounds underpin transfer learning theory.
  • →Logistic Regression — Used as a simple classifier for both domain discrimination and label prediction in the implementations.
  • →Gradient Descent and Regularization — Required to train logistic models efficiently and stably on finite data.
  • →Density Ratio Estimation — Core to importance weighting under covariate shift; understanding odds-ratio tricks helps implement robust estimators.
  • →Concentration Inequalities — Needed to interpret finite-sample deviations in divergence and risk estimates.
  • →Representation Learning — Many adaptation methods reduce divergence via learned features; knowing this aids practical application.

Detailed Explanation


01 Overview

Transfer learning theory provides mathematical tools to understand when knowledge from a source domain (data distribution) can help on a different target domain. In supervised learning, we usually assume train and test data come from the same distribution; transfer learning breaks this assumption. Domain adaptation—one key setting—assumes labels are scarce or absent in the target domain. The central question becomes: how does the target error of a hypothesis relate to quantities we can measure on the source domain and between domains? Classic results show that target risk can be bounded by source risk plus a measure of domain discrepancy and a term capturing how well the hypothesis class can do on both domains simultaneously. These components guide algorithm design: reduce source error, shrink divergence, and choose a hypothesis class with small joint error. In practice, we estimate divergence by training a classifier to discriminate source from target (proxy A-distance), and we reduce effective divergence by learning domain-invariant features or by reweighting source samples to look like target (importance weighting under covariate shift). The theory also warns us: if domains are too different or labels change across domains, transfer may fail regardless of how well you perform on the source. This perspective yields principled diagnostics and algorithms that align with empirical success in modern deep transfer learning.

02 Intuition & Analogies

Imagine you learned to ride a bike in your quiet neighborhood (source). Now you try biking in a busy city (target). Your skills (model) mostly transfer, but some habits fail because the environment differs: more traffic, different rules. Two things matter: (1) how good your biking is in your neighborhood (source error), and (2) how different the city is from your neighborhood (domain divergence). If the city’s rules for balance and pedaling (labeling function) are the same but the traffic density (input distribution) is different, you can adapt by practicing more in conditions that mimic the city—like riding during rush hour (importance weighting). But if the city expects you to ride on the left side (label semantics changed), your old habits may hurt—the best you can do using your current skills is still bad (the joint-optimum error λ is large). Now think of a model that predicts whether an email is spam. If you train on emails from Company A but deploy at Company B, the words people use differ (p(x) changes), even if what counts as spam (p(y|x)) is the same. If you reweight Company A’s emails to emphasize those that look like Company B’s emails, your training will better reflect deployment. To quantify how different the companies’ email styles are, train a domain classifier to tell which company an email came from. If it easily separates them, the domains are far apart; if it struggles, they are similar. The theory says your target error is bounded by source error plus a term that grows with this separability and a term reflecting whether any model in your class can do well on both companies. This sandwich view—performance, difference, and feasibility—provides a map for transfer.

03 Formal Definition

Let \(S\) and \(T\) be distributions over \(X \times Y\) with input marginals \(S_X\) and \(T_X\). For a hypothesis \(h: X \to \{0,1\}\) and 0–1 loss, define the source and target risks as \(\epsilon_S(h) = \Pr_{(x,y)\sim S}[h(x) \neq y]\) and \(\epsilon_T(h) = \Pr_{(x,y)\sim T}[h(x) \neq y]\). For a hypothesis class \(H\), the symmetric-difference hypothesis class is \(H\Delta H = \{x \mapsto h(x) \oplus h'(x) : h, h' \in H\}\). The \(H\Delta H\)-divergence between \(S_X\) and \(T_X\) is \(d_{H\Delta H}(S_X, T_X) = 2 \sup_{g \in H\Delta H} \lvert \Pr_{x\sim S_X}[g(x)=1] - \Pr_{x\sim T_X}[g(x)=1] \rvert\). A central bound (Ben-David et al.) states that for any \(h \in H\), \(\epsilon_T(h) \leq \epsilon_S(h) + \frac{1}{2} d_{H\Delta H}(S_X, T_X) + \lambda\), where \(\lambda = \min_{h \in H} [\epsilon_S(h) + \epsilon_T(h)]\) is the joint-optimum error of \(H\). The divergence term captures how differently hypotheses in \(H\) can disagree across the two input distributions, and \(\lambda\) captures label/conditional shift unattainable by \(H\). Under covariate shift, we assume \(p_S(y \mid x) = p_T(y \mid x)\) while \(p_S(x) \neq p_T(x)\). Then for any loss \(\ell\), the target risk equals a weighted source risk: \(E_T[\ell(h(x),y)] = E_S[w(x)\,\ell(h(x),y)]\) with density ratio \(w(x) = p_T(x)/p_S(x)\). These definitions enable empirical procedures: estimate divergence via a domain classifier (proxy A-distance) and estimate density ratios to perform importance-weighted ERM.

04 When to Use

  • Unlabeled or sparsely labeled target domain: When you have labeled source data but little to no labeled target data, domain adaptation bounds clarify what is and is not possible and guide choices like feature alignment or reweighting.
  • Covariate shift: If the labeling mechanism is stable (same p(y|x)) but the input distribution changes (e.g., training on daytime images, testing at dusk), importance weighting is appropriate and unbiased in expectation.
  • Model selection under shift: Use proxy A-distance to compare candidate feature extractors; prefer representations that reduce divergence without harming source accuracy.
  • Risk diagnostics: If your source error is low but target performance is poor, compute a divergence estimate. A high divergence suggests data/feature mismatch; a low divergence with poor target performance suggests large λ (e.g., label shift or insufficient hypothesis class).
  • Limited compute: When deep adversarial alignment is impractical, light-weight methods like logistic-regression-based proxy A-distance and density ratio weighting can produce strong baselines.
  • Safety-critical deployment: Bounds can motivate collecting a minimal set of labeled target points to estimate λ or to validate assumptions before deployment.

⚠️Common Mistakes

  • Confusing types of shift: Applying covariate-shift importance weighting when label shift (p(y) changes, p(x|y) stable) or concept drift (p(y|x) changes) holds leads to biased training. Diagnose the type of shift first.
  • Overfitting the domain classifier: A too-powerful, unregularized classifier on small data can overestimate divergence. Use holdout validation, regularization, and report uncertainty.
  • Ignoring the joint-optimum term λ: A small divergence does not guarantee good transfer if no hypothesis in your class fits both domains. Consider enriching the hypothesis class or collecting a few target labels.
  • Unstable density-ratio estimates: Ratios can explode in regions where p_S(x) is tiny. Use weight clipping, regularization, and monitor effective sample size to control variance.
  • Data leakage: Accidentally using target labels during feature selection or divergence estimation biases results. Keep target labels isolated unless explicitly allowed.
  • Miscalibrated priors in ratio estimation: When estimating density ratios via a domain classifier, forgetting to account for class priors (imbalance of source vs target samples) biases weights. Balance or correct for priors.
  • Treating proxy A-distance as absolute truth: It is an estimator tied to a hypothesis class and finite samples; report confidence intervals and use it comparatively across methods, not as a sole decision rule.

Key Formulas

Source and Target Risk

\(\epsilon_S(h) = \Pr_{(x,y)\sim S}[h(x) \neq y], \qquad \epsilon_T(h) = \Pr_{(x,y)\sim T}[h(x) \neq y]\)

Explanation: These define the expected 0–1 loss (misclassification rates) of hypothesis h on the source and target distributions. They are the central quantities transfer learning theory aims to relate.

HΔH-Divergence

\(d_{H\Delta H}(S_X, T_X) = 2 \sup_{h,h' \in H} \left\lvert \Pr_{x\sim S_X}[h(x) \neq h'(x)] - \Pr_{x\sim T_X}[h(x) \neq h'(x)] \right\rvert\)

Explanation: This measures how much hypotheses in H can disagree differently across the two domains. Larger values indicate domains are easier to tell apart using functions from H.

Ben-David Adaptation Bound

\(\epsilon_T(h) \leq \epsilon_S(h) + \frac{1}{2} d_{H\Delta H}(S_X, T_X) + \lambda, \qquad \lambda = \min_{h \in H}\left[\epsilon_S(h) + \epsilon_T(h)\right]\)

Explanation: The target error is at most the source error plus half the HΔH-divergence and the best joint error achievable by the hypothesis class. It explains why reducing divergence and choosing a good class are both crucial.

Proxy A-Distance (PAD)

\(\hat d_A = 2(1 - 2\hat\epsilon)\)

Explanation: Given the error \(\hat ϵ\) of a domain classifier on a holdout set, the PAD estimates the HΔH-divergence up to constants. Low domain-classifier accuracy implies small divergence (domains look similar).

Importance Weighting Identity

\(E_T[\ell(h(x), y)] = E_S\!\left[\frac{p_T(x)}{p_S(x)}\, \ell(h(x), y)\right]\)

Explanation: Under covariate shift (same conditional p(y|x)), the target risk equals a weighted source risk. This justifies training with weights equal to the density ratio.

Density Ratio via Domain Classifier

\(w(x) = \frac{p_T(x)}{p_S(x)} \approx \frac{\hat p(z=1 \mid x)}{\hat p(z=0 \mid x)} \cdot \frac{\pi_0}{\pi_1}\)

Explanation: If a logistic classifier predicts domain label z with priors \(π0​,π1​\) for source/target, then the ratio of predicted odds, adjusted by priors, approximates the density ratio needed for importance weighting.

Effective Sample Size

\(n_{\mathrm{eff}} = \frac{\left(\sum_{i=1}^{n} w_i\right)^2}{\sum_{i=1}^{n} w_i^2}\)

Explanation: This quantifies how many samples effectively contribute when using weights. A small value indicates high variance due to a few large weights; clipping or regularization may be needed.

Finite-Sample Correction (Informal)

\(\hat d_{H\Delta H} \leq \hat d_A + O\!\left(\sqrt{\frac{\log(1/\delta)}{n}}\right)\)

Explanation: Empirical divergence estimates deviate from the population value by a sample-complexity term. With more unlabeled samples, the estimate concentrates around the true divergence.

Rademacher Complexity

\(R_n(H) = E_{\sigma, X}\!\left[\sup_{h \in H} \frac{1}{n} \sum_{i=1}^{n} \sigma_i\, h(x_i)\right]\)

Explanation: This capacity measure often appears in sharper domain adaptation bounds. Smaller Rademacher complexity implies better generalization from empirical to population quantities.

Time Complexity Notation Example

\(O(n \log n)\)

Explanation: Big-O characterizes asymptotic runtime. For instance, O(n log n) means time grows roughly proportional to n times the logarithm of n.

Complexity Analysis

Let ns​, nt​ be the number of source and target samples and d the feature dimension. Training a logistic regression domain classifier with full-batch gradient descent for E epochs costs O(E (ns​ + nt​) d) time, as each epoch computes one pass of gradient accumulation over all samples and updates d parameters. Memory is O((ns​ + nt​) d) to store the data plus O(d) for model parameters. Computing the proxy A-distance then adds only O(ns​ + nt​) time to evaluate the classifier on a holdout split. For importance-weighted ERM, density ratios estimated via the same domain classifier require O(E (ns​ + nt​) d). Training a weighted logistic regression label classifier on source data requires O(E' ns​ d). If you recompute weights each epoch, the total remains linear in the number of samples and features. With weight clipping and precomputed ratios, per-epoch cost is unchanged from unweighted training. The memory footprint is O(ns​ d + d) plus O(ns​) for storing weights. Variance considerations affect effective sample size rather than asymptotic complexity: extreme weights can make optimization noisy and slow to converge. Practically, using mini-batches maintains the same per-step cost O(b d) with batch size b, while often improving wall-clock performance. Regularization adds negligible overhead beyond O(d) per update. Overall, the presented procedures scale linearly with data size and dimension, making them tractable for medium-scale problems in C++ without external dependencies.

Code Examples

Estimating Proxy A-Distance (PAD) via a Logistic Domain Classifier
#include <bits/stdc++.h>
using namespace std;

struct LogisticRegression {
    vector<double> w; // weights
    double b;         // bias
    int d;            // feature dimension

    LogisticRegression(int d_) : w(d_, 0.0), b(0.0), d(d_) {
        // small random init for better symmetry breaking
        mt19937 rng(42);
        normal_distribution<double> nd(0.0, 0.01);
        for (int j = 0; j < d; ++j) w[j] = nd(rng);
        b = nd(rng);
    }

    static inline double sigmoid(double z) {
        if (z >= 0) {
            double ez = exp(-z);
            return 1.0 / (1.0 + ez);
        } else {
            double ez = exp(z);
            return ez / (1.0 + ez);
        }
    }

    double predict_proba(const vector<double>& x) const {
        double z = b;
        for (int j = 0; j < d; ++j) z += w[j] * x[j];
        return sigmoid(z);
    }

    int predict(const vector<double>& x, double thresh = 0.5) const {
        return predict_proba(x) >= thresh ? 1 : 0;
    }

    void fit(const vector<vector<double>>& X, const vector<int>& y,
             const vector<double>& weights, int epochs = 200, double lr = 0.1, double l2 = 1e-4) {
        int n = (int)X.size();
        double wsum = 0.0;
        for (double wi : weights) wsum += wi;
        if (wsum == 0) wsum = 1.0;
        for (int ep = 0; ep < epochs; ++ep) {
            vector<double> gw(d, 0.0);
            double gb = 0.0;
            for (int i = 0; i < n; ++i) {
                double p = predict_proba(X[i]);
                double err = (p - (double)y[i]) * weights[i]; // gradient of log-loss times weight
                for (int j = 0; j < d; ++j) gw[j] += err * X[i][j];
                gb += err;
            }
            // normalize by total weight to keep step sizes stable
            for (int j = 0; j < d; ++j) {
                gw[j] = gw[j] / wsum + l2 * w[j];
                w[j] -= lr * gw[j];
            }
            gb = gb / wsum;
            b -= lr * gb;
        }
    }
};

// Generate 2D Gaussian points with given mean
vector<vector<double>> gaussian_cloud(int n, const vector<double>& mu, double sigma = 1.0, unsigned seed = 123) {
    mt19937 rng(seed);
    normal_distribution<double> ndx(mu[0], sigma);
    normal_distribution<double> ndy(mu[1], sigma);
    vector<vector<double>> X(n, vector<double>(2));
    for (int i = 0; i < n; ++i) {
        X[i][0] = ndx(rng);
        X[i][1] = ndy(rng);
    }
    return X;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int nS = 2000, nT = 2000; // unlabeled source/target samples
    // Source around (0,0), target shifted along the x-axis
    auto XS = gaussian_cloud(nS, {0.0, 0.0}, 1.0, 7);
    auto XT = gaussian_cloud(nT, {1.5, 0.0}, 1.0, 13);

    // Build domain dataset: z=0 for source, z=1 for target
    vector<vector<double>> X;
    vector<int> z;
    X.reserve(nS + nT); z.reserve(nS + nT);
    for (auto& x : XS) { X.push_back(x); z.push_back(0); }
    for (auto& x : XT) { X.push_back(x); z.push_back(1); }

    // Shuffle and split into train/test for an unbiased error estimate
    vector<int> idx(X.size());
    iota(idx.begin(), idx.end(), 0);
    mt19937 rng(42);
    shuffle(idx.begin(), idx.end(), rng);

    int n = (int)X.size();
    int nTrain = (int)(0.7 * n);
    vector<vector<double>> Xtr, Xte;
    vector<int> ztr, zte;
    Xtr.reserve(nTrain); ztr.reserve(nTrain);
    Xte.reserve(n - nTrain); zte.reserve(n - nTrain);
    for (int i = 0; i < n; ++i) {
        if (i < nTrain) { Xtr.push_back(X[idx[i]]); ztr.push_back(z[idx[i]]); }
        else            { Xte.push_back(X[idx[i]]); zte.push_back(z[idx[i]]); }
    }

    // Train logistic regression as a domain classifier with uniform weights
    int d = 2;
    LogisticRegression clf(d);
    vector<double> uni_w(Xtr.size(), 1.0);
    clf.fit(Xtr, ztr, uni_w, /*epochs=*/400, /*lr=*/0.2, /*l2=*/1e-4);

    // Evaluate error on the test split
    int correct = 0;
    for (size_t i = 0; i < Xte.size(); ++i) {
        correct += (clf.predict(Xte[i]) == zte[i]);
    }
    double acc = (double)correct / (double)Xte.size();
    double err = 1.0 - acc;

    // Proxy A-distance: d_A = 2(1 - 2*error). Clamp to [0, 2].
    double dA = 2.0 * (1.0 - 2.0 * err);
    dA = max(0.0, min(2.0, dA));

    cout << fixed << setprecision(4);
    cout << "Domain classifier accuracy: " << acc << "\n";
    cout << "Proxy A-distance estimate d_A: " << dA << " (0=same, 2=very different)\n";

    return 0;
}

We synthesize 2D Gaussian source and target inputs with a mean shift and train a logistic regression to discriminate domains. The proxy A-distance is computed from the domain classifier’s holdout error: low error (easy discrimination) yields a large PAD, indicating substantial distributional difference. This implements the HΔH-based idea with a simple, estimable proxy.

Time: O(E (n_s + n_t) d) for E epochs of full-batch gradient descent; prediction is O((n_s + n_t) d).
Space: O((n_s + n_t) d) to store data and O(d) for parameters.

Importance-Weighted Logistic Regression under Covariate Shift
#include <bits/stdc++.h>
using namespace std;

struct LogisticRegression {
    vector<double> w;
    double b;
    int d;

    LogisticRegression(int d_) : w(d_, 0.0), b(0.0), d(d_) {
        mt19937 rng(1);
        normal_distribution<double> nd(0.0, 0.01);
        for (int j = 0; j < d; ++j) w[j] = nd(rng);
        b = nd(rng);
    }

    static inline double sigmoid(double z) {
        if (z >= 0) { double ez = exp(-z); return 1.0 / (1.0 + ez); }
        double ez = exp(z);
        return ez / (1.0 + ez);
    }

    double predict_proba(const vector<double>& x) const {
        double z = b;
        for (int j = 0; j < d; ++j) z += w[j] * x[j];
        return sigmoid(z);
    }

    int predict(const vector<double>& x, double thr = 0.5) const {
        return predict_proba(x) >= thr ? 1 : 0;
    }

    void fit(const vector<vector<double>>& X, const vector<int>& y, const vector<double>& weights,
             int epochs = 300, double lr = 0.1, double l2 = 1e-4) {
        int n = (int)X.size();
        double wsum = 0.0;
        for (double wi : weights) wsum += wi;
        if (wsum == 0) wsum = 1.0;
        for (int ep = 0; ep < epochs; ++ep) {
            vector<double> gw(d, 0.0);
            double gb = 0.0;
            for (int i = 0; i < n; ++i) {
                double p = predict_proba(X[i]);
                double err = (p - (double)y[i]) * weights[i];
                for (int j = 0; j < d; ++j) gw[j] += err * X[i][j];
                gb += err;
            }
            for (int j = 0; j < d; ++j) { gw[j] = gw[j] / wsum + l2 * w[j]; w[j] -= lr * gw[j]; }
            gb = gb / wsum;
            b -= lr * gb;
        }
    }
};

// Generate Gaussian X and labels with shared p(y|x) = sigmoid(wtrue . x)
struct Data { vector<vector<double>> X; vector<int> y; };

Data make_domain(int n, const vector<double>& mu, const vector<double>& wtrue, double sigma = 1.0, unsigned seed = 7) {
    int d = (int)mu.size();
    mt19937 rng(seed);
    vector<normal_distribution<double>> nd;
    nd.reserve(d);
    for (int j = 0; j < d; ++j) nd.emplace_back(mu[j], sigma);
    uniform_real_distribution<double> ur(0.0, 1.0);
    Data D;
    D.X.resize(n, vector<double>(d));
    D.y.resize(n);
    for (int i = 0; i < n; ++i) {
        double z = 0.0;
        for (int j = 0; j < d; ++j) { D.X[i][j] = nd[j](rng); z += wtrue[j] * D.X[i][j]; }
        double py1 = 1.0 / (1.0 + exp(-z));
        D.y[i] = (ur(rng) < py1) ? 1 : 0;
    }
    return D;
}

// Compute density ratio via domain-classifier odds, correcting for domain priors
vector<double> estimate_density_ratio(const vector<vector<double>>& XS, const vector<vector<double>>& XT) {
    int d = (int)XS[0].size();
    // Build domain dataset: z=0 for source, z=1 for target
    vector<vector<double>> X;
    vector<int> z;
    X.reserve(XS.size() + XT.size()); z.reserve(XS.size() + XT.size());
    for (auto& x : XS) { X.push_back(x); z.push_back(0); }
    for (auto& x : XT) { X.push_back(x); z.push_back(1); }
    // Shuffle and hold out 20% (the holdout could be used to validate the classifier)
    vector<int> idx(X.size());
    iota(idx.begin(), idx.end(), 0);
    mt19937 rng(123);
    shuffle(idx.begin(), idx.end(), rng);
    int nTrain = (int)(0.8 * idx.size());
    vector<vector<double>> Xtr;
    vector<int> ztr;
    for (int i = 0; i < nTrain; ++i) { Xtr.push_back(X[idx[i]]); ztr.push_back(z[idx[i]]); }
    // Train logistic domain classifier
    LogisticRegression dom(d);
    vector<double> wtr(Xtr.size(), 1.0);
    dom.fit(Xtr, ztr, wtr, 400, 0.2, 1e-4);
    // Ratios on source points: w(x) = [p(z=1|x)/p(z=0|x)] * (pi0/pi1).
    // The pool is imbalanced (more source than target samples), so the
    // prior correction pi0/pi1 is required for unbiased weights.
    double pi1 = (double)XT.size() / (double)(XS.size() + XT.size());
    double pi0 = 1.0 - pi1;
    vector<double> ratios;
    ratios.reserve(XS.size());
    for (auto& x : XS) {
        double p1 = dom.predict_proba(x);
        double p0 = 1.0 - p1;
        double r = (p0 > 1e-8) ? (p1 / p0) * (pi0 / pi1) : 1e8;
        ratios.push_back(r);
    }
    return ratios;
}

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int nS = 4000, nT = 2000, d = 3;
    vector<double> wtrue = {1.0, -0.5, 0.8};
    // Covariate shift: different input means, same conditional p(y|x)
    auto S = make_domain(nS, /*muS=*/{0.0, 0.0, 0.0}, wtrue, 1.0, 5);
    auto T = make_domain(nT, /*muT=*/{1.0, -0.5, 0.5}, wtrue, 1.0, 11);

    // Baseline: train unweighted logistic regression on source
    LogisticRegression clf_unw(d);
    vector<double> ones(S.X.size(), 1.0);
    clf_unw.fit(S.X, S.y, ones, 400, 0.2, 1e-4);

    // Estimate density ratios w(x) ~ p_T(x)/p_S(x) via domain-classifier odds
    vector<double> ratios = estimate_density_ratio(S.X, T.X);
    // Clip ratios to control variance
    double clip_max = 10.0;
    for (double& r : ratios) {
        if (!isfinite(r)) r = clip_max;
        r = min(r, clip_max);
    }

    // Train importance-weighted logistic regression on source
    LogisticRegression clf_w(d);
    clf_w.fit(S.X, S.y, ratios, 400, 0.2, 1e-4);

    // Evaluate on target labels (available here for demonstration)
    auto eval = [&](const LogisticRegression& m) {
        int correct = 0;
        for (size_t i = 0; i < T.X.size(); ++i) correct += (m.predict(T.X[i]) == T.y[i]);
        return (double)correct / (double)T.X.size();
    };

    double acc_unw = eval(clf_unw);
    double acc_w = eval(clf_w);

    // Report effective sample size of the weighted training set
    double sumw = 0, sumw2 = 0;
    for (double r : ratios) { sumw += r; sumw2 += r * r; }
    double n_eff = (sumw * sumw) / (sumw2 + 1e-12);

    cout << fixed << setprecision(4);
    cout << "Target accuracy (unweighted): " << acc_unw << "\n";
    cout << "Target accuracy (importance-weighted): " << acc_w << "\n";
    cout << "Effective sample size of weights: " << n_eff << " / " << S.X.size() << "\n";

    return 0;
}

We simulate covariate shift: both domains share the same conditional p(y|x) but have different input means. A logistic domain classifier estimates density ratios via its predicted odds, corrected for the source/target priors of the pooled data and clipped to reduce variance. Training logistic regression on source with these weights approximates minimizing target risk. We report target accuracy and effective sample size to diagnose stability.

Time: O(E_d (n_s + n_t) d) to fit the domain classifier for ratios plus O(E_h n_s d) to fit the weighted classifier.
Space: O((n_s + n_t) d) for data, O(d) for each model, and O(n_s) for stored weights.
Tags: transfer learning · domain adaptation · HΔH-divergence · proxy A-distance · importance weighting · covariate shift · density ratio · joint error lambda · Rademacher complexity · logistic regression · distribution shift · generalization bound · weight clipping · effective sample size · domain classifier