Replication package for “Dealing with censored earnings in register data”

authored by Mattis Beckmannshagen, Johannes König, Isabella Retter, Christian Schluter, Carsten Schröder, Yogam Tchokni, published in Journal of Economics and Statistics, 2025. https://doi.org/10.1515/jbnst-2024-0037

Software

The core functionality of the beyondpareto command used for imputation of the censored right tail is described in the paper König, Schluter, Schröder, Retter, and Beckmannshagen (2025), “The beyondpareto command for optimal extreme value index estimation,” The Stata Journal, 25(1) May 2025. SJ 2025.
Installation: The package can be downloaded from The Stata Journal as per usual, net sj 25-1.

Version control

Typing which beyondpareto yields

*! beyondpareto.ado, version 1.0, 2024-06-14
*! Christian Schluter and Isabella Retter
*! To learn more about beyondpareto see the vignette at
*! https://christianschluter.github.io/beyondpareto/

Data accessibility

The data set used here is the German Socio-Economic Panel. We are not allowed to share the data. However, it is available, free of charge, to researchers worldwide, upon entering into a data contract, see details. The full replication code (data generation, all computations) is deposited in the Reanalysis archive of the SOEP Research Data Center (SOEP-RDC).

2.1 In the lab: A parametric earnings distribution

GB2fit to SOEP data (men, west, 2018)

We illustrate the main procedure (and a proof of concept and performance evidence) using artificial data. So that the model be realistic, we fit the parametric model to monthly earnings data taken from the SOEP for male workers in West Germany in 2018, stored in the variable inc_soep. Specifically, we fit the data using a highly flexible distribution, the generalized beta distribution of the second kind (GB2), with the density function \[ GB2(y;a,b,p,q)={\frac {|a|y^{ap-1}}{b^{ap}B(p,q)(1+(y/b)^{a})^{p+q}}}, \quad \quad \quad (y,a,b,p,q >0).\]

Code

. gb2fit inc_soep , svy stats from(4 3500 0.7 0.8, copy) pdf(pdfgb2) poorfrac(6370)

Initial:      Log pseudolikelihood = -1.068e+08
Rescale:      Log pseudolikelihood = -1.068e+08
Rescale eq:   Log pseudolikelihood = -1.068e+08
Iteration 0:  Log pseudolikelihood = -1.068e+08  
Iteration 1:  Log pseudolikelihood = -1.067e+08  
Iteration 2:  Log pseudolikelihood = -1.067e+08  
Iteration 3:  Log pseudolikelihood = -1.067e+08  
Iteration 4:  Log pseudolikelihood = -1.067e+08  
Iteration 5:  Log pseudolikelihood = -1.067e+08  

ML fit of GB2 distribution

pweight:  phrf                                    Number of obs    =      4444
Strata:   <one>                                   Number of strata =         1
PSU:      <observations>                          Number of PSUs   =      4444
                                                  Population size  =  11939209
                                                  F(   0,   4444)  =         .
                                                  Prob > F         =         .

─────────────┬────────────────────────────────────────────────────────────────
    inc_soep │ Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
a            │
       _cons │   4.037141   .6002606     6.73   0.000     2.860331     5.21395
─────────────┼────────────────────────────────────────────────────────────────
b            │
       _cons │   3633.295   135.1821    26.88   0.000     3368.271    3898.319
─────────────┼────────────────────────────────────────────────────────────────
p            │
       _cons │   .7572215   .1397604     5.42   0.000     .4832216    1.031222
─────────────┼────────────────────────────────────────────────────────────────
q            │
       _cons │   .8250917   .1767598     4.67   0.000     .4785543    1.171629
─────────────┴────────────────────────────────────────────────────────────────
 
Fraction with  inc_soep < 6370 =   0.88780
 
────────────────────────────────────────────────────────────
     Quantiles       Cumulative shares of
                     total inc_soep (Lorenz ordinates)
────────────────────────────────────────────────────────────
 1%  849.49997       0.00158
 5%   1.44e+03       0.01343
10%   1.83e+03       0.03382
20%   2.34e+03       0.08576
25%   2.55e+03       0.11605
30%   2.75e+03       0.14885
40%   3.13e+03       0.22158 Mode         3.02e+03
50%   3.52e+03       0.30374 Mean         4.04e+03
60%   3.95e+03       0.39598 Std. Dev.    2.66e+03
70%   4.48e+03       0.50000
75%   4.82e+03       0.55744 Variance     7.08e+06
80%   5.23e+03       0.61945 Half CV^2     0.21646
90%   6.61e+03       0.76357
95%   8.23e+03       0.85389 p90/p10       3.61841
99%   1.34e+04       0.95251 p75/p25       1.88657
────────────────────────────────────────────────────────────

.         
. twoway  (hist inc_soep [fw=int(phrf)] if inc_soep<20000, width(250) lcol(black) fcol(none)) (connected  pdfgb2 inc_soe
> p if inc_soep<20000, msize(vtiny) mc(red) lc(red) msymbol(circle_hollow)), xline(6370, lc(gs10) lp(dash)) legend(off) 
> xtitle(SOEP earnings) ytitle(Probability density function) name(male, replace) xscale(range(0 17500)) xlabel(0 (5000) 
> 15000) yscale(range(0 0.0004)) ylabel(0 (.0001) .0004)

Figure 1: GB2-fit for earnings of West German men in 2018

We then use the fitted GB2 model as our data generating process (DGP), with true tail parameter \(\gamma = .304\) and an implied true earnings share for the top 1% of 4.8%.

Figure 2 illustrates the estimator of \(\gamma\) for one such random sample using beyondpareto. Panel (A) depicts the Pareto QQ-plot for the upper earnings tail and shows its approximate linearity. We fit a straight line with slope \(\widehat{\gamma}\); the vertical dashed line depicts the value corresponding to the optimally determined \(Y_{base}\). In panel (B), we depict \(\widehat{\gamma}\) as a function of the upper order statistics \(k\), and the optimal \(k^*\) corresponds to \(Y_{base}\). Beyond this \(k^*\), the estimator is increasingly distorted.

Code

. clear

. set seed 330033

. local a = 3633.295

. local b = 4.037

. local p = 0.757

. local q = 0.825

. qui set obs 10000

. qui gen double income =  `a'*( (1/invibeta(`p',`q',runiform()))-1  )^(-1/`b')

. beyondpareto income, plot(pareto) nrange(1000,3000)
No sampling weights given. Using w=1.
Considering the top 3000 of 10000 values, starting from the first 1000 and testing 2001 values for k base.

──────────┬───────────────────────────────────────────────────────────
  Results │      Ybase       gamma        S.E.  [95% Conf.   Interval]
──────────┼───────────────────────────────────────────────────────────
   income │  5810.3139   .32328834    .0091105    .3054318    .3411449
──────────┴───────────────────────────────────────────────────────────
Optimal k base: 1574

. gr_edit title.text.Arrpush A: Pareto QQ-plot

. qui beyondpareto income, plot(gamma) nrange(1000,3000)

. gr_edit title.text.Arrpush B: Extreme value index

. graph combine QQ gamma

. graph display, xsize(12) ysize(5)

. graph export gb2_plots.png, replace
file gb2_plots.png saved as PNG format

Figure 2: Diagnostic plots of beyondpareto in GB2 example — Figure 2: Diagnostic plots of `beyondpareto` in GB2 example

Monte Carlo simulation

Is the good fit for this one synthetic distribution just a coincidence? To answer this question, we perform a Monte Carlo study, generating 1,000 synthetic distributions of size 10,000. For each synthetic distribution, we estimate the extreme value index γ. For this task, the values for the parameters a,b,p,q that were estimated above,

\(a = 4.037,\ b =3633.295, \ p =0.757, \ q =0.825,\)

have to be inserted into beyondpareto_GB2sim.ado which is distributed with this replication code.

Bottom line: In this lab setting, beyondpareto performs very well: the estimated values are very close to the true population values and the empirical standard deviations across simulations suggest tight estimates.

3 Data

Comparability of earnings in SOEP and register data

Figure 3 is a scatter plot to show that insurance accounts (IA) earnings and self-reported SOEP earnings are highly correlated. If high correlation below the censoring threshold is given, we can use survey earnings as a good benchmark also for (imputed) earnings above the censoring threshold. As an example, we plot the earnings in the year 2018 for record- linked men in West Germany with IA earnings below the assessment ceiling of €6,370. The black line is the 45-degree bisector, the red functional form shows the result of a locally weighted regression—and thus a highly flexible form. The figure shows that both earnings variables are highly correlated with a correlation coefficient \(\rho\) of 0.92.

Figure 3: Scatter Plot of IA and SOEP earnings below the censoring threshold — Figure 3: Scatter Plot of IA and SOEP earnings *below* the censoring threshold

Bottom line: The two earnings concepts are highly correlated. SOEP-RV can thus be considered an ideal laboratory to assess the performance of imputation methods.

4 From the lab to the field

Our benchmarking of the beyondpareto approach relies on a comparison of three scenarios, illustrated in Figure 4:

The left-hand panel of Figure 4 shows the distribution of uncensored SOEP earnings. In scenario 1, the target scenario, beyondpareto determines the optimal \(Y_{base}\) within the whole range depicted by the shaded areas A and B. In the restricted scenario, beyondpareto will only consider area A, below the censoring threshold. The right-hand panel show the artificially censored SOEP earnings, including the mass point at the censoring threshold. The area under consideration for beyondpareto in the censored setting is again displayed by the shaded area.

4.1 Top earnings: The SOEP sample

We now apply beyondpareto in the three described scenarios for the case of 2018, based on the full SOEP sample. In the paper, we also consider the years 2014-2017 and 2019.

Code

. use ${temp}soep_west.dta, clear
(SOEP-Core, v38.1 (EU Edition), doi:10.5684/soep.core.v38.1eu)

. keep if JAHR ==2018 & syear==2018 // only use 2018
(57,313 observations deleted)

. keep if sex==1 // focus on men
(4,227 observations deleted)

. gen weight= phrf

. sum BMG_WEST if syear ==2018

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
    BMG_WEST │      4,444        6500           0       6500       6500

. global sc_limit=r(mean)    

. global lm = 0.98 * $sc_limit

. mat out=J(5,6,.)

. **************************************************
. *(a) target/unrestricted
. **************************************************
. preserve

. collapse (sum) weight, by(inc_soep)

. *beyondpareto inc_soep [w=weight] , nrange(20, 1000) 
. *diagnose plot (here figure 5 is produced)
. beyondpareto inc_soep [w=weight] , nrange(20, 1000) plot(pareto) save(${out}4a_pareto.png, width(1200) replace)
(sampling weights assumed)
Using the given sampling weights.
Considering the top 1000 of 1851 values, starting from the first 20 and testing 981 values for k base.

──────────┬───────────────────────────────────────────────────────────
  Results │      Ybase       gamma        S.E.  [95% Conf.   Interval]
──────────┼───────────────────────────────────────────────────────────
 inc_soep │  7991.6665   .23670794    .0183943    .2006552    .2727607
──────────┴───────────────────────────────────────────────────────────
Optimal k base: 207
Saving plots

. gr_edit title.text.Arrpush A: Pareto QQ-plot

. beyondpareto inc_soep [w=weight] , nrange(20, 1000) plot(gamma) save(${out}4a_gamma.png, width(1200) replace)
(sampling weights assumed)
Using the given sampling weights.
Considering the top 1000 of 1851 values, starting from the first 20 and testing 981 values for k base.

──────────┬───────────────────────────────────────────────────────────
  Results │      Ybase       gamma        S.E.  [95% Conf.   Interval]
──────────┼───────────────────────────────────────────────────────────
 inc_soep │  7991.6665   .23670794    .0183943    .2006552    .2727607
──────────┴───────────────────────────────────────────────────────────
Optimal k base: 207
Saving plots

. gr_edit title.text.Arrpush B: Extreme value index

. addplot: , xline(430, lp(tight_dot) lc(black))

. graph combine QQ gamma, iscale(1)

. graph display, xsize(12) ysize(5)

. graph export 4a_pareto_gamma_comb.png, replace
file 4a_pareto_gamma_comb.png saved as PNG format

. mat out[2,1] =  e(Ybase)

. mat out[2,2] = e(gamma)

. // sum of all weights, of weighted tail, and of all censored individuals
. sum weight if inc_soep >e(Ybase)

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
      weight │        206    3616.483    5894.219      37.17   56524.53

. scalar w_flanke=r(sum)

. sum weight if inc_soep >=$lm

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
      weight │        370    3935.221    5667.319      37.17   56524.53

. scalar w_cens=r(sum)

. sum weight if !missing(inc_soep)

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
      weight │      1,851     6450.14    14708.13       13.5   220120.6

. scalar w_all=r(sum)

. scalar rel_k_opt=w_flanke/w_all -0.0000001

. scalar rel_cens=w_cens/w_all

. // store shares
. beyondpareto_topshares inc_soep [w=weight], shares(0.01 0.05 0.1) matname(topshares)
(sampling weights assumed)
Using the following results for gamma: .23670794 and Ybase: 7991.6665
Using the given sampling weights.

. mat out[2,3] = e(topshares)[1,2]

. mat out[2,4] = e(topshares)[2,2]

. // Top 10 percent also include individuals outside of the tail. Therefore we have to calculate the earnings share of t
> he tail (based on beyondpareto_topshares) and of the individuals between the 90th percentile and the beginning of the 
> tail
. *compute share of tail (excluding ybase)
. beyondpareto_topshares inc_soep [w=weight], shares(`=rel_k_opt') matname("tail")
(sampling weights assumed)
Using the following results for gamma: .23670794 and Ybase: 7991.6665
Using the given sampling weights.

. * compute share between BBG/p90 and tail
. sum inc_soep [w=weight],d
(analytic weights assumed)

                          inc_soep
─────────────────────────────────────────────────────────────
      Percentiles      Smallest
 1%     708.3333       454.1667
 5%         1500            460
10%         1900          462.5       Obs               1,851
25%     2516.667       462.5833       Sum of wgt.  11939208.7

50%         3500                      Mean           4021.874
                        Largest       Std. dev.      2337.413
75%     4895.833       23133.33
90%     6733.333       24916.67       Variance        5463498
95%         8350       25833.33       Skewness        2.43301
99%     12866.67          41000       Kurtosis        15.6383

. scalar inc_total = r(sum)

. scalar p90=r(p90)

. sum inc_soep if inc_soep>=$lm & inc_soep<=e(Ybase) [w=weight]
(analytic weights assumed)

    Variable │     Obs      Weight        Mean   Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────────────
    inc_soep │     164  711036.419    6981.765   418.4139       6375   7991.667

. scalar share_lm_bbg = r(sum)/inc_total

. sum inc_soep if inc_soep>=p90 & inc_soep<=e(Ybase) [w=weight]
(analytic weights assumed)

    Variable │     Obs      Weight        Mean   Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────────────
    inc_soep │     118  458627.469    7216.794   330.2441   6733.333   7991.667

. scalar share_p90_ybase=r(sum)/inc_total

. * add up share of tail with the respective share below
. mat out[2,5] =  e(tail)[1,2] + 100*share_p90_ybase

. mat out[2,6] = e(tail)[1,2] + 100*share_lm_bbg

. restore

. **************************************************
. *(b) restricted, Ybase forced to be below BBG by setting k min left
. **************************************************
. preserve

. collapse (sum) weight, by(inc_soep)

. sum inc_soep if inc_soep>0.98*$sc_limit

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
    inc_soep │        370    9480.188    3787.074       6375      41000

. local n_cens_soep=r(N)

. sum inc_soep

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
    inc_soep │      1,851    4597.373    3240.883   454.1667      41000

. local n_total = r(N)

. local min_b = `n_cens_soep' + 20

. beyondpareto inc_soep [w=weight] , nrange(`min_b', 1000) 
(sampling weights assumed)
Using the given sampling weights.
Considering the top 1000 of 1851 values, starting from the first 390 and testing 611 values for k base.

──────────┬───────────────────────────────────────────────────────────
  Results │      Ybase       gamma        S.E.  [95% Conf.   Interval]
──────────┼───────────────────────────────────────────────────────────
 inc_soep │       6200   .26419235    .0148059    .2351729    .2932118
──────────┴───────────────────────────────────────────────────────────
Optimal k base: 398

. mat out[3,1] = e(Ybase)

. mat out[3,2] = e(gamma)

. // store shares
. beyondpareto_topshares inc_soep [w=weight], shares(0.01, 0.05, 0.10) matname("topshares")
(sampling weights assumed)
Using the following results for gamma: .26419235 and Ybase: 6200
Using the given sampling weights.

. mat out[3,3] = e(topshares)[1,2]

. mat out[3,4] = e(topshares)[2,2]

. mat out[3,5] = e(topshares)[3,2]

. sum weight if inc_soep >=$lm

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
      weight │        370    3935.221    5667.319      37.17   56524.53

. scalar w_cens=r(sum)

. sum weight if !missing(inc_soep)

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
      weight │      1,851     6450.14    14708.13       13.5   220120.6

. scalar w_all=r(sum)

. scalar rel_cens=w_cens/w_all

. *compute share of censored
. beyondpareto_topshares inc_soep [w=weight], shares(`=rel_cens') matname("censored")
(sampling weights assumed)
Using the following results for gamma: .26419235 and Ybase: 6200
Using the given sampling weights.

. mat out[3,6] = e(censored)[1,2]

. restore

. **************************************************
. *(c) (artificially) censored
. **************************************************
. preserve

. collapse (sum) weight, by(inc_soep)

. * artificial censoring and correction of mass point (adding the weight of the censored to the last uncensored obs.)
. sum weight if inc_soep>=$lm

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
      weight │        370    3935.221    5667.319      37.17   56524.53

. local sumwcens = r(sum)

. sum inc_soep if inc_soep<$lm

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
    inc_soep │      1,481    3377.494    1450.441   454.1667   6366.667

. replace weight = weight + `sumwcens' if inc_soep==r(max)
(1 real change made)

. replace inc_soep=. if inc_soep>=$lm
(370 real changes made, 370 to missing)

. beyondpareto inc_soep [w=weight] , nrange(20, 1000) 
(sampling weights assumed)
(370 missing values generated)
Using the given sampling weights.
Considering the top 1000 of 1481 values, starting from the first 20 and testing 981 values for k base.

──────────┬───────────────────────────────────────────────────────────
  Results │      Ybase       gamma        S.E.  [95% Conf.   Interval]
──────────┼───────────────────────────────────────────────────────────
 inc_soep │  6239.1665   .26662029    .0621562    .1447942    .3884464
──────────┴───────────────────────────────────────────────────────────
Optimal k base: 23

. mat out[4,1] = e(Ybase)

. mat out[4,2] = e(gamma)

. beyondpareto_topshares inc_soep [w=weight], shares(0.01, 0.05, 0.10) matname("shares")
(sampling weights assumed)
Using the following results for gamma: .26662029 and Ybase: 6239.1665
(370 missing values generated)
Using the given sampling weights.

. // store shares
. mat out[4,3] = e(shares)[1,2]

. mat out[4,4] = e(shares)[2,2]

. mat out[4,5] = e(shares)[3,2]

. sum weight if inc_soep >=$lm

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
      weight │        370    3935.221    5667.319      37.17   56524.53

. scalar w_cens=r(sum)

. sum weight if !missing(inc_soep)

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
      weight │      1,481    8061.586    41008.82       13.5    1457932

. scalar w_all=r(sum)

. scalar rel_k_opt=w_flanke/w_all

. scalar rel_cens=w_cens/w_all

. *compute share of censored
. beyondpareto_topshares inc_soep [w=weight], shares(`=rel_cens') matname("censored")
(sampling weights assumed)
Using the following results for gamma: .26662029 and Ybase: 6239.1665
(370 missing values generated)
Using the given sampling weights.

. mat out[4,6] = e(censored)[1,2]

. mat out2=out[2..4,1..6]

. restore

. frmttable using table3.tex, statmat(out2) multicol(1,2,2;1,4,4) rtitle("Target (unrestricted)" \ "Restricted"\"Censore
> d") ctitle("", "Tail metrics" , "" , "Top earnings shares", "", "", "" \"", " $ Y_{base}$", " $\hat{\gamma} $" ,"1\%",
> "5\%","10\%", "Cens.\%") tex fr replace sdec(0,2,2,2,2,2)

         ─────────────────────────────────────────────────────────────────────────────────────────────────────
                                  Tail metrics                    Top earnings shares                        
                                   $ Y_{base}$   $\hat{\gamma} $          1\%           5\%   10\%   Cens.\% 
         ─────────────────────────────────────────────────────────────────────────────────────────────────────
          Target (unrestricted)      7,992            0.24               4.01          13.71  23.13   26.57  
          Restricted                 6,200            0.26               4.16          13.59  22.63   26.19  
          Censored                   6,239            0.27               4.21          13.70  22.78   26.35  
         ─────────────────────────────────────────────────────────────────────────────────────────────────────

This produces Table 3: Tail estimation in the SOEP sample. \[\begin{array}{lcccccc} \hline & \text{Tail metrics} & & & \text{Top earnings shares} & & \\ & Y\_{base} & \hat{\gamma} & 1\% & 5\% & 10\% & \text{at cens.}\%\\ \text{Target (unrestricted)} & 7,992 & 0.24 & 4.01 & 13.71 & 23.13 & 26.57\\ \text{Restricted} & 6,200 & 0.26 & 4.16 & 13.59 & 22.63 & 26.19\\ \text{Censored} & 6,239 & 0.27 & 4.21 & 13.70 & 22.78 & 26.35\\ \hline\end{array}\]

Figure 5 shows the diagnostic plots of beyondpareto for the target scenario in the SOEP sample. The dotted line (Panel B) represents the optimal Y_base in the restricted setting.

Figure 5: Diagnostic plots of beyondpareto in the SOEP sample — Figure 5: Diagnostic plots of `beyondpareto` in the SOEP sample

Bottom line: From Table 3, we conclude that censoring of the data at the assessment ceiling does not prevent us from achieving a very good approximation of the tail of the earnings distribution.

4.2 Top earnings: The SOEP-RV sample

We now consider the smaller SOEP-RV sample and report the results in Table 5.

Code

. use ${temp}soep_rv_west.dta, clear
(SOEP-Core, v38.1 (EU Edition), doi:10.5684/soep.core.v38.1eu)

. keep if JAHR ==2018 // only use 2018
(0 observations deleted)

. keep if sex==1 // focus on men
(1,997 observations deleted)

. gen weight= phrf

. *Output matrix
. mat out2=J(6,6,.)

. sum BMG_WEST if syear ==2018

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
    BMG_WEST │      1,920        6500           0       6500       6500

. global sc_limit=r(mean)    

. global lm = 0.98 * $sc_limit

. **************************************************
. *(a) unrestricted (target)
. **************************************************
. preserve

. collapse (sum) weight, by(inc_soep)

. beyondpareto inc_soep [w=weight] , nrange(20, 1000) 
(sampling weights assumed)
Using the given sampling weights.
Considering the top 1000 of 1110 values, starting from the first 20 and testing 981 values for k base.

──────────┬───────────────────────────────────────────────────────────
  Results │      Ybase       gamma        S.E.  [95% Conf.   Interval]
──────────┼───────────────────────────────────────────────────────────
 inc_soep │       6200   .27431101    .0198381    .2354284    .3131936
──────────┴───────────────────────────────────────────────────────────
Optimal k base: 239

. mat out2[2,1] = e(Ybase)

. mat out2[2,2] = e(gamma)

. sum weight if inc_soep >=$lm

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
      weight │        221    3094.013    3852.989      37.17   24529.21

. scalar w_cens=r(sum)

. sum weight if !missing(inc_soep)

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
      weight │      1,115    5086.979    8481.022          0   103471.4

. scalar w_all=r(sum)

. scalar rel_cens=w_cens/w_all

. // store shares
. beyondpareto_topshares inc_soep [w=weight], shares(0.01 0.05 0.1) matname(topshares)
(sampling weights assumed)
Using the following results for gamma: .27431101 and Ybase: 6200
Using the given sampling weights.

. mat out2[2,3] = e(topshares)[1,2]

. mat out2[2,4] = e(topshares)[2,2]

. mat out2[2,5] = e(topshares)[3,2]

. *compute share of censored (which are all part of the tail)
. beyondpareto_topshares inc_soep [w=weight], shares(`=rel_cens-0.00001') matname("cens")
(sampling weights assumed)
Using the following results for gamma: .27431101 and Ybase: 6200
Using the given sampling weights.

. mat out2[2,6] = e(cens)[1,2]    

. restore

. **************************************************
. *(b) restricted, Ybase forced to be below BBG 
. **************************************************
. preserve

. collapse (sum) weight, by(inc_soep)

. sum inc_soep if inc_soep>=0.98*$sc_limit

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
    inc_soep │        221     9149.34    3785.501       6375      41000

. local n_cens_soep=r(N)

. sum inc_soep

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
    inc_soep │      1,115    4603.961    3076.559        310      41000

. local n_total = r(N)

. local min_b = `n_cens_soep' + 20

. beyondpareto inc_soep [w=weight] , nrange(`n_cens_soep', 400)  
(sampling weights assumed)
Using the given sampling weights.
Considering the top 400 of 1110 values, starting from the first 221 and testing 180 values for k base.

──────────┬───────────────────────────────────────────────────────────
  Results │      Ybase       gamma        S.E.  [95% Conf.   Interval]
──────────┼───────────────────────────────────────────────────────────
 inc_soep │       6200   .27431101    .0198381    .2354284    .3131936
──────────┴───────────────────────────────────────────────────────────
Optimal k base: 239

. mat out2[3,1] = e(Ybase)

. mat out2[3,2] = e(gamma)

. // store shares
. beyondpareto_topshares inc_soep [w=weight], shares(0.01, 0.05, 0.10) matname("topshares")
(sampling weights assumed)
Using the following results for gamma: .27431101 and Ybase: 6200
Using the given sampling weights.

. mat out2[3,3] = e(topshares)[1,2]

. mat out2[3,4] = e(topshares)[2,2]

. mat out2[3,5] = e(topshares)[3,2]

. sum weight if inc_soep >=$lm

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
      weight │        221    3094.013    3852.989      37.17   24529.21

. scalar w_cens=r(sum)

. sum weight if !missing(inc_soep)

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
      weight │      1,115    5086.979    8481.022          0   103471.4

. scalar w_all=r(sum)

. scalar rel_cens=w_cens/w_all

. *compute share of censored (which are all part of the tail)
. beyondpareto_topshares inc_soep [w=weight], shares(`=rel_cens-0.00001') matname("cens")
(sampling weights assumed)
Using the following results for gamma: .27431101 and Ybase: 6200
Using the given sampling weights.

. mat out2[3,6] = e(cens)[1,2]

. restore

. **************************************************
. *(c) censored (artifically) SOEP earnings
. **************************************************
. preserve

. collapse (sum) weight, by(inc_soep)

. * artificial censoring and correction of mass point (adding the weight of the censored to the last uncensored obs.)
. sum weight if inc_soep>=$lm

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
      weight │        221    3094.013    3852.989      37.17   24529.21

. local sumwcens = r(sum)

. sum inc_soep if inc_soep<$lm

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
    inc_soep │        894    3480.327     1378.51        310   6366.667

. replace weight = weight + `sumwcens' if inc_soep==r(max)
(1 real change made)

. replace inc_soep=. if inc_soep>=$lm
(221 real changes made, 221 to missing)

. beyondpareto inc_soep [w=weight] , nrange(15, 500)
(sampling weights assumed)
(221 missing values generated)
Using the given sampling weights.
Considering the top 500 of 889 values, starting from the first 15 and testing 486 values for k base.

──────────┬───────────────────────────────────────────────────────────
  Results │      Ybase       gamma        S.E.  [95% Conf.   Interval]
──────────┼───────────────────────────────────────────────────────────
 inc_soep │       6200   .26233101    .0691303    .1268356    .3978264
──────────┴───────────────────────────────────────────────────────────
Optimal k base: 18

. mat out2[4,1] = e(Ybase)

. mat out2[4,2] = e(gamma)

. beyondpareto_topshares inc_soep [w=weight], shares(0.01, 0.05, 0.10) matname("shares")
(sampling weights assumed)
Using the following results for gamma: .26233101 and Ybase: 6200
(221 missing values generated)
Using the given sampling weights.

. // store shares
. mat out2[4,3] = e(shares)[1,2]

. mat out2[4,4] = e(shares)[2,2]

. mat out2[4,5] = e(shares)[3,2]

. sum weight if inc_soep >=$lm

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
      weight │        221    3094.013    3852.989      37.17   24529.21

. scalar w_cens=r(sum)

. sum weight if !missing(inc_soep)

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
      weight │        894    6344.498    24539.72          0     685677

. scalar w_all=r(sum)

. scalar rel_cens=w_cens/w_all

. *compute share of censored
. beyondpareto_topshares inc_soep [w=weight], shares(`=rel_cens') matname("cens")
(sampling weights assumed)
Using the following results for gamma: .26233101 and Ybase: 6200
(221 missing values generated)
Using the given sampling weights.

. mat out2[4,6] = e(cens)[1,2]

. restore

. ************************************************** 
.  *(d) censored ⇒ IA earnings
. **************************************************
. preserve

. collapse (sum) weight, by(inc_vskt_cens)

. * artificial censoring and correction of mass point (adding the weight of the censored to the last uncensored obs.)
. sum weight if inc_vskt_cens>=$lm

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
      weight │          1    724466.9           .   724466.9   724466.9

. local sumwcens = r(sum)

. sum inc_vskt_cens if inc_vskt_cens<$lm

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
inc_vskt_c~s │      1,509    3521.433    1364.214   73.58871   6367.876

. replace weight = weight + `sumwcens' if inc_vskt_cens==r(max)
(1 real change made)

. replace inc_vskt_cens=. if inc_vskt_cens>=$lm
(1 real change made, 1 to missing)

. beyondpareto inc_vskt_cens [w=weight] , nrange(20, 400)
(sampling weights assumed)
(1 missing value generated)
Using the given sampling weights.
Considering the top 400 of 1502 values, starting from the first 20 and testing 381 values for k base.

──────────────┬───────────────────────────────────────────────────────────
      Results │      Ybase       gamma        S.E.  [95% Conf.   Interval]
──────────────┼───────────────────────────────────────────────────────────
inc_vskt_cens │  6265.8335   .24398702    .0609968    .1244334    .3635406
──────────────┴───────────────────────────────────────────────────────────
Optimal k base: 20

. mat out2[5,1] = e(Ybase)

. mat out2[5,2] = e(gamma)

. beyondpareto_topshares inc_vskt_cens [w=weight], shares(0.01, 0.05, 0.10) matname("shares")
(sampling weights assumed)
Using the following results for gamma: .24398702 and Ybase: 6265.8335
(1 missing value generated)
Using the given sampling weights.

. // store shares
. mat out2[5,3] = e(shares)[1,2]

. mat out2[5,4] = e(shares)[2,2]

. mat out2[5,5] = e(shares)[3,2]

. sum weight if inc_vskt_cens >=$lm

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
      weight │          1    724466.9           .   724466.9   724466.9

. scalar w_cens=r(sum)

. sum weight if !missing(inc_vskt_cens)

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
      weight │      1,509    3758.768    19139.36          0   732632.7

. scalar w_all=r(sum)

. scalar rel_cens=w_cens/w_all

. *compute share of censored 
. beyondpareto_topshares inc_vskt_cens [w=weight], shares(`=rel_cens') matname("cens")
(sampling weights assumed)
Using the following results for gamma: .24398702 and Ybase: 6265.8335
(1 missing value generated)
Using the given sampling weights.

. mat out2[5,6] = e(cens)[1,2]

. mat out2=out2[2..5,1..6]

. restore

. ***compute table***
. frmttable using table5.tex, statmat(out2) multicol(1,2,2;1,4,4) rtitle("Target (unrestricted, SOEP earn.)" \ "Restrict
> ed (SOEP earn.)"\"Censored (SOEP earn.)"\"Censored (IA earn.)") ctitle("", "Tail metrics" , "" , "Top earnings shares"
> , "", "", "" \"", " $ Y_{base}$", " $\hat{\gamma} $" ,"1\%","5\%","10\%", "Cens.\%") tex fr replace sdec(0,2,2,2,2,2)

   ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
                                        Tail metrics                    Top earnings shares                        
                                         $ Y_{base}$   $\hat{\gamma} $          1\%           5\%   10\%   Cens.\% 
   ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
    Target (unrestricted, SOEP earn.)      6,200            0.27               4.27          13.74  22.73   26.03  
    Restricted (SOEP earn.)                6,200            0.27               4.27          13.74  22.73   26.03  
    Censored (SOEP earn.)                  6,200            0.26               4.10          13.43  22.39   25.70  
    Censored (IA earn.)                    6,266            0.24               3.81          12.86  21.72   26.13  
   ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Table 5: Tail estimation in the SOEP-RV sample. \[\begin{array}{lcccccc} \hline & \text{Tail metrics} & & & \text{Top earnings shares} & & \\ & Y\_{base} & \hat{\gamma} & 1\% & 5\% & 10\% & \text{at cens.}\%\\\text{Target (unrestricted, SOEP earn.)} & 6,200 & 0.27 & 4.27 & 13.74 & 22.73 & 26.03\\ \text{Restricted (SOEP earn.)} & 6,200 & 0.27 & 4.27 & 13.74 & 22.73 & 26.03\\ \text{Censored (SOEP earn.)} & 6,200 & 0.26 & 4.10 & 13.43 & 22.39 & 25.70\\ \text{Censored (IA earn.)} & 6,266 & 0.24 & 3.81 & 12.86 & 21.72 & 26.13\\ \hline\end{array}\]

Bottom line: Our estimation approach enables us to provide a credible estimate for the right tail of the earnings distribution based on censored administrative earnings. Figure 6 visualizes the setting and the results: The histogram (brown bars) depicts the earnings density below the administrative censoring ceiling alongside the mass point of the latter (transparent brown bar). We then smoothly paste a Pareto tail (blue line) based on our estimate of \(\hat{\gamma}= 0.24\).

Code

. use ${temp}soep_rv_west.dta, clear
(SOEP-Core, v38.1 (EU Edition), doi:10.5684/soep.core.v38.1eu)

. // only use 2018
. keep if JAHR ==2018 
(0 observations deleted)

. keep if sex==1
(1,997 observations deleted)

. gen weight= phrf

. sum BMG_WEST if syear ==2018

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
    BMG_WEST │      1,920        6500           0       6500       6500

. global sc_limit=r(mean)    

. global lm = 0.98 * $sc_limit

. preserve 

. keep inc_vskt_cens weight 

. keep if inc_vskt_cens >= $lm
(1,620 observations deleted)

. save above_bbg, replace
file above_bbg.dta saved

. restore

. collapse (sum) weight if inc_vskt_cens < $lm, by(inc_vskt_cens)

. append using above_bbg

. preserve

. collapse (sum) weight, by(inc_vskt_cens)

. * artificial censoring and correction of mass point (adding the weight of the censored to the last uncensored obs.)
. sum weight if inc_vskt_cens>=$lm

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
      weight │          1    724466.9           .   724466.9   724466.9

. local sumwcens = r(sum)

. sum inc_vskt_cens if inc_vskt_cens<$lm

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
inc_vskt_c~s │      1,509    3521.433    1364.214   73.58871   6367.876

. replace weight = weight + `sumwcens' if inc_vskt_cens==r(max)
(1 real change made)

. replace inc_vskt_cens=. if inc_vskt_cens>=$lm
(1 real change made, 1 to missing)

. beyondpareto inc_vskt_cens [w=weight] , nrange(20, 400)
(sampling weights assumed)
(1 missing value generated)
Using the given sampling weights.
Considering the top 400 of 1502 values, starting from the first 20 and testing 381 values for k base.

──────────────┬───────────────────────────────────────────────────────────
      Results │      Ybase       gamma        S.E.  [95% Conf.   Interval]
──────────────┼───────────────────────────────────────────────────────────
inc_vskt_cens │  6265.8335   .24398702    .0609968    .1244334    .3635406
──────────────┴───────────────────────────────────────────────────────────
Optimal k base: 20

. scalar sc_alpha = 1/e(gamma)

. restore

. ******************************************************
. *uncensored bins
. ******************************************************
. local num_bins = 10

. sum inc_vskt_cens

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
inc_vskt_c~s │      1,809    3993.833    1635.663   73.58871       6370

. local mininc = r(min)

. local bin_width = ($lm-`mininc')/`num_bins'

. local lm_minus_bin = $lm-`bin_width'

. gen bin_min =.
(1,809 missing values generated)

. gen bin_max =.
(1,809 missing values generated)

. gen bin_nr = .
(1,809 missing values generated)

. gen bin_height = .
(1,809 missing values generated)

. local cnt = 1

. forvalues x=`mininc'(`bin_width')$lm {
  2. *`lm_minus_bin' {
. qui replace bin_min =`x' if _n ==`cnt'
  3. qui replace bin_max =`x'+`bin_width' if _n ==`cnt'
  4. qui replace bin_nr =`cnt' if _n ==`cnt'
  5. qui sum inc_vskt_cens [w=weight] if inc_vskt_cens>=`x' & inc_vskt_cens<=`x'+`bin_width' & inc_vskt_cens<$lm
  6. qui replace bin_height = r(sum_w) if _n ==`cnt'
  7. local cnt = `cnt'+1
  8. }

. gen bin_mean = (bin_max+bin_min)/2
(1,799 missing values generated)

. sum inc_vskt_cens [w=weight]
(analytic weights assumed)

    Variable │     Obs      Weight        Mean   Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────────────
inc_vskt_c~s │   1,801  5671981.45    3864.606    1570.02   73.58871       6370

. gen frequency = bin_height/(r(sum_w)*`bin_width')
(1,799 missing values generated)

. ********************************************************
. * smooth tail
. scalar sc_alpha = 1/e(gamma)

. * we only want the part of the tail starting at the bbg, not from Ybase on 
. scalar sc_xm = $lm

. gen test = _n*15 + $lm

. replace test = . if test > 30000
(234 real changes made, 234 to missing)

. sum inc_vskt_cens [w=weight] if inc_vskt_cens>=$lm
(analytic weights assumed)

    Variable │     Obs      Weight        Mean   Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────────────
inc_vskt_c~s │     299  724466.873        6370          0       6370       6370

. local num_cens = r(sum_w)

. sum inc_vskt_cens [w=weight] 
(analytic weights assumed)

    Variable │     Obs      Weight        Mean   Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────────────
inc_vskt_c~s │   1,801  5671981.45    3864.606    1570.02   73.58871       6370

. local num_total = r(sum_w)

. gen tail = (`num_cens'/`num_total')*(sc_alpha*((sc_xm)^sc_alpha)/((test)^(sc_alpha+1)))
(234 missing values generated)

. global bin_width = `bin_width'

. * mass point
. ********************************************************
. gen masspoint_mean = .
(1,809 missing values generated)

. replace masspoint_mean = $lm+`bin_width'/2 if _n == 1
(1 real change made)

. gen mfrequency = .
(1,809 missing values generated)

. replace mfrequency =`num_cens'/(`num_total'* `bin_width') if _n ==1
(1 real change made)

. * plot only uncensored , mass point and tail
. tw (bar frequency bin_mean, barwidth($bin_width) color("sienna"))  (bar mfrequency masspoint_mean, barwidth($bin_width
>  )  color(sienna%20) ) (scatter tail test, connect(direct)  msize(0.001pt) mcolor("none") lcolor("navy")), legend(orde
> r(1 "IA earnings" 2 "IA earnings (censored)" 3 "Estimated Pareto tail") ring(0) bplace(1) rows(3)) xtitle("Monthly ear
> nings") ytitle("Relative frequency") 

. qui graph export hist_tail_mass.png, replace

Figure 6: IA earnings distribution: censored and fitted tail

4.3 Tobit imputation

A popular way of dealing with top-censored earnings in German administrative data is to impute based on individual-level Tobit regressions. Figure 7 compares the resulting empirical distribution function (EDF) for Tobit-imputed earnings to the target EDF of uncensored SOEP earnings.

Figure 7: IA earnings distribution: censored and fitted tail

Figure 8 compares the implied top earnings shares with the target shares based on the SOEP sample. For maximum comparability, we impute for artificially censored SOEP earnings, to prevent deviations between target and imputed shares due to differing earnings concepts.

Figure 8: Comparison of top earnings shares based on different imputation methods

Bottom line: For the vast majority of top earnings shares, beyondpareto’s estimates are closer to the target than the empirical shares of the Tobit imputation. Generally, the beyondpareto procedure slightly under-predicts, while the Tobit imputation over-predicts. Only in the extreme part of the tail do all three shares become, by necessity, very close.

6 Conclusions

The presented distributional approach should not be confused with individual-level imputations of censored data. The latter requires, in addition to the Pareto parameter, the assignment of a rank of to each censored observation. Due to the censoring, these ranks are unobserved. In the absence of a rank assignment of the censored observations, the calculation of top shares within the estimated tail presents the preferable option to assess the distribution of high earnings.

For further details, please consult the paper: https://doi.org/10.1515/jbnst-2024-0037