Skip to contents
  • For beyondpareto’s full set of options, see the reference page
  • All examples were produced using Stata 18. Earlier versions may produce different random numbers.

Fitting a joint Pareto and Log-Normal Distribution

A popular parametric model for income, wealth, and city size data consists in assuming that a Pareto upper tail is smoothly pasted on a log-normal body (a so-called lognormal-Pareto model).

The data generating process (DGP) is as follows: We draw a synthetic dataset of 3,000 observations following a log-normal distribution. The mean of the underlying normal variate is 5 and the standard deviation is 2. An additional 2,000 observations populate the upper tail and follow a Pareto distribution, i.e.  \[ F(y)= 1- \left( y / y_{\mathrm {m}} \right) ^{-1/ \gamma } \] for \(y\geq y_{\mathrm {m}}\) and 0 otherwise, with \(\gamma= 1 / 0.85 =1.176\) and threshold \(y_{\mathrm{m}}\)=428,789. Hence, by construction, the threshold value that marks the beginning of the Pareto tail is known, such that the evaluation of the performance of beyondpareto is straightforward.

quietly {
 set seed 330033
 set obs 5000
 gen wealth = exp(rnormal(5,2))
 sum wealth
 local max_l = r(max)
 sort wealth
 replace wealth=. if _n>=3000
 gen un=runiform() if wealth==.
 sum wealth
 local max_l = r(max)
 replace wealth=((1-un)^(-1/0.85))*`max_l' if wealth==.
 drop un
 gen weight=1
}

beyondpareto wealth [w=weight], nrange(10,5000) rho(-0.5) draw_plots
qui graph export pics/QQ1.png, replace
(sampling weights assumed)
Using the given sampling weights.
Considering the top 5000 of 5000 values, starting from the first 10 and testing 4991 values for k base.

----------------------------------------------------------------------
  Results |      Ybase       gamma        S.E.  [95% Conf.   Interval]
----------+-----------------------------------------------------------
   wealth |  240.80734   1.1864621     .029595    1.128456    1.244468
----------------------------------------------------------------------
Optimal k base: 2009

The Pareto QQ-plot becomes eventually linear, but it is not immediately apparent from visual inspection at which precise relative rank that is. The selected threshold of k\(^{\ast}\)=2,009 is very close to actual threshold value of 2,000. The 95% confidence interval of the estimated shape parameter (1.128; 1.245) is tight and contains the underlying true value of 1.176.

Fitting a Burr Distribution

Consider next the Burr distribution \[ 1-F_{(\gamma,\rho)}(y)=(1+y^{-\rho / \gamma})^{1/\rho} \] with \(\gamma >0\) and \(\rho <0\) (the latter being in fact the second-order parameter of regular variation). In the inequality literature it is also known as the Singh-Maddala distribution . For large \(y\) the distribution can be expanded as \(y^{-1/\gamma} [1 + (1/\rho) y ^{\gamma/\rho}]\), which reveals the Burr distribution to be a member of the Hall class. Its tail quantile is \(U(y)=y^\gamma [1 + (\gamma/\rho) y^\rho + o(y^\rho)]\).

The study of the estimator of \(\gamma\) in the Burr case is instructive of the empirical challenges given by a slow convergence to the Pareto limit, parameterized here by \(\rho\). We have generated the data with \(\rho=-2\), so that tail convergence to a Paretian distribution is rather fast.

quietly {
  clear
  set seed 330033
  local gam=0.6
  local rho=-2
  set obs 20000
  gen un=runiform()
  gen wealth=(un^`rho' - 1)^(-`gam'/`rho')
  gen weight=1
}

beyondpareto wealth [w=weight], nrange(10,10000) rho(-0.5)

(sampling weights assumed)
Using the given sampling weights.
Considering the top 10000 of 20000 values, starting from the first 10 and testing 9991 values for k base.

----------------------------------------------------------------------
  Results |      Ybase       gamma        S.E.  [95% Conf.   Interval]
----------+-----------------------------------------------------------
   wealth |  2.8018258   .60229768    .0113035    .5801428    .6244525
----------------------------------------------------------------------
Optimal k base: 3549

Top-censoring in income distribution (GB2 model)

The GB2 distribution provides a very good fit for local earning distributions and we use this example to demonstrate how to handle censored data as it often appears in income distributions from e.g., administrative data. We generate a dataset based on the GB2 distribution using the parameter estimates as given in Schluter and Trede (2023).

quietly {
  clear
  set seed 330033
  local b = 5.18 /* population gamma: 1 / (b * q) */
  local a = 32754
  local p = 0.518
  local q= 0.509
  set obs 140000
  gen double income = `a'*( (1/invibeta(`p',`q',runiform()))-1 )^(-1/`b')
  gsort - income
  gen weight = 1
}

beyondpareto income, fracrange(.0001,.2) rho(-0.5)  

No sampling weights given. Using w=1.
Considering the top 28000 of 140000 values, starting from the first 14 and testing 27987 values for k base.

----------------------------------------------------------------------
  Results |      Ybase       gamma        S.E.  [95% Conf.   Interval]
----------+-----------------------------------------------------------
   income |  57273.714   .38292603    .0029869    .3770718    .3887803
----------------------------------------------------------------------
Optimal k base: 20545

Next, we investigate the effect of top-censoring on the estimator, imposing a censoring incidence of 12% (a typical value that is used in the German administrative SIAB data). As the distribution now has a mass point at the censoring threshold, we change the weight of one such worker and drop the remaining censored observations.

quietly {
  clear
  set seed 330033
  local b = 5.18 /* population gamma: 1 / (b * q) */
  local a = 32754
  local p = 0.518
  local q= 0.509
  set obs 140000
  gen double income = `a'*( (1/invibeta(`p',`q',runiform()))-1 )^(-1/`b')
  gsort - income
  gen weight = 1
  
  local n_cens = (.12 * _N)
  drop if _n < `n_cens' /* 12% censoring & dropping */
  replace weight = weight + `n_cens' if _n==1
}

beyondpareto income [w=weight], fracrange(.003,.2) rho(-0.5) draw_plots
qui graph export pics/QQ2.png, replace

(sampling weights assumed)
Using the given sampling weights.
Considering the top 24640 of 123201 values, starting from the first 370 and testing 24271 values for k base.

----------------------------------------------------------------------
  Results |      Ybase       gamma        S.E.  [95% Conf.   Interval]
----------+-----------------------------------------------------------
   income |  61107.893   .36810402    .0174224    .3339561     .402252
----------------------------------------------------------------------
Optimal k base: 558

Despite such a large censoring incidence, beyondpareto performs well in estimating the rank-size regressor. However, if the censoring problem is not properly addressed (by either ignoring it or dropping all censored observations), the estimator will be biased towards zero.

Fitting German City Sizes

Now we apply beyondpareto to German city size data. The distribution of city-sizes is generally also not strictly Pareto, but rather Pareto-like.

quietly {
  clear
  import excel using "https://www.destatis.de/DE/Themen/Laender-Regionen/Regionales/Gemeindeverzeichnis/Administrativ/Archiv/GVAuszugJ/31122000_Auszug_GV.xlsx?__blob=publicationFile", sheet(Gemeindedaten) cellrange(J8:J16155)

  rename J citysize
  drop if citysize == .
  keep if citysize > 1
  gen weight = 1
  gsort - citysize
}

beyondpareto citysize [w=weight], fracrange(0.001, 0.5) rho(-0.5) draw_plots
qui graph export pics/QQ3.png, replace

(sampling weights assumed)
Using the given sampling weights.
Considering the top 6919 of 13837 values, starting from the first 14 and testing 6906 values for k base.

----------------------------------------------------------------------
  Results |      Ybase       gamma        S.E.  [95% Conf.   Interval]
----------+-----------------------------------------------------------
 citysize |      16042   .76189049    .0283468    .7063308    .8174502
----------------------------------------------------------------------
Optimal k base: 903

We obtain a value of \(\gamma=0.761\), which is the same as the estimate in Schluter (2021). As we can see from the diagnostic plots, this value appears very sensible as the plot for gamma is very flat and stable. Then it becomes optimal to choose a value that minimizes the variance of the estimate of \(\gamma\).