Examples
examples.Rmd
- For
beyondpareto
’s full set of options, see the reference page - All examples were produced using
Stata
18 (see session information below). Earlier Stata versions may produce different random numbers.
Fitting a joint Pareto and Log-Normal Distribution
A popular parametric model for income, wealth, and city size data consists in assuming that a Pareto upper tail is smoothly pasted on a log-normal body (a so-called lognormal-Pareto model).
The data generating process (DGP) is as follows: We draw a synthetic
dataset of 3,000 observations following a log-normal distribution. The
mean of the underlying normal variate is 5 and the standard deviation is
2. An additional 2,000 observations populate the upper tail and follow a
Pareto distribution, i.e. \[
F(y)= 1- \left( y / y_{\mathrm {m}} \right) ^{-1/ \gamma }
\] for \(y\geq y_{\mathrm {m}}\)
and 0 otherwise, with \(\gamma= 1 / 0.85
=1.176\) and threshold \(y_{\mathrm{m}}\)=242.51. Hence, by
construction, the threshold value that marks the beginning of the Pareto
tail is known, such that the evaluation of the performance of
beyondpareto
is straightforward.
quietly {
set seed 330033
set obs 5000
gen wealth = exp(rnormal(5,2))
sum wealth
local max_l = r(max)
sort wealth
replace wealth=. if _n>=3000
gen un=runiform() if wealth==.
sum wealth
local max_l = r(max)
replace wealth=((1-un)^(-1/0.85))*`max_l' if wealth==.
drop un
gen weight=1
}
beyondpareto wealth [w=weight], nrange(10,5000) rho(-0.5) plot(all)
qui graph export pics/QQ1.png, replace
(sampling weights assumed)
Using the given sampling weights.
Considering the top 5000 of 5000 values, starting from the first 10 and testing 4991 value
> s for k base.
----------------------------------------------------------------------
Results | Ybase gamma S.E. [95% Conf. Interval]
----------+-----------------------------------------------------------
wealth | 240.80734 1.1864621 .029595 1.128456 1.244468
----------------------------------------------------------------------
Optimal k base: 2009
The Pareto QQ-plot becomes eventually linear, but it is not immediately apparent from visual inspection at which precise relative rank that is. The selected threshold of k\(^{\ast}\)=2,009 is very close to actual threshold value of 2,000. The 95% confidence interval of the estimated shape parameter (1.128; 1.245) is tight and contains the underlying true value of 1.176.
Fitting a Burr Distribution
Consider next the Burr distribution \[ 1-F_{(\gamma,\rho)}(y)=(1+y^{-\rho / \gamma})^{1/\rho} \] with \(\gamma >0\) and \(\rho <0\) (the latter being in fact the second-order parameter of regular variation). In the inequality literature it is also known as the Singh-Maddala distribution . For large \(y\) the distribution can be expanded as \(y^{-1/\gamma} [1 + (1/\rho) y ^{\gamma/\rho}]\), which reveals the Burr distribution to be a member of the Hall class. Its tail quantile is \(U(y)=y^\gamma [1 + (\gamma/\rho) y^\rho + o(y^\rho)]\).
The study of the estimator of \(\gamma\) in the Burr case is instructive of the empirical challenges given by a slow convergence to the Pareto limit, parameterized here by \(\rho\). We have generated the data with \(\rho=-2\), so that tail convergence to a Paretian distribution is rather fast.
quietly {
clear
set seed 330033
local gam=0.6
local rho=-2
set obs 20000
gen un=runiform()
gen wealth=(un^`rho' - 1)^(-`gam'/`rho')
gen weight=1
}
beyondpareto wealth [w=weight], nrange(10,10000) rho(-0.5)
(sampling weights assumed)
Using the given sampling weights.
Considering the top 10000 of 20000 values, starting from the first 10 and testing 9991 val
> ues for k base.
----------------------------------------------------------------------
Results | Ybase gamma S.E. [95% Conf. Interval]
----------+-----------------------------------------------------------
wealth | 2.8018258 .60229768 .0113035 .5801428 .6244525
----------------------------------------------------------------------
Optimal k base: 3549
Top-censoring in income distribution (GB2 model)
The GB2 distribution provides a very good fit for local earning distributions and we use this example to demonstrate how to handle censored data as it often appears in income distributions from e.g., administrative data. We generate a dataset based on the GB2 distribution using the parameter estimates as given in Schluter and Trede (2023).
quietly {
clear
set seed 330033
local b = 5.18 /* population gamma: 1 / (b * q) */
local a = 32754
local p = 0.518
local q= 0.509
set obs 140000
gen double income = `a'*( (1/invibeta(`p',`q',runiform()))-1 )^(-1/`b')
gsort - income
gen weight = 1
}
beyondpareto income, fracrange(.0001,.2) rho(-0.5)
No sampling weights given. Using w=1.
Considering the top 28000 of 140000 values, starting from the first 14 and testing 27987 v
> alues for k base.
----------------------------------------------------------------------
Results | Ybase gamma S.E. [95% Conf. Interval]
----------+-----------------------------------------------------------
income | 57273.714 .38292603 .0029869 .3770718 .3887803
----------------------------------------------------------------------
Optimal k base: 20545
Next, we investigate the effect of top-censoring on the estimator, imposing a censoring incidence of 12% (a typical value that is used in the German administrative SIAB data). As the distribution now has a mass point at the censoring threshold, we change the weight of one such worker and drop the remaining censored observations.
quietly {
clear
set seed 330033
local b = 5.18 /* population gamma: 1 / (b * q) */
local a = 32754
local p = 0.518
local q= 0.509
set obs 140000
gen double income = `a'*( (1/invibeta(`p',`q',runiform()))-1 )^(-1/`b')
gsort - income
gen weight = 1
local n_cens = (.12 * _N)
drop if _n < `n_cens' /* 12% censoring & dropping */
replace weight = weight + `n_cens' if _n==1
}
beyondpareto income [w=weight], fracrange(.003,.2) rho(-0.5) plot(all)
qui graph export pics/QQ2.png, replace
(sampling weights assumed)
Using the given sampling weights.
Considering the top 24640 of 123201 values, starting from the first 370 and testing 24271
> values for k base.
----------------------------------------------------------------------
Results | Ybase gamma S.E. [95% Conf. Interval]
----------+-----------------------------------------------------------
income | 61107.893 .36810402 .0174224 .3339561 .402252
----------------------------------------------------------------------
Optimal k base: 558
Despite such a large censoring incidence, beyondpareto
performs well in estimating the rank-size regressor. However, if the
censoring problem is not properly addressed (by either ignoring it or
dropping all censored observations), the estimator will be biased
towards zero.
Fitting German City Sizes
Now we apply beyondpareto to German city size data. The distribution of city-sizes is generally also not strictly Pareto, but rather Pareto-like.
quietly {
clear
import excel using "https://www.destatis.de/DE/Themen/Laender-Regionen/Regionales/Gemeindeverzeichnis/Administrativ/Archiv/GVAuszugJ/31122000_Auszug_GV.xlsx?__blob=publicationFile", sheet(Gemeindedaten) cellrange(J8:J16155)
rename J citysize
drop if citysize == .
keep if citysize > 1
gen weight = 1
gsort - citysize
}
beyondpareto citysize [w=weight], fracrange(0.001, 0.5) rho(-0.5) plot(all)
qui graph export pics/QQ3.png, replace
(sampling weights assumed)
Using the given sampling weights.
Considering the top 6919 of 13837 values, starting from the first 14 and testing 6906 valu
> es for k base.
----------------------------------------------------------------------
Results | Ybase gamma S.E. [95% Conf. Interval]
----------+-----------------------------------------------------------
citysize | 16042 .76189049 .0283468 .7063308 .8174502
----------------------------------------------------------------------
Optimal k base: 903
We obtain a value of \(\gamma=0.761\), which is the same as the estimate in Schluter (2021). As we can see from the diagnostic plots, this value appears very sensible as the plot for gamma is very flat and stable. Then it becomes optimal to choose a value that minimizes the variance of the estimate of \(\gamma\).
Version control and session information
which beyondpareto
*! beyondpareto.ado, version 1.0, 2024-06-14
*! Christian Schluter and Isabella Retter
*! To learn more about beyondpareto see the vignette at
*! https://christianschluter.github.io/beyondpareto/
display "STATA version: `c(stata_version)' running on `c(os)' `c(osdtl)' "
STATA version: 18 running on MacOSX 14.1.0