A simulation study of kernel density estimation and the histogram
- 1 Introduction
- 2 Methods
  - Histograms
  - Kernel density estimation
  - Cross-validation
- 3 Data generating processes and preliminary experiments
  - Simulation processes
  - Results
- 4 Monte Carlo simulation study
- 5 Conclusion
- References
It seems I am good at finishing in three days an assignment I was given half a month for. The quality may be mediocre and not rigorous, but at least it looks nice! I would call R Markdown the best document editing format (yes, I still have not learned LaTeX). Simple and efficient, code output is typeset nicely, and there are great things like bookdown. Who else! Learn it! Everyone go learn it! (R Markdown Cookbook)
1 Introduction
To estimate a distribution from data, the parametric approach is to hypothesize a parametric family and estimate its parameters from the sample. However, for data with complex structure, it is often difficult to justify any assumption about the parametric family of the underlying density, which calls for more general approaches. Nonparametric density estimation estimates the density with as few distributional assumptions as possible; the histogram is the simplest such method. Kernel density estimation (KDE) is another nonparametric method that, unlike the histogram, produces smooth estimates. This report compares the performance of the kernel density estimator with that of the histogram on univariate random variables.
The rest of this report is structured as follows. Section 2 introduces the basic ideas of histograms and kernel density estimation, as well as cross-validation for smoothing parameter selection. Section 3 then describes the data generating processes and gives a first impression of the comparison through a one-shot experiment. Section 4 performs the Monte Carlo simulation study and discusses the results. Section 5 concludes with a discussion.
2 Methods
Let \(f\) be a probability density of a distribution and let
\[X_1, \dots, X_n \stackrel{\rm iid}{\sim} f, \label{1} \tag{1} \]be a sample of size \(n\) from the distribution. In this report we concentrate on the univariate case, that is, \(X_i \in \mathbb{R}\) for \(i = 1, \dots, n\). The goal is to obtain an estimate of the density \(f\), denoted by \(\hat{f}_n\).
Histograms
Histograms are a natural way to estimate a density. A histogram divides the range of the sample into bins, usually of equal width; this common width is called the binwidth. The histogram estimator at a point \(x_i\) in the \(i^{\text{th}}\) bin is then
\[\hat{f}_n(x_i) = \frac{\text{number of observations in the } i^{\text{th}} \text{ bin}}{\text{binwidth} \times \text{sample size}}. \label{2} \tag{2} \]Frequency polygons, which display the counts with lines, are used in this report for direct comparison.
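As an illustration, the sketch below computes the histogram estimator of equation (2) on a simulated sample; the binwidth value and the helper name `hist_density` are arbitrary choices for this example.

```r
# A minimal sketch of the histogram estimator in equation (2) with equal-width
# bins; the binwidth of 0.5 below is an arbitrary illustrative choice.
hist_density <- function(x, binwidth) {
  breaks <- seq(min(x), max(x) + binwidth, by = binwidth)
  counts <- table(cut(x, breaks, include.lowest = TRUE))
  data.frame(mid     = head(breaks, -1) + binwidth / 2,            # bin midpoints
             density = as.numeric(counts) / (binwidth * length(x)))
}

set.seed(1)
x   <- rnorm(250)
est <- hist_density(x, binwidth = 0.5)
# Joining (est$mid, est$density) with lines gives the frequency polygon used here.
```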
Kernel density estimation
Kernel density estimation fits the observations with a smooth density function called the kernel. Given a kernel \(K\) and a positive number \(h\), called the bandwidth, the kernel density estimator is
\[\hat{f}_n(x) = \frac{1}{n} \sum_{i = 1}^{n} \frac{1}{h} K \left( \frac{x - X_i}{h} \right). \label{3} \tag{3} \]In this report, the Gaussian kernel is used, i.e. \(K(x) = \phi(x)\), where \(\phi(x)\) is the standard normal density function.
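The following sketch implements the Gaussian kernel density estimator of equation (3) directly; the evaluation grid and the bandwidth value are arbitrary choices for illustration.

```r
# A minimal sketch of the Gaussian kernel density estimator in equation (3);
# the bandwidth h = 0.3 is an arbitrary illustrative value.
kde_gauss <- function(grid, x, h) {
  # For each grid point t, average the scaled Gaussian kernels centred at the X_i
  sapply(grid, function(t) mean(dnorm((t - x) / h)) / h)
}

set.seed(1)
x     <- rnorm(250)
grid  <- seq(-4, 4, length.out = 200)
f_hat <- kde_gauss(grid, x, h = 0.3)
# stats::density(x, bw = 0.3, kernel = "gaussian") gives an equivalent estimate.
```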
Cross-validation
The integrated squared error (ISE) can be used as a loss function measuring the difference between the estimated density and the true density; it is defined as
\[L = \int \left( \hat{f}_n(x) - f(x) \right)^2 \text{d}x. \label{4} \tag{4} \]The quality of the estimator \(\hat{f}_n\) is assessed in terms of the integrated mean squared error (IMSE), also known as the risk,
\[\begin{aligned} R &= \text{E}(L) = \int \text{E} \left[ \left( \hat{f}_n(x) - f(x) \right)^2 \right] \text{d}x \\ &= \int \left( \text{E} \left[ \hat{f}_n(x) \right] - f(x) \right)^2 \text{d}x + \int \text{E} \left[ \left( \hat{f}_n(x) - \text{E} \left[ \hat{f}_n(x) \right] \right)^2 \right] \text{d}x \\ &= \int \text{Bias}^2 \left[ \hat{f}_n(x) \right] \text{d}x + \int \text{Var} \left[ \hat{f}_n(x) \right] \text{d}x, \end{aligned} \label{5} \tag{5} \]where \(\text{Bias} \left[ \hat{f}_n(x) \right]\) and \(\text{Var} \left[ \hat{f}_n(x) \right]\) are respectively the bias and the variance of \(\hat{f}_n(x)\) at a fixed \(x\). The two integrals are thus the total squared bias and the total variance of the estimator. In general, we would like to minimize the risk.
The smoothness of the estimate \(\hat{f}_n\) is controlled by a smoothing parameter: the number of bins (equivalently, the binwidth) for the histogram, and the bandwidth \(h\) for the kernel density estimator. Oversmoothed estimates have small sampling variance but large bias (underfitting); undersmoothed estimates have small bias but large sampling variance (overfitting). Minimizing the risk therefore requires balancing this bias-variance trade-off, as illustrated in the sketch below.
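As a quick illustration of the trade-off, the sketch below fits the same sample with a deliberately large and a deliberately small bandwidth (both values are arbitrary).

```r
# The same sample fitted with an oversmoothing and an undersmoothing bandwidth.
set.seed(1)
x <- rnorm(250)
oversmoothed  <- density(x, bw = 1.00, kernel = "gaussian")  # low variance, high bias
undersmoothed <- density(x, bw = 0.05, kernel = "gaussian")  # low bias, high variance
```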
A common method for selecting the smoothing parameter so as to minimize the risk is cross-validation; the leave-one-out method is used in this report.
Since there is no response variable, the cross-validation estimator differs from the one used in the regression setting. The cross-validation estimator of the risk[1] is
\[\hat{J} (h) = \int \left( \hat{f}_n (x) \right)^2 \text{d}x - \frac{2}{n} \sum_{i = 1}^{n} \hat{f}_{(-i)} (X_i), \label{6} \tag{6} \]where \(\hat{f}_{(-i)}\) is the estimator from the sample without the \(i^{\text{th}}\) observation (Wasserman, 2006, ch. 6).
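For illustration, the sketch below evaluates \(\hat{J}(h)\) of equation (6) for the Gaussian kernel density estimator over a grid of candidate bandwidths; the grid, the integration limits, and the helper name `cv_risk` are arbitrary choices.

```r
# A minimal sketch of leave-one-out cross-validation for the KDE bandwidth,
# evaluating the risk estimate in equation (6) over a grid of candidate h values.
cv_risk <- function(x, h) {
  n <- length(x)
  f_hat <- function(t) sapply(t, function(ti) mean(dnorm((ti - x) / h)) / h)
  # First term: integral of the squared estimate (numerical integration)
  int_sq <- integrate(function(t) f_hat(t)^2,
                      lower = min(x) - 4 * h, upper = max(x) + 4 * h)$value
  # Second term: leave-one-out estimates evaluated at the left-out observations
  loo <- sapply(seq_len(n), function(i) mean(dnorm((x[i] - x[-i]) / h)) / h)
  int_sq - 2 * mean(loo)
}

set.seed(1)
x      <- rnorm(250)
hs     <- seq(0.05, 1, by = 0.05)
scores <- sapply(hs, function(h) cv_risk(x, h))
h_cv   <- hs[which.min(scores)]
# stats::bw.ucv(x) implements this least-squares (unbiased) cross-validation criterion.
```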
3 Data generating processes and preliminary experiments
Simulation processes
Samples from four different distributions are generated. A one-shot experiment is performed first: for each sample, we obtain the histogram estimator and the Gaussian kernel density estimator, with smoothing parameters chosen by cross-validation. The sample size is 250 here. The four distributions, which are also used in the Monte Carlo simulation study in Section 4, are listed below (a sketch of the corresponding sampling functions follows the list):
- standard normal distribution: \(N(0, 1)\);
- exponential distribution with rate 1: \(\exp(1)\);
- chi-squared distribution with 3 degrees of freedom: \(\chi^2(3)\);
- the claw density (Marron & Wand, 1992) or, in Wasserman (2006)'s words, the Bart Simpson density, which is a normal mixture:

\[f = \frac{1}{2} N(0, 1) + \frac{1}{10} \sum_{m = 0}^4 N \left( \frac{m}{2}-1, \frac{1}{10} \right). \label{7} \tag{7} \]
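The helper names `rclaw` and `generate_sample` below are hypothetical, and the claw components are taken with standard deviation 1/10, following Wasserman (2006).

```r
# Hypothetical helpers for the four data generating processes; the claw /
# Bart Simpson mixture of equation (7) uses component standard deviation 1/10.
rclaw <- function(n) {
  # Component 0 is N(0, 1) with probability 1/2; components 1..5 are the narrow
  # N(m/2 - 1, 1/10) densities, each with probability 1/10 (m = 0, ..., 4).
  comp <- sample(0:5, n, replace = TRUE, prob = c(1/2, rep(1/10, 5)))
  mu   <- ifelse(comp == 0, 0, (comp - 1) / 2 - 1)
  sd   <- ifelse(comp == 0, 1, 1/10)
  rnorm(n, mean = mu, sd = sd)
}

generate_sample <- function(n, dist = c("normal", "exponential", "chisq", "claw")) {
  switch(match.arg(dist),
         normal      = rnorm(n),
         exponential = rexp(n, rate = 1),
         chisq       = rchisq(n, df = 3),
         claw        = rclaw(n))
}
```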
Results
The estimated densities are plotted against the true densities of the four distributions in Figures 1 through 4. Kernel density estimation produces smooth estimates compared to the histogram. The estimated densities capture the essential characteristics of the true distributions, although most of them appear slightly undersmoothed. Since the true densities are known here, it can be seen that the smoothing parameters chosen by cross-validation lead to estimators that capture more variation in the data than is needed. For the Bart Simpson density in Figure 4, however, the estimates appear properly smoothed over the "wave" part of the density. In Figure 3, the histogram estimator gives a less noisy density that is closer to the true one. Depending on the situation, alternative choices of smoothing parameters can be considered to obtain a better estimate.
It is worth noting that in the exponential example in Figure 2, the true density attains its maximum at the left boundary of the support, whereas the estimates rise to a peak and then fall back towards zero near it. This problem arises when the support of the true density has boundaries. High bias and variance at boundary points is a well-known problem for nonparametric curve estimators (Gasser & Müller, 1979; Hominal & Deheuvels, 1979). Possible remedies can be found in Gasser & Müller (1979), Müller (1984) and Cowling & Hall (1996).
Although this is only a one-shot experiment, some conclusions can already be drawn. The main advantage of kernel density estimation over the histogram is its smoothness: by using information from the neighbourhood of each point, it largely avoids the discontinuities of the histogram. Both methods share the problem of choosing the smoothing parameter, to which the final estimate is sensitive, and both suffer from bias at boundary points in finite samples.
4 Monte Carlo simulation study
The same simulation process is replicated 200 times, yielding 200 outcomes for every combination of estimation method and density. The ISE is calculated for each estimate. In addition to the sample size of 250, sample sizes of 500 and 1000 are also used. Figure 5 shows boxplots of the ISE for the different distributions and estimation methods.
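The sketch below shows one cell of this design for the kernel density estimator: draw a sample, fit the Gaussian KDE with a cross-validated bandwidth, and record the ISE against the true density. The helper name `ise_kde` is hypothetical, `bw.ucv()` supplies the cross-validated bandwidth, and the ISE integral is approximated numerically over the range of the density grid.

```r
# A minimal sketch of one Monte Carlo cell: repeatedly draw a sample, fit the
# Gaussian KDE with a cross-validated bandwidth, and record the ISE against
# the true density (function names here are hypothetical).
ise_kde <- function(n, rdist, ddist, n_rep = 200) {
  replicate(n_rep, {
    x     <- rdist(n)
    fit   <- density(x, bw = bw.ucv(x), kernel = "gaussian")
    f_hat <- approxfun(fit$x, fit$y, yleft = 0, yright = 0)
    integrate(function(t) (f_hat(t) - ddist(t))^2,
              lower = min(fit$x), upper = max(fit$x))$value
  })
}

set.seed(1)
ise_normal_250 <- ise_kde(250, rdist = rnorm, ddist = dnorm)  # 200 ISE values
# boxplot(ise_normal_250) summarises one cell of Figure 5.
```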
When the sample size increases, the ISE is expected to be smaller in all cases. In addition, the sizes of the boxes suggest that the variance across simulations becomes smaller for larger sample sizes. Estimates from larger samples are therefore more stable and closer to the true density.
The ISE of the histogram estimates is expected to be larger than that of the kernel density estimates for the standard normal and Bart Simpson densities, and smaller for the exponential and chi-squared densities. Different approaches to choosing the smoothing parameter may lead to different results. The variation of the ISE across simulations is larger for the histogram than for kernel density estimation.
5 Conclusion
This report compares two nonparametric density estimation methods, the histogram and Gaussian kernel density estimation, through simulation. Kernel density estimators give smooth estimates, whereas histograms do not. Both methods require care in the choice of the smoothing parameter and near the boundaries of the support. Estimates from larger samples are more stable and closer to the true density, and kernel density estimation is more stable than the histogram.
References
Boos, D. D., & Stefanski, L. A. (2013). Monte Carlo Simulation Studies. In Essential Statistical Inference (pp. 363-383). Springer, New York, NY.
Cowling, A., & Hall, P. (1996). On pseudodata methods for removing boundary effects in kernel density estimation. Journal of the Royal Statistical Society: Series B (Methodological), 58(3), 551-563.
Deng, H., & Wickham, H. (2011). Density estimation in R. Electronic publication.
Gasser, T., & Müller, H. G. (1979). Kernel estimation of regression functions. In Smoothing techniques for curve estimation (pp. 23-68). Springer, Berlin, Heidelberg.
Marron, J. S., & Wand, M. P. (1992). Exact mean integrated squared error. The Annals of Statistics, 20(2), 712-736.
Müller, H. G. (1984). Boundary effects in nonparametric curve estimation models. In Compstat 1984 (pp. 84-89). Physica, Heidelberg.
Wasserman, L. (2006). All of nonparametric statistics. Springer Science & Business Media.
[1] The equation follows from the definition of the ISE, with the constant term removed. See Wasserman (2006), ch. 6 for more details.