### Call title (Call ID)

Faculty Development Competitive Research Grant Program 2018-2020

### Project Description

In this project I investigate the possibility of efficiency gains in panel data models with survey data. Under a generalized method of moments (GMM) framework, I will show that more efficient estimators can be found by exploiting the within-panel information in each stratum that pooled estimates ignore. As with generalized estimating equations (GEE), specific forms can be assumed for the correlation within the panels in each stratum. Monte Carlo results confirm that the new GMM estimators, called weighted and unweighted generalized least squares (GLS), are more efficient than their competitors, i.e., pooled ordinary least squares (POLS) and weighted POLS. Under endogenous stratification, weighted GLS performs best; under exogenous stratification, unweighted GLS does. I apply the findings to study the determinants of income inequality in the United States using data from the Panel Study of Income Dynamics (PSID), a well-known survey. I will show that the new estimators are more efficient than the POLS and weighted POLS methods.

In many empirical studies in economics, political science, sociology, and other social sciences, surveys are standard tools for collecting data. One important aspect of survey data is that the observations are no longer independently and identically distributed (iid). Yet many estimation methods, such as ordinary least squares and maximum likelihood, rely on the iid assumption, which is violated in survey studies.

We know how to fix the inconsistency problem in survey data by appropriate weighting. Beyond consistency, finding more efficient estimators helps researchers increase the precision of their statistical inferences. Efficiency usually comes at a price: it requires stronger assumptions than those needed for consistency. Here, we assume that the explanatory variables are strictly exogenous. In panel data studies, however, this assumption is violated in models with lagged dependent variables and perhaps in models without them. Fixed effects (FE) and random effects (RE) are two well-known linear methods used in empirical studies that require strict exogeneity of the explanatory variables. In the RE approach, the serial correlation in the composite error is exploited in a generalized least squares (GLS) framework. The GLS procedure also requires additional assumptions on the conditional variance matrix of the error term.
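For reference, the RE structure described above can be written out in standard textbook notation. The symbols below ($c_i$ for the unit effect, $u_{it}$ for the idiosyncratic error, $\sigma_c^2$, $\sigma_u^2$, $\Omega$, and $j_T$ for a $T \times 1$ vector of ones) are assumptions introduced here for illustration; they do not come from the project text.

```latex
\begin{aligned}
y_{it} &= x_{it}\beta + c_i + u_{it}, \qquad v_{it} = c_i + u_{it}, \quad t = 1,\dots,T, \\
\Omega &= \mathrm{Var}(v_i \mid x_i) = \sigma_u^2 I_T + \sigma_c^2\, j_T j_T', \\
\hat{\beta}_{GLS} &= \Big(\sum_{i=1}^{N} X_i' \Omega^{-1} X_i\Big)^{-1} \sum_{i=1}^{N} X_i' \Omega^{-1} y_i .
\end{aligned}
```

The second line is the kind of added assumption on the conditional variance matrix that the text refers to: every panel shares the same variance $\sigma_c^2 + \sigma_u^2$ on the diagonal and the same covariance $\sigma_c^2$ off the diagonal.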

Finding efficient estimators in models with stratified data is more challenging, and the issue has already attracted interest. Among others, Cosslett (1981a, 1981b, and 1993) and Imbens (1992) examine efficiency for discrete choice models. Imbens and Lancaster (1996) develop a new estimator for this estimation problem and show that it achieves the semi-parametric efficiency bound in this case. More recently, Tripathi (2011) developed efficient empirical likelihood-based inference for moment restriction models when data are collected by stratified sampling schemes. In this paper, the efficiency problem is studied in panel data models with stratified data. The main idea is to use the information within panels, as the GLS method does, as a means to gain efficiency. It should be emphasized that we only approximate the efficient estimator in the sample and try to obtain more efficient estimates compared with pooled estimators that ignore correlations within panels. In other words, our goal in this paper is not to find efficiency bounds of any kind.

The main effect of the clustering and stratified sampling methods is that the data are not identically and independently distributed (iid) anymore. Therefore, assuming iid may cause serious problems.

Three widely known stratified sampling schemes are standard stratified sampling (SS sampling), multinomial sampling, and variable probability sampling (VP sampling). In SS sampling, the population is divided into several subpopulations based on factors like income, race, gender, education, area of residence, etc. Then, a random sample is taken within each subpopulation or stratum independently. The result is a sample of independent, but not identically distributed observations. It should be emphasized that, unlike simple random sampling, in SS sampling the proportions of observations within strata do not reflect population proportions as they would if the sample were selected randomly from the population at large.
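The point about sample proportions can be illustrated with a minimal Python sketch of SS sampling. The population shares (90% and 10%), the per-stratum sample sizes, and the function name `standard_stratified_sample` are all hypothetical choices made for this illustration.

```python
import numpy as np

def standard_stratified_sample(stratum_labels, n_per_stratum, rng):
    """Draw a fixed number of units independently from each stratum (SS sampling)."""
    chosen = []
    for stratum, n_draw in n_per_stratum.items():
        members = np.flatnonzero(stratum_labels == stratum)
        # Simple random sample without replacement inside the stratum.
        chosen.append(rng.choice(members, size=n_draw, replace=False))
    return np.concatenate(chosen)

rng = np.random.default_rng(0)
# Hypothetical population: 9000 units in stratum 0 and 1000 units in stratum 1.
population_strata = np.repeat([0, 1], [9000, 1000])
sample_idx = standard_stratified_sample(population_strata, {0: 500, 1: 500}, rng)

population_share = population_strata.mean()          # share of stratum 1 in the population
sample_share = population_strata[sample_idx].mean()  # share of stratum 1 in the sample
print(population_share, sample_share)
```

Because the researcher fixes the number of draws per stratum, stratum 1 makes up half of the sample even though it is a tenth of the population.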

A multinomial sampling scheme is similar to SS sampling. The difference is that in multinomial sampling a stratum is first chosen at random, and then an observation is drawn at random from that stratum. Although this kind of sampling is not common in practice, it is theoretically easier to deal with because it produces iid observations (see Wooldridge 1999).

In VP sampling, also known in the literature as Bernoulli sampling, an observation is first drawn at random from the population, and the researcher then determines its stratum. The observation is then kept in the sample with a stratum-specific probability that is set by the researcher. If an observation is discarded, its values are not recorded.
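The VP scheme can be sketched in a few lines of Python. Everything numeric here is a hypothetical choice for illustration: the lognormal income distribution, the 60,000 cutoff that defines the two strata, and the retention probabilities 0.2 and 0.9. The inverse retention probabilities are used as weights, which is one common way to undo the unequal retention.

```python
import numpy as np

rng = np.random.default_rng(1)
n_draws = 20000
income = rng.lognormal(mean=10.5, sigma=0.8, size=n_draws)  # hypothetical incomes

# Step 1: each unit is drawn at random, then its stratum is determined.
stratum = (income > 60000).astype(int)   # hypothetical two-stratum rule
keep_prob = np.array([0.2, 0.9])         # retention probability per stratum

# Step 2: keep the unit with the probability assigned to its stratum.
p = keep_prob[stratum]
kept = rng.random(n_draws) < p
sample_income = income[kept]
weights = 1.0 / p[kept]                  # inverse retention probabilities

raw_mean = sample_income.mean()
weighted_mean = np.average(sample_income, weights=weights)
population_mean = income.mean()
print(population_mean, raw_mean, weighted_mean)
```

In this setup the high-income stratum is retained far more often, so the raw sample mean overstates the mean of all drawn incomes, while the weighted mean stays close to it.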

The SS sampling scheme is often used when observations from each stratum are easily identified before sampling. On the other hand, if, for example, one of the variables of interest is family income, which is difficult to determine before sampling, a VP sampling design is more suitable.

Stratification can be based on exogenous variables, endogenous variables, or both. Which case applies is determined only after an econometric model has been chosen. Stratification based on exogenous variables of the chosen model does not cause a serious problem; one can ignore it and still obtain consistent estimates of the parameters of interest. See, for example, DuMouchel and Duncan (1983) and Manski and McFadden (1981). Ignoring endogenous stratification, on the other hand, raises serious issues: in general it leads to inconsistent estimates of the parameters and their variances. Among others, the problem has been investigated by Hausman and Wise (1981) in a linear model, and by Manski and Lerman (1977), Manski and McFadden (1981), and Imbens (1992) in discrete models. Wooldridge (1999, 2001) extends the results to M-estimators.
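A small simulation can make the inconsistency under endogenous stratification concrete. This is a sketch under stated assumptions, not the project's estimator: the linear model with slope 2, the VP-style retention rule based on the outcome `y`, and the probabilities 0.9 and 0.1 are all hypothetical. Weighted least squares is computed by scaling each row by the square root of its weight.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)   # hypothetical model, true slope 2

# Endogenous stratification: the retention probability depends on the outcome y.
p = np.where(y > 1.0, 0.9, 0.1)
kept = rng.random(n) < p

X = np.column_stack([np.ones(kept.sum()), x[kept]])
ys = y[kept]
w = 1.0 / p[kept]                        # inverse retention probabilities

# Unweighted OLS ignores the sampling design.
beta_unweighted = np.linalg.lstsq(X, ys, rcond=None)[0]

# Weighted OLS: scale rows by sqrt(w) so least squares minimizes sum w_i * e_i^2.
sw = np.sqrt(w)
beta_weighted = np.linalg.lstsq(X * sw[:, None], ys * sw, rcond=None)[0]

print(beta_unweighted, beta_weighted)
```

With this design the unweighted slope is pulled away from 2, while the weighted slope stays close to it. If the retention rule were based on `x` instead of `y`, both estimators would recover the slope in this model.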

A review of the literature shows that both statisticians and econometricians have studied the subject; here we name only a few. DuMouchel and Duncan (1983) and Manski and McFadden (1981) study the effect of exogenous stratification in a linear model under SS sampling and in maximum likelihood under multinomial sampling, respectively. Wooldridge (1999, 2001) studies the case in an M-estimator framework and demonstrates it for VP and SS sampling, respectively.

In practice, combinations of these sampling methods are also commonly used. For instance, the Panel Study of Income Dynamics (PSID) involves both stratification and clustering. Bhattacharya (2005) describes a multi-stage sampling design in which SS sampling is used at the first level to choose some clusters in each stratum by simple random sampling, and then a few observations are chosen from each sampled cluster, again by random sampling. In this scheme, clusters are defined as contiguous groups of units within a stratum. For example, villages can be considered clusters in rural areas and blocks or neighborhoods in urban areas, with households as the unit observations in both cases.

The issue of efficiency in the context of stratified sampling has also been studied. Cosslett (1981a, 1981b, and 1993) and Imbens (1992) examine efficiency for discrete choice models. Imbens and Lancaster (1996) develop a new estimator for this estimation problem and show that it achieves the semi-parametric efficiency bound in this case.

This paper contributes to the literature on non-random sampling by studying efficiency in panel data models when the data set comes from stratified samples. The paper takes into account the correlation within each panel and in each stratum under a GMM-based framework. The theoretical development shows that accounting for the correlation within the panels in each stratum, and combining the strata with appropriate weights, produces more efficient estimators. As with generalized estimating equations (GEE), a specific form can be assumed for the correlation within the panels in each stratum.

Simulation results confirm that the new GMM estimators, which we call weighted and unweighted GLS, are more efficient than their competitors, OLS and weighted OLS, which simply overlook the correlation within the panels. Under endogenous stratification, weighted GLS performs best; under exogenous stratification, unweighted GLS does. For a given sample size, the efficiency gain depends on the form chosen for the correlation and on how strong or weak that correlation is.
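The source of the efficiency gain can be illustrated with a small Monte Carlo sketch in Python. This is not the estimator developed in the project: the helper `panel_gls`, its optional `weights` argument, the exchangeable covariance, and all parameter values are assumptions made for illustration, and the simulation uses simple random sampling with the true covariance matrix treated as known.

```python
import numpy as np

def panel_gls(X, y, omega, weights=None):
    """Pooled GLS across panels with a common T x T working covariance.

    X: (N, T, K) regressors, y: (N, T) outcomes, omega: (T, T) covariance,
    weights: optional (N,) panel weights (hypothetical weighting rule).
    """
    n_panels = X.shape[0]
    w = np.ones(n_panels) if weights is None else np.asarray(weights, dtype=float)
    omega_inv = np.linalg.inv(omega)
    # A = sum_i w_i X_i' Omega^{-1} X_i and b = sum_i w_i X_i' Omega^{-1} y_i
    A = np.einsum("i,itk,ts,isl->kl", w, X, omega_inv, X)
    b = np.einsum("i,itk,ts,is->k", w, X, omega_inv, y)
    return np.linalg.solve(A, b)

rng = np.random.default_rng(3)
N, T, reps = 200, 5, 300
sigma_c2, sigma_u2 = 4.0, 1.0   # hypothetical variance components
omega = sigma_u2 * np.eye(T) + sigma_c2 * np.ones((T, T))  # exchangeable form

slopes_pols, slopes_gls = [], []
for _ in range(reps):
    x = rng.normal(size=(N, T))
    c = rng.normal(scale=np.sqrt(sigma_c2), size=(N, 1))    # unit effect
    y = 1.0 + 2.0 * x + c + rng.normal(scale=np.sqrt(sigma_u2), size=(N, T))
    X = np.stack([np.ones((N, T)), x], axis=2)
    slopes_pols.append(panel_gls(X, y, np.eye(T))[1])  # identity covariance = pooled OLS
    slopes_gls.append(panel_gls(X, y, omega)[1])

var_pols, var_gls = np.var(slopes_pols), np.var(slopes_gls)
print(var_pols, var_gls)
```

Across replications the GLS slope varies less than the pooled OLS slope because it uses the within-panel correlation. Lowering `sigma_c2` weakens that correlation and shrinks the gap, which matches the remark that the gain depends on the form and strength of the correlation.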

| Short title | Asymptotic Efficiency in Panel Models with Survey Data: Application to the PSID Family Income |
|---|---|
| Status | Active |
| Effective start/end date | 3/20/18 → 12/31/20 |

### Fingerprint

Panel model

Panel study

Survey data

Family income

Sampling

Income

Asymptotic efficiency

Stratified sampling

Estimator

Generalized least squares

Ordinary least squares

M-estimator

Efficiency gains

Random sampling

Generalized method of moments

Exogenous variables

Empirical study

Proportion

Competitors

Generalized estimating equations