********************(c) Jan Kabatek, Universiteit van Tilburg ***************************
* This is the resulting do-file for the 1st tutorial of Econometrics 1 for research masters. *
* *
* Contents: simulating data, descriptive statistics, regression output (graphics *
* are commented to speed up the computation) *
* *
* IMPORTANT: before asking me about a specific command in this code (or any code), *
* consult a HELP FILE by typing: help commandname into the Stata console. *
* *
* Version 19.9.2011 , j.kabatek@uvt.nl, K606 (should you find any typos, let me know) *
*****************************************************************************************
*preliminary commands
capture mkdir "M:\_STATA" //mkdir creates a directory.
capture mkdir "M:\_STATA\Econometrics1" //capture suppresses the error raised if the directory already exists
capture mkdir "M:\_STATA\Econometrics1\tut1"
capture cd "M:\_STATA\Econometrics1\tut1" //set the working directory for input/output files
clear all //clear any data which is still in memory
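capture log close //close any log file that is still open
log using metrics1_tut1.txt, replace text //start a new plain-text log file
set more off //do not pause the output with "more" prompts
set mem 250m //memory allocation (needed in Stata 11 and earlier; ignored by Stata 12 and later)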
set seed 123456 //change the seed to your ANR number (also in homeworks!)
set obs 1000 //how many observations are going to be in the dataset (not needed if we use external datasets)
*simulating data
gen x = rnormal(1,0.5) //drawing from normal distribution with mean 1 and s.d. 0.5
* browse
gen e = rnormal(0,5)
gen y = 5 + 5*x + e
* gen y = 5 + 5*x + 100*e //errors get more pronounced
* gen y = 5 + 5*x + e*x //errors are now heteroskedastic (betas are consistent, but not efficient -> robust standard errors are necessary for valid inference)
*labelling variables
label variable x "regressor"
label variable y "regressand"
label variable e "error"
*descriptive statistics
describe //see labels, data description (shortcut "d")
summarize //see statistics (shortcut "sum")
sum if x>1, detail //more statistics, only observations which have x>1
sum if x>1 & x<=2 //multiple conditions are connected by "&" (valid relational operators: > < >= <= == != ~=)
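*conditions can also be joined by "|" (or); the cut-offs below are arbitrary values chosen for illustration
sum y if x<0.5 | x>1.5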
tab y //frequency table of the values of y (sorted)
*graphs
* hist x //histogram
* hist y
* twoway (scatter y x) //basic scatterplot of two variables
*OLS and playing with the output
reg y x //OLS regression
reg y x, robust //heteroskedasticity-robust standard errors: the coefficients are unchanged, only the s.e. (and hence significance levels) change
predict yhat //fitted values yhat=XB
predict ehat, resid //fitted residuals ehat=y - yhat
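*sanity check, as a sketch: the residuals should equal y - yhat (the 1e-3 tolerance is an assumption that allows for float rounding)
assert abs(ehat - (y - yhat)) < 1e-3 //no output means the check passed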
corr ehat e //the correlation between the fitted residuals and the true errors is below 1 because of finite sample properties
* twoway (scatter y yhat x) //checking how our regression performs on the data
*more useful commands
preserve //stores a copy of the current dataset so that it can be restored later
keep yhat ehat //drops all variables except those in the list
drop * //drops the variables in the list (a star symbol stands for all variables)
restore //restores the dataset which was preserved earlier
save metrics1_tut1.dta, replace //save your dataset ("replace" overwrites an existing file of the same name)
************************************************************************************************
*moving to an example dataset
clear
sysuse bplong.dta
describe
tab sex agegrp //two-way frequency table
egen meanbp=mean(bp) //generating variables containing statistics (look into the help file!)
egen maxbp=max(bp) //...
*generating dummies representing different age groups
/*1*/ gen age1=0 //painful way... repeat for age2, age3
replace age1=1 if agegrp==1
/*2*/ gen agex1=(agegrp==1) //...less painful way... repeat for age2, age3
/*3*/ tab agegrp, gen(age_) //...now we're talking!!
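*a quick check, as a sketch (assuming age1, agex1 and age_1 were created as above): the three methods should give identical dummies
assert age1==agex1 & agex1==age_1 //no output means the check passed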
* usage of commands within subgroups of our sample
bys agegrp: sum bp //bys = bysort -> sorts the observations by agegrp and presents statistics within each of the three groups
//similar subgroup analysis can be done with almost all other commands (tab, egen, reg, hist...)
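*for instance, a group-specific mean stored as a variable (meanbp_age is a name chosen here for illustration)
bys agegrp: egen meanbp_age=mean(bp)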
* same thing done through different loop commands ("for" is an older command; foreach and forvalues are its current replacements). For details, see the corresponding help files
/*1*/ for var age_1 age_2 age_3: sum bp if X==1
/*2*/ for var age_*: sum bp if X==1 //a star represents any suffix after "age_" (same as in the MS-DOS command line, if you happen to know it)
/*3*/ forvalues i = 1(1)3 {
    sum bp if age_`i'==1
}
/*4*/ foreach blabla of varlist age_* {
    sum bp if `blabla'==1
}
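*a further variant, as a sketch: loop over the values of agegrp directly, without the dummies ("groups" and "g" are macro names chosen here for illustration)
levelsof agegrp, local(groups) //stores the distinct values of agegrp in the local macro "groups"
foreach g of local groups {
    sum bp if agegrp==`g'
}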
*dataset merging
* merge 1:1 _n using "metrics1_tut1" //an example of merging datasets observation by observation; better look at the options in the menu Data -> Combine datasets -> Merge...
log close //this saves your log-file.