Statistics, Data Science, and Data Mining Discussion Thread (Business Intelligence, Analytics, etc)

otc · Oct 24, 2014

Sounds about right.

Our sas server forces a structure on users.

Client
-Raw Data (stuff in here gets auto-zipped if not accessed in a certain amount of time)
-SAS Data (only sas datasets, default is read/write access for everyone)
---subfolders here for sub-projects
---different sources
---personal interim data/etc
-SAS Programs (Subfolders created for any user who logs into that client with read only access for other users)
---My folder
----Programs sit here
----Optional subfolders (if project gets long and I want to archive stuff away/keep a specific version/etc)
----Output (where I can freely dump temporary PDF/excel type outputs for viewing/emailing without caring about overwriting anything)
-Stata Data
-Stata Programs (similar structure for Stata, but I don't use it so I don't really pay attention)
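
For anyone who wants to mimic that layout on their own machine, here is a rough Python sketch that builds the same kind of skeleton. The subfolder names (`subproject_a`, `my_user`, and so on) are made up for illustration; the post only describes the general shape, and this says nothing about how the SAS server itself creates or enforces it.

```python
from pathlib import Path
import tempfile

# Hypothetical folder names: the post gives the shape of the tree, not exact names.
LAYOUT = {
    "Raw Data": [],
    "SAS Data": ["subproject_a", "personal_interim"],
    "SAS Programs": ["my_user/Output", "my_user/archive"],
    "Stata Data": [],
    "Stata Programs": [],
}


def build_client_tree(root: Path, client: str) -> list[Path]:
    """Create the per-client folder skeleton and return every directory created."""
    created = []
    for top, subs in LAYOUT.items():
        # The empty string stands for the top-level folder itself.
        for rel in [""] + subs:
            path = root / client / top / rel
            path.mkdir(parents=True, exist_ok=True)
            created.append(path)
    return created


if __name__ == "__main__":
    # Build the tree in a throwaway directory and list it.
    with tempfile.TemporaryDirectory() as tmp:
        for p in build_client_tree(Path(tmp), "ClientX"):
            print(p.relative_to(tmp))
```

Permissions (read-only program folders, read/write data folders) and the auto-zipping are left out on purpose, since those depend on the server.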

Beyond that, organization is left up to the user...I typically create my output folder to dump stuff in. Raw Data doesn't always hold the raw data...we have a staff dedicated to converting data, so sometimes they have the originals (but then usually the SAS datasets can be considered exact transcriptions of the raw data). If data cleaning leads to a new dataset (rather than just a block of code that does the cleaning when reading it in), that will usually end up in a different folder. I often end up with my own folder in the SAS data directory for stuff I am playing with where I might want to write out a permanent dataset.

The system keeps people well enough organized. It's not perfect (organization within a user's folder might be terrible, and even if I create my "own" data folder, people might end up saving their datasets in there), but when you are talking about people who aren't formally trained programmers and when there might be 40 different people who have logged in to a client and created a program, it at least means you can tell who did what and have a decent chance of finding someone else's code if you need it.

Unfortunately, it is not conducive to version control, which I would prefer. By default, the server cleans up files it doesn't recognize (often zipping unused non-SAS-program files to save space) and fiddles with permissions, which makes dropping something like a Mercurial repository on the server impossible.

Edit: and I guess I should say, I usually use SAS through Enterprise Guide (although I don't use any of its automated features...just use it to edit code). EG uses project files, so instead of storing a lot of random programs in the directory, I store a project file that contains the code. I might have more than one of these for unrelated tasks that don't share data...and I might make a one-off version when a report goes out that contains only the code needed to produce the numbers in the report (a sort of ghetto version control).
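
The "one-off version when a report goes out" habit can be sketched in a few lines of Python. This is only an illustration of the idea, not what the poster actually runs: the function name, the label, and the `analysis.egp` file name in the usage note are all assumptions.

```python
import shutil
from datetime import date
from pathlib import Path


def snapshot_project(project_file: Path, archive_dir: Path, label: str) -> Path:
    """Copy project_file into archive_dir under a dated, labelled name.

    Refuses to overwrite an existing snapshot, so each report keeps its own copy.
    """
    archive_dir.mkdir(parents=True, exist_ok=True)
    stamp = date.today().isoformat()
    target = archive_dir / f"{project_file.stem}_{label}_{stamp}{project_file.suffix}"
    if target.exists():
        raise FileExistsError(f"refusing to overwrite {target}")
    shutil.copy2(project_file, target)  # copy2 also preserves timestamps
    return target
```

Something like `snapshot_project(Path("analysis.egp"), Path("archive"), "q3_report")` would leave a frozen copy next to the working project. It is still a poor substitute for real version control, since nothing records what changed between snapshots.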

amathew · Oct 29, 2014

^ Wow, thanks for sharing. Out of curiosity, do most people who use SAS perform tasks using the SAS syntax, or do they use the point-and-click variation? I have a copy of SAS Enterprise on my work computer but I've never really bothered to look into it.

otc · Oct 29, 2014

I don't know anyone in our office who regularly uses the point-and-click stuff in EG.

I think it is the kind of thing that is not so simple that someone with no knowledge can just use it (like Tableau)...but not powerful enough that anyone who can actually write code would use it. I am also not sure how much data cleaning and manipulation it is capable of...so it is not useful on random outside data (I could see it being useful if you had clean data that was maintained by another department...and you wanted to do a bit of point-and-click analysis on it).

I've only seen it used or tried to use it myself a couple of times. The one nice thing is that it generates the underlying SAS code. So if I am going to use some graphical or analysis procedure I have never seen before, I could set up the data with code, but then use the point-and-click tool to build up the bones of the procedure. This would tell me the proper syntax, possibly show me options I was not aware of, and structure it in a decent way.
I don't know that that is much of an improvement over just googling something though...

edit: And FWIW, I don't think EG is very good...but, when it comes to interacting with a remote SAS server, I think my options are EG, a slow/laggy/goofy X-forwarded version of interactive SAS, or using the command line to run stuff in batch mode.

Batch mode is OK (and I use it for huge programs), although I don't have any good program editors that do SAS syntax highlighting. But if I want to be able to do things like scroll through a data set in tabular form, look at intermediate datasets without writing them to a file, or run only select lines of a program...EG or Interactive SAS are my only options. And the x-forwarded version of Unix Interactive SAS is really lacking...so I use EG.

clee1982 · Oct 29, 2014

To mythikl: is your data structured? If so, better stick with SQL; if it's unstructured, like all tagging, then maybe a non-relational db is better for you.

amathew · Dec 6, 2014

One thing I always "struggle" with is nonlinear regression in which the response variable is continuous. I have a rough process in place which I go through, but I'm always looking for better ways. In general, my 'philosophy' is to avoid relying on manual parameter transformations, so I end up using generalized additive models and/or smoothing techniques like splines. Usually, I use a general linear model as a baseline, then test out several generalized linear models, generalized additive models, multivariate adaptive regression splines, etc. A lot of this is fairly new to me, as most of my professional career (all two years of it) has been spent on classification problems, so dealing with a continuous response variable is a lot more challenging than I thought it would be.

So...for nonlinear regression in which the response variable is continuous, how do you approach those problems? Any models or smoothing techniques you're partial to?
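
To make the "linear baseline first, then something more flexible" workflow concrete, here is a minimal pure-Python sketch on synthetic data. It compares a straight-line fit with a fixed-knot linear spline built from hinge terms max(0, x - k), the same kind of basis function MARS uses. Unlike MARS, the knots here are picked by hand, and the data, knot positions, and function names are all made up for illustration.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def least_squares(rows, y):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)


def rmse(rows, y, beta):
    """In-sample root mean squared error of the fitted coefficients."""
    resid = [yi - sum(b * v for b, v in zip(beta, r)) for r, yi in zip(rows, y)]
    return math.sqrt(sum(e * e for e in resid) / len(y))


def hinge_basis(x, knots):
    """Intercept, slope, and one hinge term max(0, x - k) per knot."""
    return [1.0, x] + [max(0.0, x - k) for k in knots]


# Synthetic, noise-free curve (an assumption, chosen only to be clearly nonlinear).
xs = [i / 10 for i in range(101)]
ys = [math.sin(x) + 0.1 * x for x in xs]
knots = [2.0, 4.0, 6.0, 8.0]  # hand-picked; MARS would search for these

lin_rows = [[1.0, x] for x in xs]
hin_rows = [hinge_basis(x, knots) for x in xs]
rmse_linear = rmse(lin_rows, ys, least_squares(lin_rows, ys))
rmse_hinge = rmse(hin_rows, ys, least_squares(hin_rows, ys))
print(f"linear RMSE {rmse_linear:.3f}, hinge RMSE {rmse_hinge:.3f}")
```

The in-sample error of the hinge fit can never be worse than the baseline because the linear model is nested inside it, so on real data the comparison should be made on held-out data or with cross-validation.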

EDIT:
a. Decision trees are also a godsend when dealing with non-linear interactions.
b. Support vector regression has been blowing my mind recently

http://www.cvip.uofl.edu/wwwcvip/research/publications/TechReport/SVMRegressionTR.pdf

DaveDr89 · Dec 27, 2014

Great thread.

I agree with one of the previous posts RE the lack of fundamental statistical training among many folks working on data science problems. This also intersects with the generational aspect of how many people use Wikipedia as their principal source for learning. While it is definitely a good source, it should not be the only source. Along those lines, when I interview folks for statistician positions at my company I often ask them the following question: "If you could only bring 3 stat books to your next job, what would they be?" Perhaps the question is a bit dated in a digital world, but nevertheless it is surprising how many candidates cannot name 3 stat books (or machine learning, etc.). For anyone wanting to get into data science I would recommend a heavy emphasis on statistical training so that one can distinguish oneself from the crowd. Classics like Frank Harrell's Regression Modeling Strategies should be high on the list. Finally, the post on Rob Hyndman's web site is relevant to this thread:

http://robjhyndman.com/hyndsight/am-i-a-data-scientist/

otc · Dec 27, 2014

What are these book things you are talking about and why do you need three of them?

DaveDr89 · Dec 27, 2014

There is no right answer to the book question, it's just to see if they can recall any books. If they provide an eclectic list then all the better. In any event, here are a few books I'd put on the list:

Casella & Berger, Statistical Inference
Harrell, Regression Modeling Strategies
Wilcox, Fundamentals of Modern Statistical Methods: Substantially Improving Power and Accuracy
Gelman & Hill, Data Analysis Using Regression and Multilevel/Hierarchical Models
Hastie, Tibshirani, Friedman, The Elements of Statistical Learning
Ruppert, Statistics and Data Analysis for Financial Engineering

Of course, there are tons of other books that could be on the list. The last book, although tilted toward finance, is a very comprehensive statistics book in its own right and also has lots of R code.

amathew · Dec 27, 2014

DaveDr89 said:
There is no right answer to the book question, it's just to see if they can recall any books. If they provide an eclectic list then all the better. In any event, here are a few books I'd put on the list:

Casella & Berger, Statistical Inference
Harrell, Regression Modeling Strategies
Wilcox, Fundamentals of Modern Statistical Methods: Substantially Improving Power and Accuracy
Gelman & Hill, Data Analysis Using Regression and Multilevel/Hierarchical Models
Hastie, Tibshirani, Friedman, The Elements of Statistical Learning
Ruppert, Statistics and Data Analysis for Financial Engineering

Of course, there are tons of other books that could be on the list. The last book, although tilted toward finance, is a very comprehensive statistics book in its own right and also has lots of R code.

Other books (newer ones) worth mentioning...

Categorical Data Analysis - Agresti

Bayesian Data Analysis - Gelman and others

Also, John Fox wrote a good book on Generalized Linear Models, but I forgot the name and actually threw it out when I moved (regret it now)

Then there's Max Kuhn's Applied Predictive Modeling, which is a must-have R/stats book for me as I use the caret package a lot.

fuji · Dec 27, 2014

DaveDr89 said:
There is no right answer to the book question, it's just to see if they can recall any books. If they provide an eclectic list then all the better. In any event, here are a few books I'd put on the list:

Casella & Berger, Statistical Inference
Harrell, Regression Modeling Strategies
Wilcox, Fundamentals of Modern Statistical Methods: Substantially Improving Power and Accuracy
Gelman & Hill, Data Analysis Using Regression and Multilevel/Hierarchical Models
Hastie, Tibshirani, Friedman, The Elements of Statistical Learning
Ruppert, Statistics and Data Analysis for Financial Engineering

Of course, there are tons of other books that could be on the list. The last book, although tilted toward finance, is a very comprehensive statistics book in its own right and also has lots of R code.

Statistics book of the gods.

Did my undergrad in statistics with finance, going to be doing my masters in statistics next year. Focus will be on stochastic calculus, machine learning and time series.

I agree a lot of people doing statistics don't really seem to understand the underlying principles of what they're doing and just know how to analyse data with R or something. My undergrad didn't have me using a computer until the final year; it was pretty much just probability and distribution theory, a lot of maths and some Markov chain stochastic process kind of stuff.

DaveDr89 · Dec 28, 2014

Good luck in grad school. The interesting thing about grad programs in stats is that one can obtain completely different training depending on where one goes (probably more so nowadays as programs broaden their offerings). E.g., RE the books above, if the authors were to give respective short courses on statistics they would not have a great deal in common. Speaking of short courses, I would add this one to any short list:

https://users.soe.ucsc.edu/~draper/eBay-Google-2013.html

fuji · Dec 29, 2014

DaveDr89 said:
Good luck in grad school. The interesting thing about grad programs in stats is that one can obtain completely different training depending on where one goes (probably more so nowadays as programs broaden their offerings). E.g., RE the books above, if the authors were to give respective short courses on statistics they would not have a great deal in common. Speaking of short courses, I would add this one to any short list:

https://users.soe.ucsc.edu/~draper/eBay-Google-2013.html

I suppose the same thing applies to undergrad. After reading this thread I googled principal component analysis and it seems quite important. It's not covered in any undergrad course at my uni and the only masters course that covers it is a course in analysing social science data. We do have to take tonnes of linear algebra though, so it's a pretty easy-to-understand concept.
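
Since PCA really is mostly linear algebra, a bare-bones sketch fits in a few lines of Python: compute the sample covariance matrix, then find its leading eigenvector by power iteration. The five data points are invented, the function names are only for illustration, and in practice you would use an eigen or SVD routine from a numerical library instead.

```python
import math


def covariance_matrix(data):
    """Sample covariance matrix (n - 1 denominator) of row-wise observations."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in data]
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def first_principal_component(cov, iters=500):
    """Power iteration: largest eigenvalue of cov and its unit eigenvector."""
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v' C v gives the eigenvalue for a unit vector v.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p)) for i in range(p))
    return eigenvalue, v


# Made-up data: two strongly correlated columns.
data = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 8.1], [5.0, 9.8]]
cov = covariance_matrix(data)
eigenvalue, direction = first_principal_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained = eigenvalue / total_variance
print("first component direction:", direction)
print("share of variance explained:", explained)
```

The first component is the direction of greatest variance, and the eigenvalue divided by the trace of the covariance matrix is the share of total variance it captures.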

Anyone here work in finance? Doing this MSc most likely and I'd like to know which courses have the most real-life applications.

http://www.lse.ac.uk/statistics/study/prospective/mscstatistics.aspx

amathew · Dec 29, 2014

fuji said:
I suppose the same thing applies to undergrad. After reading this thread I googled principal component analysis and it seems quite important. It's not covered in any undergrad course at my uni and the only masters course that covers it is a course in analysing social science data. We do have to take tonnes of linear algebra though, so it's a pretty easy-to-understand concept.

Anyone here work in finance? Doing this MSc most likely and I'd like to know which courses have the most real-life applications.

http://www.lse.ac.uk/statistics/study/prospective/mscstatistics.aspx

I'm sure there's an undergrad course that teaches exploratory factor analysis, and for many instances that could be enough. Both EFA and PCA are geared towards a similar "type" of problem after all.

amathew · Dec 29, 2014

amathew said:
Making Sense of PCA...
http://stats.stackexchange.com/ques...l-component-analysis-eigenvectors-eigenvalues

What are the differences between factor analysis and pca
http://stats.stackexchange.com/ques...ctor-analysis-and-principal-component-analysi

VinnyMac · Dec 29, 2014

amathew said:
I'm sure there's an undergrad course that teaches exploratory factor analysis, and for many instances that could be enough. Both EFA and PCA are geared towards a similar "type" of problem after all.

Great thread guys. I just came across it. The type of topics discussed on SF never stop surprising me.

In response to the above, how do you differentiate between EFA and PCA? I hear people reference PCA as something different from Factor Analysis quite a bit; my understanding is that it's incorrect to do so, but I'm curious to see what you think.

My understanding is that Factor Analysis (whether Exploratory or Confirmatory) is the general technique. Component Analysis (also PCA) and Common Factor Analysis are two methods of extracting factors for Factor Analysis, not separate techniques.

Let's compare that to Multiple Regression analysis. MR is the analytical technique. Stepwise Estimation and Forward Addition/Backwards Elimination are model estimation methods (similar to PCA's role in Factor Analysis). No one refers to Stepwise Estimation as its own technique; it's one option that you can use to create a MR model, but people (erroneously) refer to PCA as a separate technique from Factor Analysis, rather than one possible extraction method that can be used for Factor Analysis.

Exploratory and Confirmatory Factor Analysis are uses of Factor Analysis for certain ends. PCA is an extraction method for Factor Analysis, not a separate technique "geared towards a similar 'type' of problem."
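
For reference, the usual textbook way to write the two things being compared is sketched below. The notation is an assumption, not something from the thread: $\Sigma$ is the covariance matrix of the observed variables, $V$ and $D$ hold its eigenvectors and eigenvalues, $L$ is the loading matrix, and $\Psi$ is the diagonal matrix of unique variances.

```latex
% PCA: exact eigendecomposition of the covariance matrix, truncated to k components
\Sigma = V D V^{\top}, \qquad \Sigma \approx V_k D_k V_k^{\top}

% Common factor model: k latent factors plus variable-specific noise
x = \mu + L f + \varepsilon, \qquad
\operatorname{Cov}(f) = I, \quad \operatorname{Cov}(\varepsilon) = \Psi
\;\Longrightarrow\; \Sigma = L L^{\top} + \Psi

% "Principal component" extraction: take the loadings from the truncated eigendecomposition
L = V_k D_k^{1/2}
```

Read this way, using principal components as an extraction method amounts to setting $L = V_k D_k^{1/2}$ and not modelling $\Psi$ separately, which is roughly where the two views in this thread meet: it is one way to get loadings, but it does not estimate unique variances the way common factor analysis does.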

Thoughts?
