
Statistics, Data Science, and Data Mining Discussion Thread (Business Intelligence, Analytics, etc)

otc

Stylish Dinosaur
Joined
Aug 15, 2008
Messages
24,539
Reaction score
19,196
Sounds about right.

Our SAS server forces a structure on users.

Client
-Raw Data (stuff in here gets auto-zipped if not accessed in a certain amount of time)
-SAS Data (only SAS datasets, default is read/write access for everyone)
---subfolders here for sub-projects
---different sources
---personal interim data/etc
-SAS Programs (Subfolders created for any user who logs into that client with read only access for other users)
---My folder
----Programs sit here
----Optional subfolders (if a project gets long and I want to archive stuff away/keep a specific version/etc.)
----Output (where I can freely dump temporary PDF/Excel type outputs for viewing/emailing without caring about overwriting anything)
-Stata Data
-Stata Programs (similar structure for Stata, but I don't use it so I don't really pay attention)
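
For readers who want to reproduce a layout like the one listed above, here is a minimal sketch in Python. It is not the server's actual provisioning logic, which the post does not describe; the client name `acme` and user name `otc` are placeholders, and the tree is built in a temporary directory so nothing permanent is created.

```python
from pathlib import Path
import tempfile

# Folder names follow the layout described in the post; "acme" and "otc"
# are hypothetical client and user names used only for illustration.
CLIENT_FOLDERS = [
    "Raw Data",
    "SAS Data",
    "SAS Programs",
    "Stata Data",
    "Stata Programs",
]


def create_client_tree(root: Path, client: str, user: str) -> Path:
    """Create the per-client skeleton plus one user's program folder."""
    client_dir = root / client
    for name in CLIENT_FOLDERS:
        (client_dir / name).mkdir(parents=True, exist_ok=True)
    # Each user gets a personal folder under "SAS Programs" with an
    # Output subfolder for throwaway PDF/Excel files.
    (client_dir / "SAS Programs" / user / "Output").mkdir(parents=True, exist_ok=True)
    return client_dir


def list_tree(client_dir: Path) -> list[str]:
    """Return every directory under client_dir as a sorted relative path."""
    return sorted(
        p.relative_to(client_dir).as_posix()
        for p in client_dir.rglob("*")
        if p.is_dir()
    )


with tempfile.TemporaryDirectory() as tmp:
    tree = list_tree(create_client_tree(Path(tmp), "acme", "otc"))
    for entry in tree:
        print(entry)
```

The read-only permissions and auto-zipping mentioned in the listing are left out, since they depend on how the server itself is configured.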


Beyond that, organization is left up to the user...I typically create my output folder to dump stuff in. Raw Data doesn't always hold the raw data...we have a staff dedicated to converting data, so sometimes they have the originals (but then usually the SAS datasets can be considered exact transcriptions of the raw data). If data cleaning leads to a new dataset (rather than just a block of code that does the cleaning when reading it in), that will usually end up in a different folder. I often end up with my own folder in the SAS data directory for stuff I am playing with where I might want to write out a permanent dataset.

The system keeps people well enough organized. It's not perfect (organization within a user's folder might be terrible, and even if I create my "own" data folder, other people might end up saving their datasets in there), but when you are talking about people who aren't formally trained programmers, and when 40 different people might have logged in to a client and created a program, it at least means you can tell who did what and have a decent chance of finding someone else's code if you need it.

Unfortunately, it is not conducive to version control, which I would prefer. By default, the server cleans up files it doesn't recognize (often zipping unused non-SAS-program files to save space) and fiddles with permissions, which makes dropping something like a Mercurial repository on the server impossible.

Edit: and I guess I should say, I usually use SAS through Enterprise Guide (although I don't use any of its automated features...I just use it to edit code). EG uses project files, so instead of storing a lot of random programs in the directory, I store a project file that contains the code. I might have more than one of these for unrelated tasks that don't share data...and I might make a one-off version when a report goes out that contains only the code needed to produce the numbers in the report (a crude sort of version control).
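
The one-off copy made when a report goes out could be scripted along these lines. This is only a sketch under assumptions: the file name `analysis.egp`, the label, and the date are invented for the example, and the function copies the whole project file, whereas trimming it down to only the report's code (as the post describes) would still be a manual step.

```python
from datetime import date
from pathlib import Path
import shutil
import tempfile


def snapshot_project(project: Path, label: str, when: date) -> Path:
    """Copy a project file to a dated, labelled sibling and return the copy."""
    target = project.with_name(f"{project.stem}_{label}_{when:%Y%m%d}{project.suffix}")
    if target.exists():
        # Refuse to overwrite: a frozen copy should never change.
        raise FileExistsError(target)
    shutil.copy2(project, target)
    return target


with tempfile.TemporaryDirectory() as tmp:
    # "analysis.egp" is a stand-in file name, not a real project.
    project = Path(tmp) / "analysis.egp"
    project.write_text("placeholder project contents")
    frozen = snapshot_project(project, "report", date(2015, 1, 31))
    frozen_name = frozen.name
    frozen_matches = frozen.read_text() == project.read_text()
    print(frozen_name)
```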
 

amathew

Distinguished Member
Joined
Nov 4, 2011
Messages
1,501
Reaction score
228
^ Wow, thanks for sharing. Out of curiosity, do most people who use SAS perform tasks using SAS syntax, or do they use the point-and-click version? I have a copy of SAS Enterprise Guide on my work computer but I've never really bothered to look into it.
 

otc

Stylish Dinosaur
Joined
Aug 15, 2008
Messages
24,539
Reaction score
19,196
I don't know anyone in our office who regularly uses the point-and-click stuff in EG.

I think it is the kind of thing that is not so simple that someone with no knowledge can just use it (like Tableau)...but not powerful enough that anyone who can actually write code would use it. I am also not sure how much data cleaning and manipulation it is capable of...so it is not useful on random outside data (I could see it being useful if you had clean data that was maintained by another department...and you wanted to do a bit of point-and-click analysis on it).

I've only seen it used or tried to use it myself a couple of times. The one nice thing is that it generates the underlying SAS code. So if I am going to use some graphical or analysis procedure I have never seen before, I could set up the data with code, but then use the point-and-click tool to build up the bones of the procedure. This would tell me the proper syntax, possibly show me options I was not aware of, and structure it in a decent way.
I don't know that that is much of an improvement over just googling something though...

edit: And FWIW, I don't think EG is very good...but when it comes to interacting with a remote SAS server, I think my options are EG, a slow/laggy/goofy X-forwarded version of interactive SAS, or using the command line to run stuff in batch mode.

Batch mode is OK (and I use it for huge programs), although I don't have any good program editors that do SAS syntax highlighting. But if I want to be able to do things like scroll through a dataset in tabular form, look at intermediate datasets without writing them to a file, or run only select lines of a program...EG or interactive SAS are my only options. And the X-forwarded version of Unix interactive SAS is really lacking...so I use EG.
 

clee1982

Stylish Dinosaur
Joined
Feb 22, 2009
Messages
28,974
Reaction score
24,811
To mythikl: is your data structured? If so, better stick with SQL. If it's unstructured, like free-form tagging, then maybe a non-relational DB is better for you.
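
To make the structured/unstructured distinction concrete, here is a small illustrative sketch in Python. The table, column names, and tag values are all invented; SQLite stands in for "SQL" and a list of JSON documents stands in for a non-relational store, purely as an assumption for the example.

```python
import json
import sqlite3

# Structured case: every record has the same columns, so a relational
# table and SQL fit naturally. Table and column names are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 120.0), ("bob", 35.5), ("alice", 60.0)],
)
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
conn.close()
print(totals)

# Unstructured case: free-form tags differ per record, so each item is
# kept as a schemaless document (here, JSON strings in a plain list).
documents = [
    json.dumps({"id": 1, "tags": ["tweed", "jacket"]}),
    json.dumps({"id": 2, "tags": ["denim"], "note": "raw selvedge"}),
    json.dumps({"id": 3}),
]
jacket_ids = [
    doc["id"]
    for doc in map(json.loads, documents)
    if "jacket" in doc.get("tags", [])
]
print(jacket_ids)
```

The second half works only because each document is free to omit or add fields; forcing those records into fixed columns is where a relational schema starts to feel awkward.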
 

amathew

Distinguished Member
Joined
Nov 4, 2011
Messages
1,501
Reaction score
228
One thing I always "struggle" with is nonlinear regression in which the response variable is continuous. I have a rough process in place which I go through, but I'm always looking for better ways. In general, my 'philosophy' is to avoid relying on manual parameter transformations, so I end up using generalized additive models and/or smoothing techniques like splines. Usually, I use a general linear model as a baseline, then test out several generalized linear models, generalized additive models, multivariate adaptive regression splines, etc. A lot of this is fairly new to me, as most of my professional career (all two years of it) has been spent on classification problems, so dealing with a continuous response variable is a lot more challenging than I thought it would be.

So...for nonlinear regression in which the response variable is continuous, how do you approach those problems? Any models or smoothing techniques you're partial to?

EDIT:
a. Decision trees are also a godsend when dealing with non-linear interactions.
b. Support vector regression has been blowing my mind recently
http://www.cvip.uofl.edu/wwwcvip/research/publications/TechReport/SVMRegressionTR.pdf
 

DaveDr89

Senior Member
Joined
Oct 25, 2007
Messages
264
Reaction score
0
Great thread.

I agree with one of the previous posts re: the lack of fundamental statistical training among many folks working on data science problems. This also intersects with the generational tendency to use Wikipedia as a principal source for learning. While it is definitely a good source, it should not be the only one. Along those lines, when I interview folks for statistician positions at my company I often ask them the following question: "If you could only bring 3 stat books to your next job, what would they be?" Perhaps the question is a bit dated in a digital world, but it is nevertheless surprising how many candidates cannot name 3 stat books (or machine learning books, etc.). For anyone wanting to get into data science I would recommend a heavy emphasis on statistical training so that they can distinguish themselves from the crowd. Classics like Frank Harrell's Regression Modeling Strategies should be high on the list. Finally, the post on Rob Hyndman's web site is relevant to this thread:

http://robjhyndman.com/hyndsight/am-i-a-data-scientist/
 

otc

Stylish Dinosaur
Joined
Aug 15, 2008
Messages
24,539
Reaction score
19,196
What are these book things you are talking about and why do you need three of them?
 

DaveDr89

Senior Member
Joined
Oct 25, 2007
Messages
264
Reaction score
0
There is no right answer to the book question; it's just to see if they can recall any books. If they provide an eclectic list then all the better. In any event, here are a few books I'd put on the list:

Casella & Berger, Statistical Inference
Harrell, Regression Modeling Strategies
Wilcox, Fundamentals of Modern Statistical Methods: Substantially Improving Power and Accuracy
Gelman & Hill, Data Analysis Using Regression and Multilevel/Hierarchical Models
Hastie, Tibshirani & Friedman, The Elements of Statistical Learning
Ruppert, Statistics and Data Analysis for Financial Engineering

Of course, there are tons of other books that could be on the list. The last book, although tilted toward finance, is a very comprehensive statistics book in its own right and also has lots of R code.
 

amathew

Distinguished Member
Joined
Nov 4, 2011
Messages
1,501
Reaction score
228
Other books (newer ones) worth mentioning...

Categorical Data Analysis - Agresti

Bayesian Data Analysis - Gelman and others

Also, John Fox wrote a good book on Generalized Linear Models, but I forgot the name and actually threw it out when I moved (regret it now)

Then there's Max Kuhn's Applied Predictive Modeling, which is a must-have R/stats book for me as I use the caret package a lot.
 

fuji

Distinguished Member
Joined
Sep 5, 2008
Messages
7,050
Reaction score
1,434

Statistics book of the gods.


Did my undergrad in statistics with finance, going to be doing my masters in statistics next year. Focus will be on stochastic calculus, machine learning and time series.


I agree that a lot of people doing statistics don't really seem to understand the underlying principles of what they're doing and just know how to analyse data with R or something. My undergrad didn't have me using a computer until the final year; it was pretty much just probability and distribution theory, a lot of maths, and some Markov chain/stochastic process kind of stuff.
 

DaveDr89

Senior Member
Joined
Oct 25, 2007
Messages
264
Reaction score
0
Good luck in grad school. The interesting thing about grad programs in stats is that one can obtain completely different training depending on where one goes (probably more so nowadays as programs broaden their offerings). E.g., re: the books above, if the authors were to give their own short courses on statistics, the courses would not have a great deal in common. Speaking of short courses, I would add this one to any short list:

https://users.soe.ucsc.edu/~draper/eBay-Google-2013.html
 

fuji

Distinguished Member
Joined
Sep 5, 2008
Messages
7,050
Reaction score
1,434

I suppose the same thing applies to undergrad. After reading this thread I googled principal component analysis and it seems quite important. It's not covered in any undergrad course at my uni, and the only masters course that covers it is a course in analysing social science data. We do have to take tonnes of linear algebra though, so it's a pretty easy concept to understand.


Anyone here work in finance? I'm most likely doing this MSc and I'd like to know which courses have the most real-life applications.

http://www.lse.ac.uk/statistics/study/prospective/mscstatistics.aspx
 

amathew

Distinguished Member
Joined
Nov 4, 2011
Messages
1,501
Reaction score
228
I'm sure there's an undergrad course that teaches exploratory factor analysis, and for many instances that could be enough. Both EFA and PCA are geared towards a similar "type" of problem after all.
 

VinnyMac

Distinguished Member
Joined
Sep 15, 2012
Messages
1,865
Reaction score
144
Great thread guys. I just came across it. The type of topics discussed on SF never stop surprising me.

In response to the above, how do you differentiate between EFA and PCA? I hear people reference PCA as something different from Factor Analysis quite a bit; my understanding is that it's incorrect to do so, but I'm curious to see what you think.

My understanding is that Factor Analysis (whether Exploratory or Confirmatory) is the general technique. Component Analysis (also PCA) and Common Factor Analysis are two methods of extracting factors for Factor Analysis, not separate techniques.

Let's compare that to Multiple Regression analysis. MR is the analytical technique. Stepwise Estimation and Forward Addition/Backwards Elimination are model estimation methods (similar to PCA's role in Factor Analysis). No one refers to Stepwise Estimation as its own technique; it's one option that you can use to create an MR model. Yet people (erroneously) refer to PCA as a separate technique from Factor Analysis, rather than as one possible extraction method that can be used for Factor Analysis.

Exploratory and Confirmatory Factor Analysis are uses of Factor Analysis for certain ends. PCA is an extraction method for Factor Analysis, not a separate technique "geared towards a similar 'type' of problem."

Thoughts?
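
The two extraction methods contrasted in this post can be put side by side in a short sketch. It does not settle the terminology question; it only shows, under assumptions, what differs mechanically. The correlation matrix is invented, a single factor is extracted, power iteration replaces a proper eigensolver, and the starting communalities use the largest off-diagonal correlation in each row, which is one simple heuristic among several.

```python
import math

# Hypothetical correlation matrix for three observed variables.
R = [
    [1.0, 0.6, 0.5],
    [0.6, 1.0, 0.4],
    [0.5, 0.4, 1.0],
]


def leading_eigen(matrix, iterations=200):
    """Power iteration: largest eigenvalue and unit eigenvector of a symmetric matrix."""
    n = len(matrix)
    vec = [1.0 / math.sqrt(n)] * n
    value = 0.0
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        value = math.sqrt(sum(p * p for p in product))
        vec = [p / value for p in product]
    return value, vec


def loadings(matrix):
    """One-factor loadings: eigenvector scaled by the square root of its eigenvalue."""
    value, vec = leading_eigen(matrix)
    return value, [math.sqrt(value) * v for v in vec]


# Component extraction: factor the correlation matrix as is (unit diagonal),
# so the component accounts for total variance.
pca_eigenvalue, pca_loadings = loadings(R)

# Common factor extraction (principal axis factoring): replace the diagonal
# with communality estimates and iterate, so only shared variance is factored.
n = len(R)
communalities = [max(abs(R[i][j]) for j in range(n) if j != i) for i in range(n)]
for _ in range(100):
    reduced = [[communalities[i] if i == j else R[i][j] for j in range(n)] for i in range(n)]
    paf_eigenvalue, paf_loadings = loadings(reduced)
    communalities = [l * l for l in paf_loadings]

print("component loadings:    ", [round(l, 3) for l in pca_loadings])
print("common factor loadings:", [round(l, 3) for l in paf_loadings])
```

The only difference between the two branches is the diagonal of the matrix being decomposed, which is why the component solution always accounts for more variance than the common factor solution on the same data.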
 
