Source: http://gastonsanchez.com/how-to/2012/06/17/PCA-in-R/


5 functions to do Principal Components Analysis in R

Principal Component Analysis (PCA) is a multivariate technique that allows us to summarize the systematic patterns of variation in the data.

From a data analysis standpoint, PCA is used for studying one table of observations and variables with the main idea of transforming the observed variables into a set of new variables, the principal components, which are uncorrelated and explain the variation in the data. For this reason, PCA allows us to reduce a “complex” data set to a lower dimension in order to reveal the structures or the dominant types of variation in both the observations and the variables.

PCA in R

In R, there are several functions from different packages that allow us to perform PCA. In this post I’ll show you 5 different ways to do a PCA using the following functions (with their corresponding packages in parentheses):

  • prcomp() (stats)
  • princomp() (stats)
  • PCA() (FactoMineR)
  • dudi.pca() (ade4)
  • acp() (amap)

Brief note: It is no coincidence that the three external packages ("FactoMineR", "ade4", and "amap") have been developed by French data analysts, who have a long tradition of and preference for PCA and other related exploratory techniques.

No matter what function you decide to use, the typical PCA results should consist of a set of eigenvalues, a table with the scores or Principal Components (PCs), and a table of loadings (or correlations between variables and PCs). The eigenvalues provide information about the variability in the data. The scores provide information about the structure of the observations. The loadings (or correlations) allow you to get a sense of the relationships between variables, as well as their associations with the extracted PCs.
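
To make these three outputs concrete, here is a minimal sketch that computes them by hand with base R, assuming the analysis is done on the correlation matrix (that is, on standardized data) and using the USArrests data that ships with R. The object names (X, e, eigenvalues, loadings, scores, correlations) are only illustrative, and the sign of each column may differ from what a given PCA function returns.

```r
# Standardize the variables (mean 0, variance 1)
X <- scale(USArrests)

# Eigen-decomposition of the correlation matrix
e <- eigen(cor(USArrests))

# Eigenvalues: variance carried by each principal component
eigenvalues <- e$values

# Loadings: the eigenvectors, one column per component
loadings <- e$vectors
dimnames(loadings) <- list(colnames(USArrests), paste0("PC", 1:4))

# Scores: standardized data projected onto the loadings
scores <- X %*% loadings

# Correlations between the original variables and the PCs
correlations <- cor(USArrests, scores)

round(eigenvalues, 4)
```

The square roots of these eigenvalues should match the sdev values reported by the functions discussed in the rest of the post.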

The Data

To make things easier, we’ll use the dataset USArrests that already comes with R. It’s a data frame with 50 rows (US states) and 4 columns containing information about violent crime rates by state. Since the variables are usually measured on different scales, the PCA should be performed on standardized data (mean = 0, variance = 1). The good news is that all of the functions that perform PCA come with parameters to specify that the analysis must be applied to standardized data.
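
A quick sketch of why standardization matters for this dataset: the raw variances of the four variables are very different, so without scaling the variable with the largest variance would dominate the first component. The names raw_var and Z are illustrative.

```r
# Raw variances of the four variables
raw_var <- sapply(USArrests, var)
round(raw_var, 1)

# Standardize: every column gets mean 0 and variance 1
Z <- scale(USArrests)
round(colMeans(Z), 10)
apply(Z, 2, var)

# PC1 loadings of an unscaled PCA, to see which variable dominates
round(prcomp(USArrests)$rotation[, 1], 3)
```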

Option 1: using prcomp()

The function prcomp() comes with the default "stats" package, which means that you don’t have to install anything. It is perhaps the quickest way to do a PCA if you don’t want to install other packages.

# PCA with function prcomp
pca1 = prcomp(USArrests, scale. = TRUE)

# sqrt of eigenvalues
pca1$sdev
## [1] 1.5749 0.9949 0.5971 0.4164
# loadings
head(pca1$rotation)
##              PC1     PC2     PC3      PC4
## Murder   -0.5359  0.4182 -0.3412  0.64923
## Assault  -0.5832  0.1880 -0.2681 -0.74341
## UrbanPop -0.2782 -0.8728 -0.3780  0.13388
## Rape     -0.5434 -0.1673  0.8178  0.08902
# PCs (aka scores)
head(pca1$x)
##                PC1     PC2      PC3      PC4
## Alabama    -0.9757  1.1220 -0.43980  0.15470
## Alaska     -1.9305  1.0624  2.01950 -0.43418
## Arizona    -1.7454 -0.7385  0.05423 -0.82626
## Arkansas    0.1400  1.1085  0.11342 -0.18097
## California -2.4986 -1.5274  0.59254 -0.33856
## Colorado   -1.4993 -0.9776  1.08400  0.00145
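
Since prcomp() returns the square roots of the eigenvalues, the share of variance explained by each component has to be derived from them. A minimal sketch (the names eigenvalues, prop_var and cum_var are illustrative):

```r
pca1 <- prcomp(USArrests, scale. = TRUE)

# eigenvalues are the squared standard deviations
eigenvalues <- pca1$sdev^2

# proportion and cumulative proportion of variance explained
prop_var <- eigenvalues / sum(eigenvalues)
cum_var <- cumsum(prop_var)

round(100 * cbind(prop_var, cum_var), 2)
```

summary(pca1) prints the same proportions in a ready-made table.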

Option 2: using princomp()

The function princomp() also comes with the default "stats" package, and it is very similar to its cousin prcomp(). What I don’t like about princomp() is that sometimes it won’t display all the values for the loadings, but this is a minor detail.

# PCA with function princomp
pca2 = princomp(USArrests, cor = TRUE)

# sqrt of eigenvalues
pca2$sdev
## Comp.1 Comp.2 Comp.3 Comp.4 
## 1.5749 0.9949 0.5971 0.4164
# loadings
unclass(pca2$loadings)
##           Comp.1  Comp.2  Comp.3   Comp.4
## Murder   -0.5359  0.4182 -0.3412  0.64923
## Assault  -0.5832  0.1880 -0.2681 -0.74341
## UrbanPop -0.2782 -0.8728 -0.3780  0.13388
## Rape     -0.5434 -0.1673  0.8178  0.08902
# PCs (aka scores)
head(pca2$scores)
##             Comp.1  Comp.2   Comp.3    Comp.4
## Alabama    -0.9856  1.1334 -0.44427  0.156267
## Alaska     -1.9501  1.0732  2.04000 -0.438583
## Arizona    -1.7632 -0.7460  0.05478 -0.834653
## Arkansas    0.1414  1.1198  0.11457 -0.182811
## California -2.5240 -1.5429  0.59856 -0.341996
## Colorado   -1.5146 -0.9876  1.09501  0.001465
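
The scores above are slightly larger in magnitude than those from prcomp(). My understanding from the stats documentation is that princomp() uses the divisor n for the variance while prcomp() uses n - 1, so with cor = TRUE the scores differ by a factor of sqrt(n / (n - 1)). A small sketch to check this on USArrests (the names p1, p2, n and ratio are illustrative; absolute values are compared because component signs are arbitrary):

```r
p1 <- prcomp(USArrests, scale. = TRUE)
p2 <- princomp(USArrests, cor = TRUE)

n <- nrow(USArrests)
ratio <- sqrt(n / (n - 1))

# scores agree once the n versus n - 1 factor is accounted for
all.equal(abs(unname(p2$scores)), abs(unname(p1$x)) * ratio)

# the standard deviations (sqrt of eigenvalues) agree directly
all.equal(unname(p2$sdev), p1$sdev)
```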

Option 3: using PCA()

A highly recommended option, especially if you want more detailed results and assessment tools, is the PCA() function from the package "FactoMineR". In my view it is by far the best PCA function in R, and it comes with a number of parameters that let you fine-tune the analysis.

# PCA with function PCA
library(FactoMineR)

# apply PCA
pca3 = PCA(USArrests, graph = FALSE)

# matrix with eigenvalues
pca3$eig
##        eigenvalue percentage of variance cumulative percentage of variance
## comp 1     2.4802                 62.006                             62.01
## comp 2     0.9898                 24.744                             86.75
## comp 3     0.3566                  8.914                             95.66
## comp 4     0.1734                  4.336                            100.00
# correlations between variables and PCs
pca3$var$coord
##           Dim.1   Dim.2   Dim.3    Dim.4
## Murder   0.8440 -0.4160  0.2038  0.27037
## Assault  0.9184 -0.1870  0.1601 -0.30959
## UrbanPop 0.4381  0.8683  0.2257  0.05575
## Rape     0.8558  0.1665 -0.4883  0.03707
# PCs (aka scores)
head(pca3$ind$coord)
##              Dim.1   Dim.2    Dim.3     Dim.4
## Alabama     0.9856 -1.1334  0.44427  0.156267
## Alaska      1.9501 -1.0732 -2.04000 -0.438583
## Arizona     1.7632  0.7460 -0.05478 -0.834653
## Arkansas   -0.1414 -1.1198 -0.11457 -0.182811
## California  2.5240  1.5429 -0.59856 -0.341996
## Colorado    1.5146  0.9876 -1.09501  0.001465
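
The correlations in pca3$var$coord and the loadings reported by prcomp() are related in a simple way when the PCA is done on standardized data: each loading column multiplied by the square root of its eigenvalue gives the variable-PC correlations (up to sign). A base R sketch that does not need FactoMineR, with illustrative names:

```r
pca <- prcomp(USArrests, scale. = TRUE)

# loadings rescaled by the standard deviation of each component
cor_from_loadings <- sweep(pca$rotation, 2, pca$sdev, "*")

# correlations computed directly from the data and the scores
cor_direct <- cor(USArrests, pca$x)

all.equal(cor_from_loadings, cor_direct)
round(abs(cor_from_loadings[, 1]), 4)
```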

Option 4: using dudi.pca()

Another option is to use the dudi.pca() function from the package "ade4", which offers a large number of other methods as well as some interesting graphics.

# PCA with function dudi.pca
library(ade4)

# apply PCA
pca4 = dudi.pca(USArrests, nf = 5, scannf = FALSE)

# eigenvalues
pca4$eig
## [1] 2.4802 0.9898 0.3566 0.1734
# loadings
pca4$c1
##              CS1     CS2     CS3      CS4
## Murder   -0.5359  0.4182 -0.3412  0.64923
## Assault  -0.5832  0.1880 -0.2681 -0.74341
## UrbanPop -0.2782 -0.8728 -0.3780  0.13388
## Rape     -0.5434 -0.1673  0.8178  0.08902
# correlations between variables and PCs
pca4$co
##            Comp1   Comp2   Comp3    Comp4
## Murder   -0.8440  0.4160 -0.2038  0.27037
## Assault  -0.9184  0.1870 -0.1601 -0.30959
## UrbanPop -0.4381 -0.8683 -0.2257  0.05575
## Rape     -0.8558 -0.1665  0.4883  0.03707
# PCs
head(pca4$li)
##              Axis1   Axis2    Axis3     Axis4
## Alabama    -0.9856  1.1334 -0.44427  0.156267
## Alaska     -1.9501  1.0732  2.04000 -0.438583
## Arizona    -1.7632 -0.7460  0.05478 -0.834653
## Arkansas    0.1414  1.1198  0.11457 -0.182811
## California -2.5240 -1.5429  0.59856 -0.341996
## Colorado   -1.5146 -0.9876  1.09501  0.001465

Option 5: using acp()

A fifth possibility is the acp() function from the package "amap".

# PCA with function acp
library(amap)

# apply PCA
pca5 = acp(USArrests)

# sqrt of eigenvalues
pca5$sdev
## Comp 1 Comp 2 Comp 3 Comp 4 
## 1.5749 0.9949 0.5971 0.4164
# loadings
pca5$loadings
##          Comp 1  Comp 2  Comp 3   Comp 4
## Murder   0.5359  0.4182 -0.3412  0.64923
## Assault  0.5832  0.1880 -0.2681 -0.74341
## UrbanPop 0.2782 -0.8728 -0.3780  0.13388
## Rape     0.5434 -0.1673  0.8178  0.08902
# scores
head(pca5$scores)
##             Comp 1  Comp 2   Comp 3   Comp 4
## Alabama     0.9757  1.1220 -0.43980  0.15470
## Alaska      1.9305  1.0624  2.01950 -0.43418
## Arizona     1.7454 -0.7385  0.05423 -0.82626
## Arkansas   -0.1400  1.1085  0.11342 -0.18097
## California  2.4986 -1.5274  0.59254 -0.33856
## Colorado    1.4993 -0.9776  1.08400  0.00145
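
Comparing the five outputs, the numbers agree up to the sign of each column (and, for the scores, up to the n versus n - 1 scaling). The sign of a principal component is arbitrary, so before comparing results across functions it can help to fix a convention. A minimal sketch, assuming the convention "the entry with the largest absolute value in each column is positive"; align_signs() is a hypothetical helper, not part of any package:

```r
align_signs <- function(loadings) {
  # sign of the entry with the largest absolute value in each column
  s <- apply(loadings, 2, function(v) sign(v[which.max(abs(v))]))
  # flip the columns whose dominant entry is negative
  sweep(loadings, 2, s, "*")
}

a <- prcomp(USArrests, scale. = TRUE)$rotation
b <- a
b[, 1] <- -b[, 1]   # simulate another function returning a flipped PC1

all.equal(align_signs(a), align_signs(b))
```

If the loadings are flipped this way, the corresponding score columns need the same sign changes to stay consistent.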

Of course these are not the only options to do a PCA, but I’ll leave the other approaches for another post.

PCA plots

PCA is very often used to visualize data, and most of the discussed functions come with their own plot functions. But you can also make use of the great graphical displays of "ggplot2". Just to show you a couple of plots, let’s take the basic results from prcomp().

Plot of observations

# load ggplot2
library(ggplot2)

# create data frame with scores
scores = as.data.frame(pca1$x)

# plot of observations
ggplot(data = scores, aes(x = PC1, y = PC2, label = rownames(scores))) +
  geom_hline(yintercept = 0, colour = "gray65") +
  geom_vline(xintercept = 0, colour = "gray65") +
  geom_text(colour = "tomato", alpha = 0.8, size = 4) +
  ggtitle("PCA plot of USA States - Crime Rates")


Circle of correlations

# function to create a circle
circle <- function(center = c(0, 0), npoints = 100) {
    r = 1
    tt = seq(0, 2 * pi, length = npoints)
    xx = center[1] + r * cos(tt)
    yy = center[2] + r * sin(tt)
    return(data.frame(x = xx, y = yy))
}
corcir = circle(c(0, 0), npoints = 100)

# create data frame with correlations between variables and PCs
correlations = as.data.frame(cor(USArrests, pca1$x))

# data frame with arrows coordinates
arrows = data.frame(x1 = c(0, 0, 0, 0), y1 = c(0, 0, 0, 0), x2 = correlations$PC1, 
    y2 = correlations$PC2)

# geom_path will do open circles
ggplot() +
    geom_path(data = corcir, aes(x = x, y = y), colour = "gray65") +
    geom_segment(data = arrows, aes(x = x1, y = y1, xend = x2, yend = y2), colour = "gray65") +
    geom_text(data = correlations, aes(x = PC1, y = PC2, label = rownames(correlations))) +
    geom_hline(yintercept = 0, colour = "gray65") +
    geom_vline(xintercept = 0, colour = "gray65") +
    xlim(-1.1, 1.1) + ylim(-1.1, 1.1) +
    labs(x = "pc1 axis", y = "pc2 axis") +
    ggtitle("Circle of correlations")
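
If ggplot2 is not available, the "stats" package also ships plotting methods for prcomp objects. A short sketch using screeplot() and biplot(), which give a quick look at the eigenvalues and at observations and variables together (the plot title and the cex value are arbitrary choices):

```r
pca1 <- prcomp(USArrests, scale. = TRUE)

# variances (eigenvalues) of the components
screeplot(pca1, type = "lines", main = "Scree plot")

# observations and variables on the first two PCs
biplot(pca1, cex = 0.7)
```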




Posted by Azel.Kim :

Source: https://stat.ethz.ch/pipermail/r-help/2007-February/125860.html



[R] Randomly extract rows from a data frame

Using the 'iris' dataset in R:

# Select 2 random rows
> iris[sample(nrow(iris), 2), ]
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
96          5.7         3.0          4.2         1.2 versicolor
17          5.4         3.9          1.3         0.4     setosa

# Select 5 random rows
> iris[sample(nrow(iris), 5), ]
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
83          5.8         2.7          3.9         1.2 versicolor
12          4.8         3.4          1.6         0.2     setosa
63          6.0         2.2          4.0         1.0 versicolor
80          5.7         2.6          3.5         1.0 versicolor
49          5.3         3.7          1.5         0.2     setosa
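
Because sample() draws at random, the rows returned will differ on every run. A sketch of making the draw reproducible with set.seed() and keeping the complementary rows (the seed value and the names idx, picked and rest are arbitrary choices for illustration):

```r
# fix the random number generator so the same rows are drawn each time
set.seed(123)

# row indices, drawn without replacement by default
idx <- sample(nrow(iris), 5)

picked <- iris[idx, ]    # the 5 sampled rows
rest <- iris[-idx, ]     # everything that was not sampled

nrow(picked)
nrow(rest)
```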



Posted by Azel.Kim :

RStudio keyboard shortcuts

2016. 10. 9. 16:24 from Statistics/R

Source: https://support.rstudio.com/hc/en-us/articles/200711853-Keyboard-Shortcuts


Keyboard Shortcuts


This information is available directly in the RStudio IDE under the Tools menu:

Tools → Keyboard Shortcuts Help.


Console

Description | Windows & Linux | Mac
Move cursor to Console | Ctrl+2 | Ctrl+2
Clear console | Ctrl+L | Ctrl+L
Move cursor to beginning of line | Home | Command+Left
Move cursor to end of line | End | Command+Right
Navigate command history | Up/Down | Up/Down
Popup command history | Ctrl+Up | Command+Up
Interrupt currently executing command | Esc | Esc
Change working directory | Ctrl+Shift+H | Ctrl+Shift+H
 

Source

Description | Windows & Linux | Mac
Goto File/Function | Ctrl+. | Ctrl+.
Move cursor to Source Editor | Ctrl+1 | Ctrl+1
New document (except on Chrome/Windows) | Ctrl+Shift+N | Command+Shift+N
New document (Chrome only) | Ctrl+Alt+Shift+N | Command+Shift+Alt+N
Open document | Ctrl+O | Command+O
Save active document | Ctrl+S | Command+S
Close active document (except on Chrome) | Ctrl+W | Command+W
Close active document (Chrome only) | Ctrl+Alt+W | Command+Option+W
Close all open documents | Ctrl+Shift+W | Command+Shift+W
Preview HTML (Markdown and HTML) | Ctrl+Shift+K | Command+Shift+K
Knit Document (knitr) | Ctrl+Shift+K | Command+Shift+K
Compile Notebook | Ctrl+Shift+K | Command+Shift+K
Compile PDF (TeX and Sweave) | Ctrl+Shift+K | Command+Shift+K
Insert chunk (Sweave and Knitr) | Ctrl+Alt+I | Command+Option+I
Insert code section | Ctrl+Shift+R | Command+Shift+R
Run current line/selection | Ctrl+Enter | Command+Enter
Run current line/selection (retain cursor position) | Alt+Enter | Option+Enter
Re-run previous region | Ctrl+Shift+P | Command+Shift+P
Run current document | Ctrl+Alt+R | Command+Option+R
Run from document beginning to current line | Ctrl+Alt+B | Command+Option+B
Run from current line to document end | Ctrl+Alt+E | Command+Option+E
Run the current function definition | Ctrl+Alt+F | Command+Option+F
Run the current code section | Ctrl+Alt+T | Command+Option+T
Run previous Sweave/Rmd code | Ctrl+Alt+P | Command+Option+P
Run the current Sweave/Rmd chunk | Ctrl+Alt+C | Command+Option+C
Run the next Sweave/Rmd chunk | Ctrl+Alt+N | Command+Option+N
Source a file | Ctrl+Shift+O | Command+Shift+O
Source the current document | Ctrl+Shift+S | Command+Shift+S
Source the current document (with echo) | Ctrl+Shift+Enter | Command+Shift+Enter
Fold Selected | Alt+L | Cmd+Option+L
Unfold Selected | Shift+Alt+L | Cmd+Shift+Option+L
Fold All | Alt+O | Cmd+Option+O
Unfold All | Shift+Alt+O | Cmd+Shift+Option+O
Go to line | Shift+Alt+G | Cmd+Shift+Option+G
Jump to | Shift+Alt+J | Cmd+Shift+Option+J
Switch to tab | Ctrl+Shift+. | Ctrl+Shift+.
Previous tab | Ctrl+F11 | Ctrl+F11
Next tab | Ctrl+F12 | Ctrl+F12
First tab | Ctrl+Shift+F11 | Ctrl+Shift+F11
Last tab | Ctrl+Shift+F12 | Ctrl+Shift+F12
Navigate back | Ctrl+F9 | Cmd+F9
Navigate forward | Ctrl+F10 | Cmd+F10
Extract function from selection | Ctrl+Alt+X | Command+Option+X
Extract variable from selection | Ctrl+Alt+V | Command+Option+V
Reindent lines | Ctrl+I | Command+I
Comment/uncomment current line/selection | Ctrl+Shift+C | Command+Shift+C
Reflow Comment | Ctrl+Shift+/ | Command+Shift+/
Reformat Selection | Ctrl+Shift+A | Command+Shift+A
Show Diagnostics | Ctrl+Shift+Alt+P | Command+Shift+Alt+P
Transpose Letters | | Ctrl+T
Move Lines Up/Down | Alt+Up/Down | Option+Up/Down
Copy Lines Up/Down | Shift+Alt+Up/Down | Command+Option+Up/Down
Jump to Matching Brace/Paren | Ctrl+P | Ctrl+P
Expand to Matching Brace/Paren | Ctrl+Shift+E | Ctrl+Shift+E
Select to Matching Brace/Paren | Ctrl+Shift+Alt+E | Ctrl+Shift+Alt+E
Add Cursor Above Current Cursor | Ctrl+Alt+Up | Ctrl+Alt+Up
Add Cursor Below Current Cursor | Ctrl+Alt+Down | Ctrl+Alt+Down
Move Active Cursor Up | Ctrl+Alt+Shift+Up | Ctrl+Alt+Shift+Up
Move Active Cursor Down | Ctrl+Alt+Shift+Down | Ctrl+Alt+Shift+Down
Find and Replace | Ctrl+F | Command+F
Find Next | Win: F3, Linux: Ctrl+G | Command+G
Find Previous | Win: Shift+F3, Linux: Ctrl+Shift+G | Command+Shift+G
Use Selection for Find | Ctrl+F3 | Command+E
Replace and Find | Ctrl+Shift+J | Command+Shift+J
Find in Files | Ctrl+Shift+F | Command+Shift+F
Check Spelling | F7 | F7
 

Editing (Console and Source)

Description | Windows & Linux | Mac
Undo | Ctrl+Z | Command+Z
Redo | Ctrl+Shift+Z | Command+Shift+Z
Cut | Ctrl+X | Command+X
Copy | Ctrl+C | Command+C
Paste | Ctrl+V | Command+V
Select All | Ctrl+A | Command+A
Jump to Word | Ctrl+Left/Right | Option+Left/Right
Jump to Start/End | Ctrl+Home/End or Ctrl+Up/Down | Command+Home/End or Command+Up/Down
Delete Line | Ctrl+D | Command+D
Select | Shift+[Arrow] | Shift+[Arrow]
Select Word | Ctrl+Shift+Left/Right | Option+Shift+Left/Right
Select to Line Start | Alt+Shift+Left | Command+Shift+Left
Select to Line End | Alt+Shift+Right | Command+Shift+Right
Select Page Up/Down | Shift+PageUp/PageDown | Shift+PageUp/Down
Select to Start/End | Ctrl+Shift+Home/End or Shift+Alt+Up/Down | Command+Shift+Up/Down
Delete Word Left | Ctrl+Backspace | Option+Backspace or Ctrl+Option+Backspace
Delete Word Right | | Option+Delete
Delete to Line End | | Ctrl+K
Delete to Line Start | | Option+Backspace
Indent | Tab (at beginning of line) | Tab (at beginning of line)
Outdent | Shift+Tab | Shift+Tab
Yank line up to cursor | Ctrl+U | Ctrl+U
Yank line after cursor | Ctrl+K | Ctrl+K
Insert currently yanked text | Ctrl+Y | Ctrl+Y
Insert assignment operator | Alt+- | Option+-
Insert pipe operator | Ctrl+Shift+M | Cmd+Shift+M
Show help for function at cursor | F1 | F1
Show source code for function at cursor | F2 | F2
Find usages for symbol at cursor (C++) | Ctrl+Alt+U | Cmd+Option+U
 

Completions (Console and Source)

Description | Windows & Linux | Mac
Attempt completion | Tab or Ctrl+Space | Tab or Command+Space
Navigate candidates | Up/Down | Up/Down
Accept selected candidate | Enter, Tab, or Right | Enter, Tab, or Right
Dismiss completion popup | Esc | Esc
 

Views

Description | Windows & Linux | Mac
Move focus to Source Editor | Ctrl+1 | Ctrl+1
Move focus to Console | Ctrl+2 | Ctrl+2
Move focus to Help | Ctrl+3 | Ctrl+3
Show History | Ctrl+4 | Ctrl+4
Show Files | Ctrl+5 | Ctrl+5
Show Plots | Ctrl+6 | Ctrl+6
Show Packages | Ctrl+7 | Ctrl+7
Show Environment | Ctrl+8 | Ctrl+8
Show Git/SVN | Ctrl+9 | Ctrl+9
Show Build | Ctrl+0 | Ctrl+0
Sync Editor & PDF Preview | Ctrl+F8 | Cmd+F8
Show Keyboard Shortcut Reference | Alt+Shift+K | Option+Shift+K
 

Build

Description | Windows & Linux | Mac
Build and Reload | Ctrl+Shift+B | Cmd+Shift+B
Load All (devtools) | Ctrl+Shift+L | Cmd+Shift+L
Test Package (Desktop) | Ctrl+Shift+T | Cmd+Shift+T
Test Package (Web) | Ctrl+Alt+F7 | Cmd+Alt+F7
Check Package | Ctrl+Shift+E | Cmd+Shift+E
Document Package | Ctrl+Shift+D | Cmd+Shift+D
 

Debug

Description | Windows & Linux | Mac
Toggle Breakpoint | Shift+F9 | Shift+F9
Execute Next Line | F10 | F10
Step Into Function | Shift+F4 | Shift+F4
Finish Function/Loop | Shift+F6 | Shift+F6
Continue | Shift+F5 | Shift+F5
Stop Debugging | Shift+F8 | Shift+F8
 

Plots

Description | Windows & Linux | Mac
Previous plot | Ctrl+Alt+F11 | Command+Option+F11
Next plot | Ctrl+Alt+F12 | Command+Option+F12
 

Git/SVN

Description | Windows & Linux | Mac
Diff active source document | Ctrl+Alt+D | Ctrl+Option+D
Commit changes | Ctrl+Alt+M | Ctrl+Option+M
Scroll diff view | Ctrl+Up/Down | Ctrl+Up/Down
Stage/Unstage (Git) | Spacebar | Spacebar
Stage/Unstage and move to next (Git) | Enter | Enter
 

Session

Description | Windows & Linux | Mac
Quit Session (desktop only) | Ctrl+Q | Command+Q
Restart R Session | Ctrl+Shift+F10 | Command+Shift+F10


Posted by Azel.Kim :

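디랩

http://www.dator.co.kr/


Posts from Korean DB-related forums sometimes appear here as well.



ODPia

http://www.odpia.org


An open data platform for big data analysts, released by LG CNS.

Convenient for viewing public data gathered in one place.

It still seems to need more data.

Related Bloter article (http://www.bloter.net/archives/248598)



Yahoo! open datasets

http://webscope.sandbox.yahoo.com/#datasets


Yahoo! Webscope

http://webscope.sandbox.yahoo.com/


13 terabytes (TB) of free sample data for machine learning researchers.

The sample data consists of information that 20 million Yahoo users sent to the Yahoo news feed from February to May 2015. User information has been anonymized.

It includes data related to Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Movies, and Yahoo Real Estate.

Yahoo also prepared a separate 1.5 TB sample dataset so that the data can be explored more easily.

The sample data is categorized by age, gender, geographic information, and so on.

A Yahoo account is required to download the data.

Data provided through Webscope may be used freely, for research and non-commercial purposes only.

Related Bloter article (http://www.bloter.net/archives/247973)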

Posted by Azel.Kim :

Memo.



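Online training site for data professionals

http://cyber.dbguide.net/


The data analysis courses using Python and R are free.

The other courses are not particularly expensive either.


I haven't looked at the lectures yet, so I can't judge their level.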
