11 posts in 'Statistics'


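Fixing garbled Korean text when reading a txt file in RStudio on a Mac (R)

2017. 9. 5. 15:56 from Statistics/R

One of the problems you run into with RStudio on macOS is Korean text encoding.

I'm not a developer, so I don't know the exact cause,

but while studying R I found that when I load a txt file with the readLines() function,

the Korean text does not come through correctly.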

> txt <- readLines("test.txt")

Warning message:

In readLines("test.txt") : incomplete final line found on 'test.txt'

> head(txt)

[1] "\"\xba\xb8\xb0\xed \xbdʹ\xd9"

[2] "\xc0̷\xb8\xb0\xd4 \xb8\xbb\xc7ϴϱ\xee \xb4\xf5 \xba\xb8\xb0\xed \xbdʹ\xd9"

[3] "\xb3\xca\xc8\xf1 \xbb\xe7\xc1\xf8\xc0\xbb \xba\xb8\xb0\xed \xc0־"

[4] "\xba\xb8\xb0\xed \xbdʹ\xd9"

[5] "\xb3ʹ\xab \xbe\u07fc\xd3\xc7\xd1 \xbdð\xa3"

[6] "\xb3\xaa\xb4\xc2 \xbf츮\xb0\xa1 \xb9Ӵ\xd9"

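Searching turned up a few suggestions,

such as specifying the encoding as an option when reading the file,

or re-creating the txt file and saving it as 'UTF-8' or 'euc-kr',

but none of them fixed it for me.

What did work: I suspected the problem had to do with how RStudio itself detects and handles encodings,

so I created and saved the txt file from within RStudio, and that simply solved it.

I don't know the underlying mechanics...

My thinking was that a file created inside RStudio would be saved in an encoding it can read, and the attempt paid off, haha.

When I checked the preferences, the system default was UTF-8,

so I still don't know why the same problem appeared when I had saved the txt file as UTF-8...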
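The escaped bytes in the output above (for example "\xba\xb8\xb0\xed") look like EUC-KR/CP949 rather than UTF-8, which may be why a UTF-8 default could not decode them. Below is a minimal sketch of converting such text explicitly. It assumes the file really is EUC-KR and that the platform's iconv recognizes that name; the temporary file is a hypothetical stand-in for test.txt, built from the first four bytes shown above.

```r
# Hypothetical stand-in for test.txt: the first four bytes from the output
# above, written without a trailing newline.
path <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0xba, 0xb8, 0xb0, 0xed)), path)

# Read the bytes as-is, then convert them from EUC-KR to UTF-8.
# warn = FALSE only silences the "incomplete final line" warning, which is
# about the missing final newline, not about the encoding.
raw_txt <- readLines(path, warn = FALSE)
txt <- iconv(raw_txt, from = "EUC-KR", to = "UTF-8")

# Unicode code points of the decoded text (two Hangul syllables).
utf8ToInt(txt)
```

Under the same assumption, the encoding can also be declared on the connection, as in readLines(file("test.txt", encoding = "EUC-KR")). If iconv returns NA, the assumed source encoding is probably wrong for that file.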

License: Attribution


Posted by Azel.Kim :

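Machine learning?

2017. 8. 3. 17:08 from Statistics/Data Mining

Machine learning?

: The process by which a computer program learns from given input data, makes predictions, and improves its own predictive performance, together with the technology of studying and building algorithms for that purpose

cf. Data mining

: The process of systematically and automatically discovering meaningful rules or patterns in large stores of data and turning them into knowledge

Machine learning categories

1) Supervised Learning

Analyzes data for which the output corresponding to each input is known, fitting a function or a classifier to it,

in order to predict the target (dependent) variable for new data

ex) Text recognition, image recognition, credit scoring, decision trees, discriminant analysis, regression analysis, etc.

2) Unsupervised Learning

Analyzes data that has inputs but no known outputs and finds relationships among the observations.

ex) Cluster analysis, association analysis, association rule discovery, etc.

3) Reinforcement Learning

The learner is trained through rewards and penalties, for example a reward for winning and a penalty for losing as it plays a game.

ex) AlphaGo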
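The difference between the first two categories can be sketched in a few lines of R. This is only an illustration on R's built-in mtcars and iris data sets, which are not part of the notes above; the new car weight is a made-up value.

```r
# Supervised: the output (mpg) is known for every input (wt),
# so we can fit a function and predict the output for a new input.
fit <- lm(mpg ~ wt, data = mtcars)
new_car <- data.frame(wt = 3.0)   # hypothetical new observation
pred <- predict(fit, newdata = new_car)
pred

# Unsupervised: only the inputs are used. The Species labels are dropped,
# and k-means groups similar rows together.
set.seed(1)                       # k-means starts from random centres
clusters <- kmeans(iris[, 1:4], centers = 3)
table(clusters$cluster)
```

Regression appears in the supervised examples above and cluster analysis in the unsupervised ones, which is why lm() and kmeans() were chosen here.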

License: Attribution, NonCommercial, ShareAlike

Posted by Azel.Kim :

[R] Five functions for PCA (principal component analysis)

2016. 11. 7. 01:46 from Statistics/R

Source: http://gastonsanchez.com/how-to/2012/06/17/PCA-in-R/

5 functions to do Principal Components Analysis in R

17 Jun 2012

Principal Component Analysis (PCA) is a multivariate technique that allows us to summarize the systematic patterns of variations in the data.

From a data analysis standpoint, PCA is used for studying one table of observations and variables with the main idea of transforming the observed variables into a set of new variables, the principal components, which are uncorrelated and explain the variation in the data. For this reason, PCA allows us to reduce a “complex” data set to a lower dimension in order to reveal the structures or the dominant types of variation in both the observations and the variables.

PCA in R

In R, there are several functions from different packages that allow us to perform PCA. In this post I’ll show you 5 different ways to do a PCA using the following functions (with their corresponding packages in parentheses):

prcomp() (stats)
princomp() (stats)
PCA() (FactoMineR)
dudi.pca() (ade4)
acp() (amap)

Brief note: It is no coincidence that the three external packages ("FactoMineR", "ade4", and "amap") were developed by French data analysts, who have a long tradition of, and preference for, PCA and other related exploratory techniques.

No matter what function you decide to use, the typical PCA results should consist of a set of eigenvalues, a table with the scores or Principal Components (PCs), and a table of loadings (or correlations between variables and PCs). The eigenvalues provide information about the variability in the data. The scores provide information about the structure of the observations. The loadings (or correlations) allow you to get a sense of the relationships between variables, as well as their associations with the extracted PCs.

The Data

To make things easier, we’ll use the dataset USArrests that already comes with R. It’s a data frame with 50 rows (US states) and 4 columns containing information about violent crime rates by state. Since the variables are usually measured on different scales, the PCA must be performed on standardized data (mean = 0, variance = 1). The good news is that all of the functions that perform PCA come with parameters to specify that the analysis be applied to standardized data.
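To see what standardization does here, the following sketch uses scale() from base R. This is an illustration added to the text, not one of the five PCA functions.

```r
# Standardize each column: subtract its mean, divide by its standard deviation.
z <- scale(USArrests)

round(colMeans(z), 10)   # every column mean is (numerically) zero
apply(z, 2, sd)          # every column standard deviation is one

# Raw variances for comparison: without standardization, Assault would
# dominate the analysis simply because of its units.
sapply(USArrests, var)
```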

Option 1: using prcomp()

The function prcomp() comes with the default "stats" package, which means that you don’t have to install anything. It is perhaps the quickest way to do a PCA if you don’t want to install other packages.

# PCA with function prcomp
pca1 = prcomp(USArrests, scale. = TRUE)

# sqrt of eigenvalues
pca1$sdev

## [1] 1.5749 0.9949 0.5971 0.4164

# loadings
head(pca1$rotation)

##              PC1     PC2     PC3      PC4
## Murder   -0.5359  0.4182 -0.3412  0.64923
## Assault  -0.5832  0.1880 -0.2681 -0.74341
## UrbanPop -0.2782 -0.8728 -0.3780  0.13388
## Rape     -0.5434 -0.1673  0.8178  0.08902

# PCs (aka scores)
head(pca1$x)

##                PC1     PC2      PC3      PC4
## Alabama    -0.9757  1.1220 -0.43980  0.15470
## Alaska     -1.9305  1.0624  2.01950 -0.43418
## Arizona    -1.7454 -0.7385  0.05423 -0.82626
## Arkansas    0.1400  1.1085  0.11342 -0.18097
## California -2.4986 -1.5274  0.59254 -0.33856
## Colorado   -1.4993 -0.9776  1.08400  0.00145
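Since prcomp() returns the square roots of the eigenvalues, the eigenvalues and the share of variance explained by each component have to be derived from sdev. A short sketch, using only base R:

```r
pca1 <- prcomp(USArrests, scale. = TRUE)

eig <- pca1$sdev^2       # eigenvalues of the correlation matrix
prop <- eig / sum(eig)   # proportion of variance per component

round(eig, 4)
round(cumsum(prop), 4)   # cumulative proportion of variance
```

With standardized data the eigenvalues sum to the number of variables (4 here), so each proportion is simply the eigenvalue divided by 4.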

Option 2: using princomp()

The function princomp() also comes with the default "stats" package, and it is very similar to its cousin prcomp(). What I don’t like about princomp() is that sometimes it won’t display all the values for the loadings (its print method blanks out loadings with a small absolute value, which is why unclass() is used below), but this is a minor detail.

# PCA with function princomp
pca2 = princomp(USArrests, cor = TRUE)

# sqrt of eigenvalues
pca2$sdev

## Comp.1 Comp.2 Comp.3 Comp.4 
## 1.5749 0.9949 0.5971 0.4164

# loadings
unclass(pca2$loadings)

##           Comp.1  Comp.2  Comp.3   Comp.4
## Murder   -0.5359  0.4182 -0.3412  0.64923
## Assault  -0.5832  0.1880 -0.2681 -0.74341
## UrbanPop -0.2782 -0.8728 -0.3780  0.13388
## Rape     -0.5434 -0.1673  0.8178  0.08902

# PCs (aka scores)
head(pca2$scores)

##             Comp.1  Comp.2   Comp.3    Comp.4
## Alabama    -0.9856  1.1334 -0.44427  0.156267
## Alaska     -1.9501  1.0732  2.04000 -0.438583
## Arizona    -1.7632 -0.7460  0.05478 -0.834653
## Arkansas    0.1414  1.1198  0.11457 -0.182811
## California -2.5240 -1.5429  0.59856 -0.341996
## Colorado   -1.5146 -0.9876  1.09501  0.001465
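The princomp() scores above are slightly larger in magnitude than the prcomp() scores from Option 1. The reason is that princomp() standardizes with a variance divisor of n while prcomp() uses n - 1. A sketch that checks this relationship:

```r
pca1 <- prcomp(USArrests, scale. = TRUE)
pca2 <- princomp(USArrests, cor = TRUE)

n <- nrow(USArrests)
ratio <- sqrt(n / (n - 1))   # divisor n (princomp) versus n - 1 (prcomp)

# Compare absolute values, since the sign of a component is arbitrary.
max_diff <- max(abs(abs(unclass(pca2$scores)) - abs(pca1$x) * ratio))
max_diff
```

The difference should be zero up to floating-point error, for example 0.9757 * sqrt(50 / 49) is about 0.9856 for Alabama on the first component.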

Option 3: using PCA()

A highly recommended option, especially if you want more detailed results and assessment tools, is the PCA() function from the package "FactoMineR". In my opinion it is by far the best PCA function in R, and it comes with a number of parameters that allow you to tweak the analysis in a very nice way.

# PCA with function PCA
library(FactoMineR)

# apply PCA
pca3 = PCA(USArrests, graph = FALSE)

# matrix with eigenvalues
pca3$eig

##        eigenvalue percentage of variance cumulative percentage of variance
## comp 1     2.4802                 62.006                             62.01
## comp 2     0.9898                 24.744                             86.75
## comp 3     0.3566                  8.914                             95.66
## comp 4     0.1734                  4.336                            100.00

# correlations between variables and PCs
pca3$var$coord

##           Dim.1   Dim.2   Dim.3    Dim.4
## Murder   0.8440 -0.4160  0.2038  0.27037
## Assault  0.9184 -0.1870  0.1601 -0.30959
## UrbanPop 0.4381  0.8683  0.2257  0.05575
## Rape     0.8558  0.1665 -0.4883  0.03707

# PCs (aka scores)
head(pca3$ind$coord)

##              Dim.1   Dim.2    Dim.3     Dim.4
## Alabama     0.9856 -1.1334  0.44427  0.156267
## Alaska      1.9501 -1.0732 -2.04000 -0.438583
## Arizona     1.7632  0.7460 -0.05478 -0.834653
## Arkansas   -0.1414 -1.1198 -0.11457 -0.182811
## California  2.5240  1.5429 -0.59856 -0.341996
## Colorado    1.5146  0.9876 -1.09501  0.001465
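The "correlations between variables and PCs" can also be obtained from the loadings: for standardized data, each correlation is the loading multiplied by the standard deviation of that component. The sketch below checks this with prcomp() from base R rather than FactoMineR, so it should match the table above only up to the sign of each column.

```r
pca1 <- prcomp(USArrests, scale. = TRUE)

# Correlation of each original variable with each PC ...
cor_direct <- cor(USArrests, pca1$x)

# ... equals the loading scaled by that PC's standard deviation
# (this identity relies on the data having been standardized).
cor_from_loadings <- sweep(pca1$rotation, 2, pca1$sdev, FUN = "*")

max_diff <- max(abs(cor_direct - cor_from_loadings))
max_diff
```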

Option 4: using dudi.pca()

Another option is to use the dudi.pca() function from the package "ade4", which has a huge number of other methods as well as some interesting graphics.

# PCA with function dudi.pca
library(ade4)

# apply PCA
pca4 = dudi.pca(USArrests, nf = 5, scannf = FALSE)

# eigenvalues
pca4$eig

## [1] 2.4802 0.9898 0.3566 0.1734

# loadings
pca4$c1

##              CS1     CS2     CS3      CS4
## Murder   -0.5359  0.4182 -0.3412  0.64923
## Assault  -0.5832  0.1880 -0.2681 -0.74341
## UrbanPop -0.2782 -0.8728 -0.3780  0.13388
## Rape     -0.5434 -0.1673  0.8178  0.08902

# correlations between variables and PCs
pca4$co

##            Comp1   Comp2   Comp3    Comp4
## Murder   -0.8440  0.4160 -0.2038  0.27037
## Assault  -0.9184  0.1870 -0.1601 -0.30959
## UrbanPop -0.4381 -0.8683 -0.2257  0.05575
## Rape     -0.8558 -0.1665  0.4883  0.03707

# PCs
head(pca4$li)

##              Axis1   Axis2    Axis3     Axis4
## Alabama    -0.9856  1.1334 -0.44427  0.156267
## Alaska     -1.9501  1.0732  2.04000 -0.438583
## Arizona    -1.7632 -0.7460  0.05478 -0.834653
## Arkansas    0.1414  1.1198  0.11457 -0.182811
## California -2.5240 -1.5429  0.59856 -0.341996
## Colorado   -1.5146 -0.9876  1.09501  0.001465

Option 5: using acp()

A fifth possibility is the acp() function from the package "amap".

# PCA with function acp
library(amap)

# apply PCA
pca5 = acp(USArrests)

# sqrt of eigenvalues
pca5$sdev

## Comp 1 Comp 2 Comp 3 Comp 4 
## 1.5749 0.9949 0.5971 0.4164

# loadings
pca5$loadings

##          Comp 1  Comp 2  Comp 3   Comp 4
## Murder   0.5359  0.4182 -0.3412  0.64923
## Assault  0.5832  0.1880 -0.2681 -0.74341
## UrbanPop 0.2782 -0.8728 -0.3780  0.13388
## Rape     0.5434 -0.1673  0.8178  0.08902

# scores
head(pca5$scores)

##             Comp 1  Comp 2   Comp 3   Comp 4
## Alabama     0.9757  1.1220 -0.43980  0.15470
## Alaska      1.9305  1.0624  2.01950 -0.43418
## Arizona     1.7454 -0.7385  0.05423 -0.82626
## Arkansas   -0.1400  1.1085  0.11342 -0.18097
## California  2.4986 -1.5274  0.59254 -0.33856
## Colorado    1.4993 -0.9776  1.08400  0.00145
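Across the five outputs the same numbers appear with different signs. This is expected: the sign of a principal component is arbitrary, as long as the loadings and the scores of that component are flipped together. A small sketch with prcomp() that illustrates the point:

```r
pca1 <- prcomp(USArrests, scale. = TRUE)

# Flip the first component: negate its loadings and its scores together.
rot_flipped <- pca1$rotation
x_flipped <- pca1$x
rot_flipped[, 1] <- -rot_flipped[, 1]
x_flipped[, 1] <- -x_flipped[, 1]

# Both versions reconstruct the same standardized data.
recon_original <- pca1$x %*% t(pca1$rotation)
recon_flipped <- x_flipped %*% t(rot_flipped)
max_diff <- max(abs(recon_original - recon_flipped))
max_diff
```

When comparing results from different functions, it is therefore safer to compare absolute values or to align the signs first.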

Of course these are not the only options to do a PCA, but I’ll leave the other approaches for another post.

PCA plots

Everybody uses PCA to visualize the data, and most of the discussed functions come with their own plot functions. But you can also make use of the great graphical displays of "ggplot2". Just to show you a couple of plots, let’s take the basic results from prcomp().

Plot of observations

# load ggplot2
library(ggplot2)

# create data frame with scores
scores = as.data.frame(pca1$x)

# plot of observations
ggplot(data = scores, aes(x = PC1, y = PC2, label = rownames(scores))) +
  geom_hline(yintercept = 0, colour = "gray65") +
  geom_vline(xintercept = 0, colour = "gray65") +
  geom_text(colour = "tomato", alpha = 0.8, size = 4) +
  ggtitle("PCA plot of USA States - Crime Rates")


Circle of correlations

# function to create a circle of radius 1 around `center`
circle <- function(center = c(0, 0), npoints = 100) {
    r = 1
    tt = seq(0, 2 * pi, length.out = npoints)
    xx = center[1] + r * cos(tt)
    # y coordinates are offset by the y value of the centre
    yy = center[2] + r * sin(tt)
    return(data.frame(x = xx, y = yy))
}
corcir = circle(c(0, 0), npoints = 100)

# create data frame with correlations between variables and PCs
correlations = as.data.frame(cor(USArrests, pca1$x))

# data frame with arrows coordinates
arrows = data.frame(x1 = c(0, 0, 0, 0), y1 = c(0, 0, 0, 0), x2 = correlations$PC1, 
    y2 = correlations$PC2)

# geom_path draws the open circle
ggplot() +
  geom_path(data = corcir, aes(x = x, y = y), colour = "gray65") +
  geom_segment(data = arrows, aes(x = x1, y = y1, xend = x2, yend = y2), colour = "gray65") +
  geom_text(data = correlations, aes(x = PC1, y = PC2, label = rownames(correlations))) +
  geom_hline(yintercept = 0, colour = "gray65") +
  geom_vline(xintercept = 0, colour = "gray65") +
  xlim(-1.1, 1.1) + ylim(-1.1, 1.1) +
  labs(x = "pc1 axis", y = "pc2 axis") +
  ggtitle("Circle of correlations")


License: Attribution, NonCommercial, ShareAlike


Posted by Azel.Kim :
