Description
Hierarchical cluster analysis on a set of dissimilarities and methods
for analyzing it.
Usage
hclust(d, method = "complete", members = NULL)
## S3 method for class 'hclust'
plot(x, labels = NULL, hang = 0.1, check = TRUE, axes = TRUE,
     frame.plot = FALSE, ann = TRUE, main = "Cluster Dendrogram",
     sub = NULL, xlab = NULL, ylab = "Height", ...)
Arguments
d a dissimilarity structure as produced by dist.
method the agglomeration method to be used. This should be (an
unambiguous abbreviation of) one of "ward.D", "ward.D2", "single",
"complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (=
WPGMC) or "centroid" (= UPGMC).
members NULL or a vector with length size of d. See the
‘Details’ section.
x an object of the type produced by hclust.
hang The fraction of the plot height by which labels should
hang below the rest of the plot. A negative value will cause the labels
to hang down from 0.
check logical indicating if the x object should be checked for
validity. This check is not necessary when x is known to be valid such
as when it is the direct result of hclust(). The default is check=TRUE,
as invalid inputs may crash R due to memory violation in the internal C
plotting code.
labels A character vector of labels for the leaves of the tree.
By default the row names or row numbers of the original data are used.
If labels = FALSE no labels at all are plotted.
axes, frame.plot, ann logical flags as in plot.default.
main, sub, xlab, ylab character strings for title. sub and xlab
have a non-NULL default when there's a tree$call.
… Further graphical arguments. E.g., cex controls the size of the labels
(if plotted) in the same way as text.
Details This function performs a hierarchical cluster analysis using a set of dissimilarities for the \(n\) objects being clustered. Initially, each object is assigned to its own cluster and then the algorithm proceeds iteratively, at each stage joining the two most similar clusters, continuing until there is just a single cluster. At each stage distances between clusters are recomputed by the Lance–Williams dissimilarity update formula according to the particular clustering method being used.
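Purely as an illustration of this iterative scheme (and not the algorithm hclust uses internally), a naive version of the single-linkage case can be sketched as follows; naiveSingle is a made-up helper name and the code assumes a dist object without problematic ties.
naiveSingle <- function(d) {
  D <- as.matrix(d)                      # full dissimilarity matrix
  diag(D) <- Inf                         # ignore self-distances
  heights <- numeric(nrow(D) - 1)
  for (step in seq_along(heights)) {
    ij <- which(D == min(D), arr.ind = TRUE)[1, ]  # closest pair of current clusters
    i <- min(ij); j <- max(ij)
    heights[step] <- D[i, j]
    ## Lance-Williams update for single linkage: the new distance is the minimum
    D[i, ] <- D[, i] <- pmin(D[i, ], D[j, ])
    D <- D[-j, -j, drop = FALSE]         # drop the absorbed cluster
    diag(D) <- Inf
  }
  heights
}
## On a small subset the merge heights should coincide (barring ties) with hclust's:
d <- dist(USArrests[1:10, ])
all.equal(naiveSingle(d), hclust(d, method = "single")$height)  # expected TRUE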
A number of different clustering methods are provided. Ward's minimum variance method aims at finding compact, spherical clusters. The complete linkage method finds similar clusters. The single linkage method (which is closely related to the minimal spanning tree) adopts a 'friends of friends' clustering strategy. The other methods can be regarded as aiming for clusters with characteristics somewhere between the single and complete link methods. Note, however, that methods "median" and "centroid" do not lead to a monotone distance measure, or equivalently the resulting dendrograms can have so-called inversions or reversals, which are hard to interpret, but note the trichotomies in Legendre and Legendre (2012).
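As an informal illustration (not part of the help page's examples), the following sketch compares three of these methods side by side on the USArrests data also used below:
d <- dist(USArrests)
opar <- par(mfrow = c(1, 3))
for (m in c("single", "complete", "ward.D2"))
  plot(hclust(d, method = m), hang = -1, labels = FALSE, main = m, sub = "", xlab = "")
par(opar)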
Two different algorithms are found in the literature for Ward clustering. The one used by option "ward.D" (equivalent to the only Ward option "ward" in R versions ≤ 3.0.3) does not implement Ward's (1963) clustering criterion, whereas option "ward.D2" implements that criterion (Murtagh and Legendre 2014). With the latter, the dissimilarities are squared before cluster updating. Note that agnes(*, method="ward") corresponds to hclust(*, "ward.D2").
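A quick, hedged check of this relationship (not in the original examples): running "ward.D" on the squared dissimilarities should reproduce the "ward.D2" merge structure, with heights related by a square root.
hc.D  <- hclust(UScitiesD^2, method = "ward.D")
hc.D2 <- hclust(UScitiesD,   method = "ward.D2")
identical(hc.D$merge, hc.D2$merge)           # expected TRUE (up to ties)
all.equal(sqrt(hc.D$height), hc.D2$height)   # expected TRUE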
If members != NULL, then d is taken to be a dissimilarity matrix between clusters instead of dissimilarities between singletons and members gives the number of observations per cluster. This way the hierarchical cluster algorithm can be ‘started in the middle of the dendrogram’, e.g., in order to reconstruct the part of the tree above a cut (see examples). Dissimilarities between clusters can be efficiently computed (i.e., without hclust itself) only for a limited number of distance/linkage combinations, the simplest one being squared Euclidean distance and centroid linkage. In this case the dissimilarities between the clusters are the squared Euclidean distances between cluster means.
In hierarchical cluster displays, a decision is needed at each merge to specify which subtree should go on the left and which on the right. Since, for \(n\) observations there are \(n-1\) merges, there are \(2^{(n-1)}\) possible orderings for the leaves in a cluster tree, or dendrogram. The algorithm used in hclust is to order the subtree so that the tighter cluster is on the left (the last, i.e., most recent, merge of the left subtree is at a lower value than the last merge of the right subtree). Single observations are the tightest clusters possible, and merges involving two observations place them in order by their observation sequence number.
Value An object of class hclust which describes the
tree produced by the clustering process. The object is a list with
components:
merge an \(n-1 \times 2\)
matrix. Row \(i\) of merge describes
the merging of clusters at step \(i\)
of the clustering. If an element \(j\)
in the row is negative, then observation \(-j\) was merged at this stage. If \(j\) is positive then the merge was with the
cluster formed at the (earlier) stage \(j\) of the algorithm. Thus negative entries
in merge indicate agglomerations of singletons, and positive entries
indicate agglomerations of non-singletons.
height a set of \(n-1\) real
values (non-decreasing for ultrametric trees). The clustering height:
that is, the value of the criterion associated with the clustering
method for the particular agglomeration.
order a vector giving the permutation of the original
observations suitable for plotting, in the sense that a cluster plot
using this ordering and matrix merge will not have crossings of the
branches.
labels labels for each of the objects being clustered.
call the call which produced the result.
method the cluster method that has been used.
dist.method the distance that has been used to create \(d\) (only returned if the distance object
has a “method” attribute).
There are print, plot and identify (see identify.hclust) methods and the rect.hclust() function for hclust objects.
Note Method “centroid” is typically meant to be used with squared Euclidean distances.
Author(s)
The hclust function is based on Fortran code contributed to STATLIB by
F. Murtagh.
References
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988). The New S
Language. Wadsworth & Brooks/Cole. (S version.)
Everitt, B. (1974). Cluster Analysis. London: Heinemann Educ. Books.
Hartigan, J.A. (1975). Clustering Algorithms. New York: Wiley.
Sneath, P. H. A. and R. R. Sokal (1973). Numerical Taxonomy. San Francisco: Freeman.
Anderberg, M. R. (1973). Cluster Analysis for Applications. Academic Press: New York.
Gordon, A. D. (1999). Classification. Second Edition. London: Chapman and Hall / CRC
Murtagh, F. (1985). “Multidimensional Clustering Algorithms”, in COMPSTAT Lectures 4. Wuerzburg: Physica-Verlag (for algorithmic details of algorithms used).
McQuitty, L.L. (1966). Similarity Analysis by Reciprocal Pairs for Discrete and Continuous Data. Educational and Psychological Measurement, 26, 825–831. doi:10.1177/001316446602600402.
Legendre, P. and L. Legendre (2012). Numerical Ecology, 3rd English ed. Amsterdam: Elsevier Science BV.
Murtagh, Fionn and Legendre, Pierre (2014). Ward's hierarchical agglomerative clustering method: which algorithms implement Ward's criterion? Journal of Classification, 31, 274–295. doi:10.1007/s00357-014-9161-z.
See Also: identify.hclust, rect.hclust, cutree, dendrogram,
kmeans.
For the Lance–Williams formula and methods that apply it generally, see agnes from package cluster.
Examples
## hclust with Euclidean distance and the average linkage method
require(graphics) # graphics is needed
hc <- hclust(dist(USArrests), "ave") # clustering
plot(hc) # dendrogram
plot(hc, hang = -1) # aligned labels
## hclust with squared Euclidean distance and the centroid method
## Do the same with centroid clustering and *squared* Euclidean distance,
## cut the tree into ten clusters and reconstruct the upper part of the
## tree from the cluster centers.
hc <- hclust(dist(USArrests)^2, "cen")
memb <- cutree(hc, k = 10) # cut the tree
cent <- NULL
for(k in 1:10){
cent <- rbind(cent, colMeans(USArrests[memb == k, , drop = FALSE])) # means of each class
}
# clustering restarted from the ten centroids
hc1 <- hclust(dist(cent)^2, method = "cen", members = table(memb))
opar <- par(mfrow = c(1, 2)) # two plots side by side
plot(hc, labels = FALSE, hang = -1, main = "Original Tree")
plot(hc1, labels = FALSE, hang = -1, main = "Re-start from 10 clusters")
par(opar) # restore graphical parameters
## Example 2: Straight-line distances among 10 US cities
## Compare the results of algorithms "ward.D" and "ward.D2"
Note that cmdscale() is used here to construct coordinates for a graphical representation of the cities.
mds2 <- -cmdscale(UScitiesD) # MDS coordinates of the cities
plot(mds2, type="n", axes=FALSE, ann=FALSE) # empty plot
text(mds2, labels=rownames(mds2), xpd = NA) # labels only
# compare the two Ward variants
hcity.D <- hclust(UScitiesD, "ward.D") # "wrong"
hcity.D2 <- hclust(UScitiesD, "ward.D2") # good
opar <- par(mfrow = c(1, 2))
plot(hcity.D, hang=-1)
plot(hcity.D2, hang=-1)
par(opar)
Here we carry out a hierarchical classification with hclust; the distance matrix has to be computed beforehand.
We try it on the Linnerud data.
linnerud <- read.csv("Linnerud.csv",header = TRUE, row.names = 1)
n <- dim(linnerud)[1]
lin.dis <- dist(linnerud);round(lin.dis,3) # Euclidean distances
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
## 2 52.173
## 3 74.317 43.761
## 4 69.376 38.562 71.400
## 5 11.747 46.797 70.250 62.666
## 6 64.521 21.794 60.614 23.281 58.335
## 7 68.220 33.196 65.673 49.790 62.722 29.614
## 8 49.497 34.612 70.915 21.794 44.204 28.792 50.517
## 9 52.631 96.737 118.849 96.948 57.637 101.543 107.014 77.408
## 10 213.492 239.714 215.119 258.484 217.773 258.314 266.081 245.526 217.984
## 11 53.666 34.957 70.788 21.095 45.365 27.477 47.518 15.937 83.917 249.734
## 12 77.660 117.009 113.437 131.263 82.722 132.559 141.071 113.864 79.480 141.711 118.571
## 13 80.523 120.715 120.785 129.603 85.790 133.854 144.395 112.134 71.021 149.650 118.085 20.372
## 14 126.079 84.558 91.471 103.947 121.384 83.875 64.730 111.328 169.228 299.184 107.369 191.580 198.514
## 15 96.571 50.100 77.750 50.150 89.554 36.152 37.974 63.103 134.759 287.181 57.149 165.768 168.565 61.790
## 16 78.861 118.191 111.086 139.707 85.434 135.890 136.890 122.061 85.820 144.783 126.850 37.921 50.705 181.304 167.018
## 17 108.959 62.466 88.470 49.970 102.137 44.844 55.642 67.676 142.766 296.275 63.340 175.502 176.403 76.629 22.226 179.803
## 18 78.968 126.214 135.834 132.699 84.747 137.080 146.226 113.217 57.983 171.473 118.541 41.364 31.875 204.206 171.388 64.428 179.900
## 19 74.081 121.128 132.586 125.734 79.284 130.947 140.371 106.433 50.180 178.933 111.535 45.793 35.071 199.539 165.263 68.110 173.396 9.798
## 20 78.384 56.232 81.823 27.875 73.892 46.637 75.100 34.000 98.803 251.704 40.447 127.848 123.637 126.586 72.650 142.046 66.963 128.328 121.713
lin.clust <- hclust(lin.dis,method="ward.D2"); lin.clust # clustering
##
## Call:
## hclust(d = lin.dis, method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 20
str(lin.clust) # contents of the output
## List of 7
## $ merge : int [1:19, 1:2] -18 -1 -8 -12 -2 -15 -4 -7 -20 -16 ...
## $ height : num [1:19] 9.8 11.7 15.9 20.4 21.8 ...
## $ order : int [1:20] 14 15 17 1 5 20 4 8 11 3 ...
## $ labels : chr [1:20] "1" "2" "3" "4" ...
## $ method : chr "ward.D2"
## $ call : language hclust(d = lin.dis, method = "ward.D2")
## $ dist.method: chr "euclidean"
## - attr(*, "class")= chr "hclust"
plot(lin.clust) # dendrogram
plot(lin.clust,hang=-1,main="Linnerud data") # nicer dendrogram
hclust returns many components; let us go through all of them:
noquote("contenido de hclust")
## [1] contenido de hclust
summary(lin.clust)
## Length Class Mode
## merge 38 -none- numeric
## height 19 -none- numeric
## order 20 -none- numeric
## labels 20 -none- character
## method 1 -none- character
## call 3 -none- call
## dist.method 1 -none- character
noquote("pareja de nodos que se juntan")
## [1] pareja de nodos que se juntan
lin.clust$merge
## [,1] [,2]
## [1,] -18 -19
## [2,] -1 -5
## [3,] -8 -11
## [4,] -12 -13
## [5,] -2 -6
## [6,] -15 -17
## [7,] -4 3
## [8,] -7 5
## [9,] -20 7
## [10,] -16 4
## [11,] -9 1
## [12,] -3 8
## [13,] -14 6
## [14,] 9 12
## [15,] 10 11
## [16,] 2 14
## [17,] 13 16
## [18,] -10 15
## [19,] 17 18
noquote("índice de la jerarquía")
## [1] índice de la jerarquía
lin.clust$height
## [1] 9.797959 11.747340 15.937377 20.371549 21.794495 22.226111 22.992752 34.073450 39.860172 50.342163 62.353829 67.394362 79.343137 85.195364 87.882497 94.710348 144.225589 217.155419 401.177895
noquote("orden horizontal de las hojas")
## [1] orden horizontal de las hojas
lin.clust$order
## [1] 14 15 17 1 5 20 4 8 11 3 7 2 6 10 16 12 13 9 18 19
noquote("etiquetas de las unidades en orden original")
## [1] etiquetas de las unidades en orden original
lin.clust$labels
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "20"
noquote("método de aglomeración")
## [1] método de aglomeración
lin.clust$method
## [1] "ward.D2"
noquote("comando")
## [1] comando
lin.clust$call
## hclust(d = lin.dis, method = "ward.D2")
noquote("tipo de distancia empleado")
## [1] tipo de distancia empleado
lin.clust$dist.method
## [1] "euclidean"
noquote("hacemos todas las particiones")
## [1] hacemos todas las particiones
part <- cutree(lin.clust, k = 1:(n-1)) # particiones
part
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 2 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 3 1 1 1 1 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3
## 4 1 1 1 1 2 2 3 3 4 4 4 4 4 4 4 4 4 4 4
## 5 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 5
## 6 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 5 5 5 6
## 7 1 1 1 1 2 2 2 2 2 2 2 2 5 5 5 6 6 6 7
## 8 1 1 1 1 2 2 3 3 4 4 4 4 4 6 6 7 7 7 8
## 9 1 2 2 2 3 3 4 4 5 5 5 5 6 7 7 8 8 8 9
## 10 1 2 3 3 4 4 5 5 6 6 6 6 7 8 8 9 9 9 10
## 11 1 1 1 1 2 2 3 3 4 4 4 4 4 6 6 7 7 10 11
## 12 1 2 2 2 3 5 6 6 7 7 7 7 8 9 9 10 10 11 12
## 13 1 2 2 2 3 5 6 6 7 7 7 7 8 9 9 10 11 12 13
## 14 1 1 1 4 5 6 7 7 8 8 8 8 9 10 10 11 12 13 14
## 15 1 1 1 4 5 6 7 8 9 9 9 9 10 11 11 12 13 14 15
## 16 1 2 2 2 3 5 6 6 7 7 10 10 11 12 12 13 14 15 16
## 17 1 1 1 4 5 6 7 8 9 9 9 9 10 11 13 14 15 16 17
## 18 1 2 2 2 3 3 4 4 5 10 11 11 12 13 14 15 16 17 18
## 19 1 2 2 2 3 3 4 4 5 10 11 11 12 13 14 15 16 17 18
## 20 1 1 1 1 2 2 3 3 4 4 4 12 13 14 15 16 17 18 19
Each column gives, for every row (unit), the class it belongs to in the partition with that number of groups.
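For instance, the sizes of the classes in the four-group partition can be read off directly from the corresponding column (a small illustrative check):
table(part[, "4"])  # class sizes of the four-group partition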
Let us now organize this output in a more convenient way, building a table of values for each level of the hierarchy:
data.hcl <- lin.clust # hclust output
maxn <- n-1 # tree size
index <- lin.clust$height # hierarchy index
Di <- c(index[2:maxn]-index[1:(maxn-1)],0) # first differences
Di2 <- c(Di[2:maxn]-Di[1:(maxn-1)],0) # second differences
index <- cbind(c((maxn):1),lin.clust$merge,index,Di,Di2) # table
rownames(index) <- c(1:maxn)
colnames(index) <- c("ngr","g1","g2","index","D","D2")
dindex <- as.data.frame(cbind(index[,1:3],round(index[,4:6],4)))
dindex
## ngr g1 g2 index D D2
## 1 19 -18 -19 9.7980 1.9494 2.2407
## 2 18 -1 -5 11.7473 4.1900 0.2441
## 3 17 -8 -11 15.9374 4.4342 -3.0112
## 4 16 -12 -13 20.3715 1.4229 -0.9913
## 5 15 -2 -6 21.7945 0.4316 0.3350
## 6 14 -15 -17 22.2261 0.7666 10.3141
## 7 13 -4 3 22.9928 11.0807 -5.2940
## 8 12 -7 5 34.0735 5.7867 4.6953
## 9 11 -20 7 39.8602 10.4820 1.5297
## 10 10 -16 4 50.3422 12.0117 -6.9711
## 11 9 -9 1 62.3538 5.0405 6.9082
## 12 8 -3 8 67.3944 11.9488 -6.0965
## 13 7 -14 6 79.3431 5.8522 -3.1651
## 14 6 9 12 85.1954 2.6871 4.1407
## 15 5 10 11 87.8825 6.8279 42.6874
## 16 4 2 14 94.7103 49.5152 23.4146
## 17 3 13 16 144.2256 72.9298 111.0926
## 18 2 -10 15 217.1554 184.0225 -184.0225
## 19 1 17 18 401.1779 0.0000 0.0000
daplot <- dindex[maxn:1,4:6]
methods <- paste("Dissimilarity:",data.hcl$dist.method,
", agglomeration:",data.hcl$method)
########
plot(1:maxn, daplot[,1], type="l", xlab="Number of Clusters", # plot
ylab="Index", main = "K-means inertia variation",sub=methods)
lines(1:maxn, daplot[,2],col="blue")
lines(1:maxn, daplot[,3],col="green")
legend("topright",legend=c("Index","Differences",
"Second differences"),
col=c("black","blue","green"),lty=1,cex=0.75)
#######
This can also be turned into a function:
stat.hclust <- function(hclust) {
n <- length(hclust$labels) # number of units
maxn <- n-1 # tree size
index <- hclust$height # hierarchy index
Di <- c(index[2:maxn]-index[1:(maxn-1)],0) # first differences
Di2 <- c(Di[2:maxn]-Di[1:(maxn-1)],0) # second differences
index <- cbind(c((maxn):1),hclust$merge,index,Di,Di2) # table
rownames(index) <- c(1:maxn) # labels
colnames(index) <- c("ngr","g1","g2","index","D","D2")
dindex <- as.data.frame(cbind(index[,1:3],round(index[,4:6],4))) # output
daplot <- index[maxn:1,4:6] # data for the plot
methods <- paste("Dissimilarity:",hclust$dist.method, # plot titles
", agglomeration:",hclust$method)
########
plot(1:maxn, daplot[,1], type="l", xlab="Number of Clusters",
ylim = range(daplot[,1:3]), # gráfico
ylab="Index", main = "Statistics of hierarchy indices",sub=methods)
lines(1:maxn, daplot[,2],col="blue") # the two difference curves
lines(1:maxn, daplot[,3],col="green")
legend("topright",legend=c("Index","Differences", # leyenda
"Second differences"),col=c("black","blue","green"),lty=1,cex=0.75)
#######
dindex
}
This gives the following result:
stat.hclust(lin.clust)
## ngr g1 g2 index D D2
## 1 19 -18 -19 9.7980 1.9494 2.2407
## 2 18 -1 -5 11.7473 4.1900 0.2441
## 3 17 -8 -11 15.9374 4.4342 -3.0112
## 4 16 -12 -13 20.3715 1.4229 -0.9913
## 5 15 -2 -6 21.7945 0.4316 0.3350
## 6 14 -15 -17 22.2261 0.7666 10.3141
## 7 13 -4 3 22.9928 11.0807 -5.2940
## 8 12 -7 5 34.0735 5.7867 4.6953
## 9 11 -20 7 39.8602 10.4820 1.5297
## 10 10 -16 4 50.3422 12.0117 -6.9711
## 11 9 -9 1 62.3538 5.0405 6.9082
## 12 8 -3 8 67.3944 11.9488 -6.0965
## 13 7 -14 6 79.3431 5.8522 -3.1651
## 14 6 9 12 85.1954 2.6871 4.1407
## 15 5 10 11 87.8825 6.8279 42.6874
## 16 4 2 14 94.7103 49.5152 23.4146
## 17 3 13 16 144.2256 72.9298 111.0926
## 18 2 -10 15 217.1554 184.0225 -184.0225
## 19 1 17 18 401.1779 0.0000 0.0000
In the case of Euclidean distance, a treatment similar to that used for k-means can be applied.
First, two functions for computing the inertias need to be introduced.
####### this function computes the centered sum of squares of a data.frame
css <- function(data) { # Centered sum of squares of a data.frame
sum(scale(data, center = TRUE, scale = FALSE)^2)
}
####### this function computes the within-class sum of squares of a partition from hclust
wrap <- function(i, hc, x) { # i = # of clusters, hc = hclust, x = data
cl <- cutree(hc, i) # gets the partition in i groups
spl <- split(x, cl) # splits the data in classes
wss <- sum(sapply(spl, css)) # computes the wss
wss
}
################################################################################
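As a quick illustration (using the linnerud data and lin.clust from above), the within-class sum of squares of, say, the four-group partition is obtained with:
wrap(4, lin.clust, linnerud)  # within-class sum of squares for k = 4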
Then a similar table is built, but with inertias.
data <- linnerud # example data
hclust <- lin.clust # hclust output
n <- length(hclust$labels) # number of units
maxn <- n-1 # tree size
index <- hclust$height # hierarchy index
ast <- rep("",maxn) # local-maximum flags
wss <- rep(0,maxn) # within ss
wssd <- wss # first differences
wssd2 <- wss # second differences
indx <- hclust$height # index
tss <- css(data)
for (i in 1:maxn) {
wss[i] <- wrap((maxn-i+1),hclust,data) # fill wss
}
bss <- tss - wss # between ss
wssd <- c(wss[2:maxn] - wss[1:(maxn-1)],0) # first differences
wssd2 <- c(wssd[2:maxn] - wssd[1:(maxn-1)],0) # second differences
df1 <- c(maxn:1)-1 # degrees of freedom
df2 <- (n-df1+1)
CH <- (bss/df1)/(wss/df2) # Calinski-Harabasz index
CH[maxn] <- 0 # fix NaN
for (i in 2:(maxn-1)) { # detect
if ((CH[i] > CH[i-1]) & (CH[i] > CH[i+1])) { # local
ast[i] <- "*" # maxima
}
}
index <- cbind(index,bss,wss,wssd,wssd2,CH) # table
dindex <- data.frame(c(maxn:1),hclust$merge,round(index,4),ast) # full table
colnames(dindex) <- c("ngr","g1","g2","index","bss","wss", # labels
"Dwss","D2wss","CH","max")
rownames(dindex) <- c(1:maxn)
dindex
## ngr g1 g2 index bss wss Dwss D2wss CH max
## 1 19 -18 -19 9.7980 137583.50 48.000 69.0000 58.0000 477.7205
## 2 18 -1 -5 11.7473 137514.50 117.000 127.0000 80.5000 276.5500
## 3 17 -8 -11 15.9374 137387.50 244.000 207.5000 30.0000 175.9574
## 4 16 -12 -13 20.3715 137180.00 451.500 237.5000 9.5000 121.5327
## 5 15 -2 -6 21.7945 136942.50 689.000 247.0000 17.3333 99.3777
## 6 14 -15 -17 22.2261 136695.50 936.000 264.3333 316.1667 89.8721
## 7 13 -4 3 22.9928 136431.17 1200.333 580.5000 213.9167 85.2458
## 8 12 -7 5 34.0735 135850.67 1780.833 794.4167 472.7500 69.3499
## 9 11 -20 7 39.8602 135056.25 2575.250 1267.1667 676.8333 57.6883
## 10 10 -16 4 50.3422 133789.08 3842.417 1944.0000 327.0000 46.4253
## 11 9 -9 1 62.3538 131845.08 5786.417 2271.0000 876.6667 37.0261
## 12 8 -3 8 67.3944 129574.08 8057.417 3147.6667 481.4583 32.1627
## 13 7 -14 6 79.3431 126426.42 11205.083 3629.1250 232.5417 28.2074
## 14 6 9 12 85.1954 122797.29 14834.208 3861.6667 623.3583 26.4895
## 15 5 10 11 87.8825 118935.62 18695.875 4485.0250 5915.4853 27.0368
## 16 4 2 14 94.7103 114450.60 23180.900 10400.5103 13177.7278 29.6237 *
## 17 3 13 16 144.2256 104050.09 33581.410 23578.2381 56893.6136 29.4352
## 18 2 -10 15 217.1554 80471.85 57159.648 80471.8516 -80471.8516 28.1569
## 19 1 17 18 401.1779 0.00 137631.500 0.0000 0.0000 0.0000
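The table can also be queried programmatically; for instance, a hedged sketch for retrieving the partition at the marked local maximum of the Calinski-Harabasz index:
best.k <- dindex$ngr[dindex$max == "*"]   # number of groups at the CH local maximum
memb.best <- cutree(lin.clust, k = best.k[1])
table(memb.best)                          # sizes of the retained classes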
To plot the inertias, the order of the vectors needs to be reversed.
daplot <- dindex[maxn:1,c(5:10)]
methods <- paste("Dissimilarity:",hclust$dist.method, # plot titles
", agglomeration:",hclust$method)
std.par <- par(mar=c(5, 4, 4, 4)+.1) # par
plot(1:maxn, daplot[,1], type="l", col="violet", xlab="Number of Clusters",
ylim = range(daplot[,1:4]),
ylab="Inertia", main = "Statistics on inertia variation",sub = methods)
lines(1:maxn, daplot[,2],col="black")
lines(1:maxn, daplot[,3],col="blue")
lines(1:maxn, daplot[,4],col="green")
legend("topright",legend=c("Between inertia","Within inertia","Differences",
"Second differences","Calinski index"),
col=c("violet","black","blue","green","red"),lty=1,cex=0.75)
par(new = TRUE)
plot(1:maxn, daplot[,5], type = "l", axes = FALSE, bty = "n", xlab = "",
ylab = "",col="red",ylim=c(range(daplot[,5])*1.5))
text(1:maxn,daplot[,5], labels=daplot[,6],pos=3,col="red",cex=0.75,offset=0.01)
axis(side=4, at = pretty(range(daplot[,5])))
mtext("Calinski-Harabász Index", side=4, line=3)
par(std.par)
This, too, can be turned into a function:
statiner.hclust <- function(data,hclust) {
n <- length(hclust$labels) # number of units
maxn <- n-1 # tree size
index <- hclust$height # hierarchy index
ast <- rep("",maxn) # local-maximum flags
wss <- rep(0,maxn) # within ss
wssd <- wss # first differences
wssd2 <- wss # second differences
indx <- hclust$height # index
tss <- css(data)
for (i in 1:maxn) {
wss[i] <- wrap((maxn-i+1),hclust,data) # fill wss
}
bss <- tss - wss # between ss
wssd <- c(wss[2:maxn] - wss[1:(maxn-1)],0) # first differences
wssd2 <- c(wssd[2:maxn] - wssd[1:(maxn-1)],0) # second differences
df1 <- c(maxn:1)-1 # degrees of freedom
df2 <- (n-df1+1)
CH <- (bss/df1)/(wss/df2) # Calinski-Harabasz index
CH[maxn] <- 0 # fix NaN
for (i in 2:(maxn-1)) { # detect
if ((CH[i] > CH[i-1]) & (CH[i] > CH[i+1])) { # local
ast[i] <- "*" # maxima
}
}
index <- cbind(index,bss,wss,wssd,wssd2,CH) # table
dindex <- data.frame(c(maxn:1),hclust$merge,round(index,4),ast) # full table
colnames(dindex) <- c("ngr","g1","g2","index","bss","wss", # labels
"Dwss","D2wss","CH","max")
rownames(dindex) <- c(1:maxn)
##########
daplot <- dindex[maxn:1,c(5:10)] # data for the plot
methods <- paste("Dissimilarity:",hclust$dist.method, # plot titles
", agglomeration:",hclust$method)
std.par <- par(mar=c(5, 4, 4, 4)+.1) # par
plot(1:maxn, daplot[,1], type="l", col="violet", xlab="Number of Clusters",
ylim = range(daplot[,1:4]),
ylab="Inertia", main = "Statistics on inertia variation",sub = methods)
lines(1:maxn, daplot[,2],col="black")
lines(1:maxn, daplot[,3],col="blue")
lines(1:maxn, daplot[,4],col="green")
legend("topright",legend=c("Between inertia","Within inertia","Differences",
"Second differences","Calinski index"),
col=c("violet","black","blue","green","red"),lty=1,cex=0.75)
par(new = TRUE)
plot(1:maxn, daplot[,5], type = "l", axes = FALSE, bty = "n", xlab = "",
ylab = "",col="red",ylim=c(range(daplot[,5])*1.5))
text(1:maxn,daplot[,5], labels=daplot[,6],pos=3,col="red",cex=0.75,offset=0.01)
axis(side=4, at = pretty(range(daplot[,5])))
mtext("Calinski-Harabász Index", side=4, line=3)
par(std.par)
############
dindex
}
With the following results:
statiner.hclust(linnerud,lin.clust)
## ngr g1 g2 index bss wss Dwss D2wss CH max
## 1 19 -18 -19 9.7980 137583.50 48.000 69.0000 58.0000 477.7205
## 2 18 -1 -5 11.7473 137514.50 117.000 127.0000 80.5000 276.5500
## 3 17 -8 -11 15.9374 137387.50 244.000 207.5000 30.0000 175.9574
## 4 16 -12 -13 20.3715 137180.00 451.500 237.5000 9.5000 121.5327
## 5 15 -2 -6 21.7945 136942.50 689.000 247.0000 17.3333 99.3777
## 6 14 -15 -17 22.2261 136695.50 936.000 264.3333 316.1667 89.8721
## 7 13 -4 3 22.9928 136431.17 1200.333 580.5000 213.9167 85.2458
## 8 12 -7 5 34.0735 135850.67 1780.833 794.4167 472.7500 69.3499
## 9 11 -20 7 39.8602 135056.25 2575.250 1267.1667 676.8333 57.6883
## 10 10 -16 4 50.3422 133789.08 3842.417 1944.0000 327.0000 46.4253
## 11 9 -9 1 62.3538 131845.08 5786.417 2271.0000 876.6667 37.0261
## 12 8 -3 8 67.3944 129574.08 8057.417 3147.6667 481.4583 32.1627
## 13 7 -14 6 79.3431 126426.42 11205.083 3629.1250 232.5417 28.2074
## 14 6 9 12 85.1954 122797.29 14834.208 3861.6667 623.3583 26.4895
## 15 5 10 11 87.8825 118935.62 18695.875 4485.0250 5915.4853 27.0368
## 16 4 2 14 94.7103 114450.60 23180.900 10400.5103 13177.7278 29.6237 *
## 17 3 13 16 144.2256 104050.09 33581.410 23578.2381 56893.6136 29.4352
## 18 2 -10 15 217.1554 80471.85 57159.648 80471.8516 -80471.8516 28.1569
## 19 1 17 18 401.1779 0.00 137631.500 0.0000 0.0000 0.0000
Finally, we plot the dendrogram with the classes marked:
plot(lin.clust,hang=-1)
rect.hclust(lin.clust,k=4)
abline(h=120,col="red")
# the following is to get numbers close to the nodes
# hiera <- series.hfc$hier
#lab <- c("*06*","*07*","*08*","*09*")
#xcoor <- c(1.5,2.25,4.5,3.375)
#plot(as.dendrogram(hiera),main= paste(namedata,"- HFC Dendrogram"),
# ylab="Index",ylim=c(0,1.2),cex=0.7)
#text(xcoor,hiera$height,pos=3,labels=lab,cex=0.7,offset=0.2)
#abline(h=0.7,col="red")
Now we carry out a hierarchical classification on the Ellenberg data. We begin by setting up the functions for computing the similarity matrix.
matsim <- function(data,simil=NULL,rows=FALSE) {
if(rows) {
data <- t(as.matrix(data)) # if by rows, transpose
}
nr <- dim(data)[1] # number of rows
nc <- dim(data)[2] # number of columns
namcols <- colnames(data) # column names
ass <- matrix(0,nrow=nc,ncol=nc, dimnames = list(namcols,namcols)) # empty matrix
for (i in (1:nc)) { # double loop over the columns
for (j in (1:i)) { # to fill half of the matrix
a <- 0 # initialization
b <- 0
c <- 0
d <- 0
for (k in 1:nr) { # loop over the rows
if (data[k,i] == 1){
if (data[k,j] == 1){
a <- a+1 # a (p/p)
} else {
b <- b+1 # b (p/a)
}
} else {
if (data[k,j] == 1){
c <- c+1 # c (a/p)
} else {
d <- d+1 # d (a/a)
}
}
}
ass[i,j] <- similitud(a,b,c,d,simil) # store the value in ass
if (i!=j) ass [j,i] <- ass[i,j] # and in the symmetric half
}
}
ass
}
The function similitud computes the chosen similarity index.
similitud <- function(a,b,c,d,simil) {
if (simil=="sorensen") {
sim <- 2*a/(2*a+b+c) # Sorensen
} else if (simil=="vandermaarel") {
sim <- (2*a-(b+c))/(2*a+(b+c)) # Van der Maarel
}
sim
}
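For example (a small check of the two formulas, not in the original notes):
similitud(a = 10, b = 2, c = 3, d = 5, simil = "sorensen")      # 2*10/(2*10+2+3) = 0.8
similitud(a = 10, b = 2, c = 3, d = 5, simil = "vandermaarel")  # (20 - 5)/(20 + 5) = 0.6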
Here we try complete linkage with the Sørensen index.
ell <- read.csv("Ellenberg.csv",header=TRUE,row.names=1)
ell.sor <- matsim(ell,simil="sorensen")
eld <- as.dist(1-ell.sor)
ell.hcl <- hclust(eld,method="complete")
st.cl <- stat.hclust(ell.hcl)
st.cl
## ngr g1 g2 index D D2
## 1 24 -5 -13 0.1594 0.0406 -0.0167
## 2 23 -1 -4 0.2000 0.0239 -0.0097
## 3 22 -17 -21 0.2239 0.0142 -0.0023
## 4 21 -22 1 0.2381 0.0119 -0.0119
## 5 20 -6 4 0.2500 0.0000 0.0167
## 6 19 -20 3 0.2500 0.0167 -0.0141
## 7 18 -15 -24 0.2667 0.0026 -0.0020
## 8 17 -11 -19 0.2692 0.0006 0.0410
## 9 16 -9 -10 0.2698 0.0416 -0.0406
## 10 15 -23 5 0.3115 0.0010 0.0198
## 11 14 -8 -12 0.3125 0.0208 -0.0104
## 12 13 -3 -16 0.3333 0.0104 -0.0104
## 13 12 -7 10 0.3438 0.0000 0.0122
## 14 11 6 11 0.3438 0.0122 -0.0014
## 15 10 7 9 0.3559 0.0107 0.0364
## 16 9 13 14 0.3667 0.0471 -0.0234
## 17 8 -25 8 0.4138 0.0237 -0.0237
## 18 7 2 12 0.4375 0.0000 0.0240
## 19 6 -14 16 0.4375 0.0240 -0.0041
## 20 5 -2 -18 0.4615 0.0199 -0.0105
## 21 4 15 18 0.4815 0.0094 0.0923
## 22 3 17 19 0.4909 0.1017 0.0148
## 23 2 20 21 0.5926 0.1165 -0.1165
## 24 1 22 23 0.7091 0.0000 0.0000
plot(ell.hcl,hang=-1)
abline(h=.39,col="blue")
abline(h=.55,col="red")
# the cut line should be drawn halfway between the heights of the two relevant nodes
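A hedged sketch of that idea, placing the cut for k classes halfway between the two relevant merge heights (here k = 4 as an example):
m <- length(ell.hcl$height)   # number of merges (n - 1)
k <- 4                        # desired number of classes
abline(h = mean(ell.hcl$height[(m - k + 1):(m - k + 2)]), col = "red", lty = 2)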
ell.hcl <- hclust(eld,method="single")
st.cl <- stat.hclust(ell.hcl)
st.cl
## ngr g1 g2 index D D2
## 1 24 -5 -13 0.1594 0.0290 -0.0174
## 2 23 -6 1 0.1884 0.0116 -0.0047
## 3 22 -1 -4 0.2000 0.0069 -0.0007
## 4 21 -22 2 0.2069 0.0062 0.0045
## 5 20 -23 4 0.2131 0.0108 -0.0013
## 6 19 -17 -21 0.2239 0.0095 -0.0047
## 7 18 5 6 0.2333 0.0048 0.0030
## 8 17 -20 7 0.2381 0.0078 -0.0037
## 9 16 -11 8 0.2459 0.0041 0.0074
## 10 15 -7 9 0.2500 0.0115 -0.0064
## 11 14 -8 10 0.2615 0.0051 -0.0026
## 12 13 -15 -24 0.2667 0.0026 -0.0020
## 13 12 -19 11 0.2692 0.0006 -0.0006
## 14 11 -12 13 0.2698 0.0000 0.0132
## 15 10 -9 -10 0.2698 0.0132 0.0153
## 16 9 -18 14 0.2830 0.0285 -0.0256
## 17 8 3 12 0.3115 0.0028 0.0043
## 18 7 -16 16 0.3143 0.0071 -0.0065
## 19 6 17 18 0.3214 0.0006 0.0107
## 20 5 15 19 0.3220 0.0113 -0.0113
## 21 4 -3 20 0.3333 0.0000 0.0109
## 22 3 -25 21 0.3333 0.0109 0.0266
## 23 2 -14 22 0.3443 0.0376 -0.0376
## 24 1 -2 23 0.3818 0.0000 0.0000
plot(ell.hcl,hang=-1)
abline(h=.30,col="blue")
abline(h=.328,col="red")
# the cut line should be drawn halfway between the heights of the two relevant nodes