Hierarchical Clustering with R

Hierarchical Clustering

Description
Hierarchical cluster analysis on a set of dissimilarities and methods for analyzing it.

Usage
hclust(d, method = "complete", members = NULL)

S3 method for class 'hclust'
plot(x, labels = NULL, hang = 0.1, check = TRUE, axes = TRUE, frame.plot = FALSE,
ann = TRUE, main = "Cluster Dendrogram", sub = NULL, xlab = NULL, ylab = "Height", ...)

Arguments
d a dissimilarity structure as produced by dist.
method the agglomeration method to be used. This should be (an unambiguous abbreviation of) one of "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).
members NULL or a vector with length size of d. See the ‘Details’ section.
x an object of the type produced by hclust.
hang The fraction of the plot height by which labels should hang below the rest of the plot. A negative value will cause the labels to hang down from 0.
check logical indicating if the x object should be checked for validity. This check is not necessary when x is known to be valid such as when it is the direct result of hclust(). The default is check=TRUE, as invalid inputs may crash R due to memory violation in the internal C plotting code.
labels A character vector of labels for the leaves of the tree. By default the row names or row numbers of the original data are used. If labels = FALSE no labels at all are plotted.
axes, frame.plot, ann logical flags as in plot.default.
main, sub, xlab, ylab character strings for title. sub and xlab have a non-NULL default when there’s a tree$call.
... Further graphical arguments. E.g., cex controls the size of the labels (if plotted) in the same way as text.

Details This function performs a hierarchical cluster analysis using a set of dissimilarities for the \(n\) objects being clustered. Initially, each object is assigned to its own cluster and then the algorithm proceeds iteratively, at each stage joining the two most similar clusters, continuing until there is just a single cluster. At each stage distances between clusters are recomputed by the Lance–Williams dissimilarity update formula according to the particular clustering method being used.
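
For reference, the Lance–Williams update expresses the dissimilarity between a cluster \(k\) and the newly merged cluster \(i \cup j\) in terms of the previous dissimilarities, with coefficients that depend on the method:

\[ d(k,\, i \cup j) = \alpha_i\, d(k,i) + \alpha_j\, d(k,j) + \beta\, d(i,j) + \gamma\, \lvert d(k,i) - d(k,j) \rvert . \]

Single linkage, for example, corresponds to \(\alpha_i = \alpha_j = 1/2\), \(\beta = 0\), \(\gamma = -1/2\).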

A number of different clustering methods are provided. Ward’s minimum variance method aims at finding compact, spherical clusters. The complete linkage method finds similar clusters. The single linkage method (which is closely related to the minimal spanning tree) adopts a ‘friends of friends’ clustering strategy. The other methods can be regarded as aiming for clusters with characteristics somewhere between the single and complete link methods. Note, however, that methods "median" and "centroid" do not lead to a monotone distance measure; equivalently, the resulting dendrograms can have so-called inversions or reversals, which are hard to interpret, but note the trichotomies in Legendre and Legendre (2012).

Two different algorithms are found in the literature for Ward clustering. The one used by option "ward.D" (equivalent to the only Ward option "ward" in R versions ≤ 3.0.3) does not implement Ward’s (1963) clustering criterion, whereas option "ward.D2" implements that criterion (Murtagh and Legendre 2014). With the latter, the dissimilarities are squared before cluster updating. Note that agnes(, method="ward") corresponds to hclust(, "ward.D2").
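
In practice the relation can be checked directly (a small sketch, not from the help page; it relies on the equivalence just described, with heights on the squared scale):

d  <- dist(USArrests)
h1 <- hclust(d,   method = "ward.D2")   # squares the dissimilarities internally
h2 <- hclust(d^2, method = "ward.D")    # pre-squared dissimilarities
all.equal(h1$height, sqrt(h2$height))   # expected TRUE: same tree, heights on the root scale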

If members != NULL, then d is taken to be a dissimilarity matrix between clusters instead of dissimilarities between singletons and members gives the number of observations per cluster. This way the hierarchical cluster algorithm can be ‘started in the middle of the dendrogram’, e.g., in order to reconstruct the part of the tree above a cut (see examples). Dissimilarities between clusters can be efficiently computed (i.e., without hclust itself) only for a limited number of distance/linkage combinations, the simplest one being squared Euclidean distance and centroid linkage. In this case the dissimilarities between the clusters are the squared Euclidean distances between cluster means.

In hierarchical cluster displays, a decision is needed at each merge to specify which subtree should go on the left and which on the right. Since, for \(n\) observations there are \(n-1\) merges, there are \(2^{(n-1)}\) possible orderings for the leaves in a cluster tree, or dendrogram. The algorithm used in hclust is to order the subtree so that the tighter cluster is on the left (the last, i.e., most recent, merge of the left subtree is at a lower value than the last merge of the right subtree). Single observations are the tightest clusters possible, and merges involving two observations place them in order by their observation sequence number.
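
The order component returned below reflects exactly this left-to-right arrangement; for instance (a small illustration, assuming the default row-name labels):

hc <- hclust(dist(USArrests), "average")
hc$labels[hc$order]                       # leaves in the left-to-right order of the plot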

Value An object of class hclust which describes the tree produced by the clustering process. The object is a list with components:
merge an \(n-1 \times 2\) matrix. Row \(i\) of merge describes the merging of clusters at step \(i\) of the clustering. If an element \(j\) in the row is negative, then observation \(-j\) was merged at this stage. If \(j\) is positive then the merge was with the cluster formed at the (earlier) stage \(j\) of the algorithm. Thus negative entries in merge indicate agglomerations of singletons, and positive entries indicate agglomerations of non-singletons (see the sketch after this list).
height a set of \(n-1\) real values (non-decreasing for ultrametric trees). The clustering height: that is, the value of the criterion associated with the clustering method for the particular agglomeration.
order a vector giving the permutation of the original observations suitable for plotting, in the sense that a cluster plot using this ordering and matrix merge will not have crossings of the branches.
labels labels for each of the objects being clustered.
call the call which produced the result.
method the cluster method that has been used.
dist.method the distance that has been used to create \(d\) (only returned if the distance object has a “method” attribute).
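
As an illustration of how merge encodes the tree (a sketch, not part of the help page), the observations in the cluster formed at step i can be recovered by recursing over the matrix:

members.at.step    <- function(merge, i) {                    # hypothetical helper
  unlist(lapply(merge[i, ], function(j)
    if (j < 0) -j else members.at.step(merge, j)))            # negative entry = singleton
}
# e.g., members.at.step(hc$merge, 3) returns the observations merged at step 3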

There are print, plot and identify (see identify.hclust) methods and the rect.hclust() function for hclust objects.

Note Method "centroid" is typically meant to be used with squared Euclidean distances.
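
A quick way to see the consequence (an illustration, not from the help page): with unsquared Euclidean distances, centroid linkage usually yields non-monotone heights, i.e. inversions.

hc.cen <- hclust(dist(USArrests), method = "centroid")
any(diff(hc.cen$height) < 0)              # TRUE indicates reversals in the dendrogram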

Author(s)
The hclust function is based on Fortran code contributed to STATLIB by F. Murtagh.

References
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988). The New S Language. Wadsworth & Brooks/Cole. (S version.)

Everitt, B. (1974). Cluster Analysis. London: Heinemann Educ. Books.

Hartigan, J.A. (1975). Clustering Algorithms. New York: Wiley.

Sneath, P. H. A. and R. R. Sokal (1973). Numerical Taxonomy. San Francisco: Freeman.

Anderberg, M. R. (1973). Cluster Analysis for Applications. Academic Press: New York.

Gordon, A. D. (1999). Classification. Second Edition. London: Chapman and Hall / CRC

Murtagh, F. (1985). “Multidimensional Clustering Algorithms”, in COMPSTAT Lectures 4. Wuerzburg: Physica-Verlag (for algorithmic details of algorithms used).

McQuitty, L.L. (1966). Similarity Analysis by Reciprocal Pairs for Discrete and Continuous Data. Educational and Psychological Measurement, 26, 825–831. doi:10.1177/001316446602600402.

Legendre, P. and L. Legendre (2012). Numerical Ecology, 3rd English ed. Amsterdam: Elsevier Science BV.

Murtagh, Fionn and Legendre, Pierre (2014). Ward’s hierarchical agglomerative clustering method: which algorithms implement Ward’s criterion? Journal of Classification, 31, 274–295. doi:10.1007/s00357-014-9161-z.

See Also: identify.hclust, rect.hclust, cutree, dendrogram, kmeans.

For the Lance–Williams formula and methods that apply it generally, see agnes from package cluster.

Examples

Example 1: Violent crime rates by US state

hclust with Euclidean distance and average linkage

require(graphics)                                                               # base graphics is needed
hc                 <- hclust(dist(USArrests), "ave")                            # clustering
plot(hc)                                                                        # dendrogram

plot(hc, hang = -1)                                                             # aligned labels

hclust with squared Euclidean distance and centroid linkage

## Do the same with centroid clustering and *squared* Euclidean distance,
## cut the tree into ten clusters and reconstruct the upper part of the
## tree from the cluster centers.
hc                 <- hclust(dist(USArrests)^2, "cen")
memb               <- cutree(hc, k = 10)                                        # cuts the tree into 10 clusters
cent <- NULL
for(k in 1:10){
  cent <- rbind(cent, colMeans(USArrests[memb == k, , drop = FALSE]))           # means of each class
}
# clustering restarted from the ten centroids
hc1 <- hclust(dist(cent)^2, method = "cen", members = table(memb))
opar <- par(mfrow = c(1, 2))                                                    # two plots side by side
plot(hc,  labels = FALSE, hang = -1, main = "Original Tree")
plot(hc1, labels = FALSE, hang = -1, main = "Re-start from 10 clusters")

par(opar)                                                                       # restores graphical parameters

Example 2: Straight-line distances among 10 US cities

Compare the results of algorithms "ward.D" and "ward.D2"

Note that cmdscale() is used to construct coordinates for plotting the cities.

mds2               <- -cmdscale(UScitiesD)                                      # MDS of the cities
plot(mds2, type="n", axes=FALSE, ann=FALSE)                                     # empty plot
text(mds2, labels=rownames(mds2), xpd = NA)                                     # labels only

# compare the two Ward algorithms
hcity.D  <- hclust(UScitiesD, "ward.D")                                         # "wrong"
hcity.D2 <- hclust(UScitiesD, "ward.D2")                                        # "right"
opar <- par(mfrow = c(1, 2))
  plot(hcity.D,  hang=-1)
  plot(hcity.D2, hang=-1)

par(opar)

Here we try out a hierarchical clustering with hclust. The distance matrix has to be computed beforehand.

We try it with the Linnerud data.

linnerud <- read.csv("Linnerud.csv",header = TRUE, row.names = 1)
n                  <- dim(linnerud)[1]
lin.dis            <- dist(linnerud); round(lin.dis,3)                          # Euclidean distance
##          1       2       3       4       5       6       7       8       9      10      11      12      13      14      15      16      17      18      19
## 2   52.173                                                                                                                                                
## 3   74.317  43.761                                                                                                                                        
## 4   69.376  38.562  71.400                                                                                                                                
## 5   11.747  46.797  70.250  62.666                                                                                                                        
## 6   64.521  21.794  60.614  23.281  58.335                                                                                                                
## 7   68.220  33.196  65.673  49.790  62.722  29.614                                                                                                        
## 8   49.497  34.612  70.915  21.794  44.204  28.792  50.517                                                                                                
## 9   52.631  96.737 118.849  96.948  57.637 101.543 107.014  77.408                                                                                        
## 10 213.492 239.714 215.119 258.484 217.773 258.314 266.081 245.526 217.984                                                                                
## 11  53.666  34.957  70.788  21.095  45.365  27.477  47.518  15.937  83.917 249.734                                                                        
## 12  77.660 117.009 113.437 131.263  82.722 132.559 141.071 113.864  79.480 141.711 118.571                                                                
## 13  80.523 120.715 120.785 129.603  85.790 133.854 144.395 112.134  71.021 149.650 118.085  20.372                                                        
## 14 126.079  84.558  91.471 103.947 121.384  83.875  64.730 111.328 169.228 299.184 107.369 191.580 198.514                                                
## 15  96.571  50.100  77.750  50.150  89.554  36.152  37.974  63.103 134.759 287.181  57.149 165.768 168.565  61.790                                        
## 16  78.861 118.191 111.086 139.707  85.434 135.890 136.890 122.061  85.820 144.783 126.850  37.921  50.705 181.304 167.018                                
## 17 108.959  62.466  88.470  49.970 102.137  44.844  55.642  67.676 142.766 296.275  63.340 175.502 176.403  76.629  22.226 179.803                        
## 18  78.968 126.214 135.834 132.699  84.747 137.080 146.226 113.217  57.983 171.473 118.541  41.364  31.875 204.206 171.388  64.428 179.900                
## 19  74.081 121.128 132.586 125.734  79.284 130.947 140.371 106.433  50.180 178.933 111.535  45.793  35.071 199.539 165.263  68.110 173.396   9.798        
## 20  78.384  56.232  81.823  27.875  73.892  46.637  75.100  34.000  98.803 251.704  40.447 127.848 123.637 126.586  72.650 142.046  66.963 128.328 121.713
lin.clust          <- hclust(lin.dis,method="ward.D2"); lin.clust               # clustering
## 
## Call:
## hclust(d = lin.dis, method = "ward.D2")
## 
## Cluster method   : ward.D2 
## Distance         : euclidean 
## Number of objects: 20
str(lin.clust)                                                                  # structure of the output
## List of 7
##  $ merge      : int [1:19, 1:2] -18 -1 -8 -12 -2 -15 -4 -7 -20 -16 ...
##  $ height     : num [1:19] 9.8 11.7 15.9 20.4 21.8 ...
##  $ order      : int [1:20] 14 15 17 1 5 20 4 8 11 3 ...
##  $ labels     : chr [1:20] "1" "2" "3" "4" ...
##  $ method     : chr "ward.D2"
##  $ call       : language hclust(d = lin.dis, method = "ward.D2")
##  $ dist.method: chr "euclidean"
##  - attr(*, "class")= chr "hclust"
plot(lin.clust)                                                                 # dendrogram

plot(lin.clust, hang=-1, main="Linnerud data")                                  # improved dendrogram

hclust returns many components; let us display them all:

noquote("contenido de hclust")
## [1] contenido de hclust
summary(lin.clust)
##             Length Class  Mode     
## merge       38     -none- numeric  
## height      19     -none- numeric  
## order       20     -none- numeric  
## labels      20     -none- character
## method       1     -none- character
## call         3     -none- call     
## dist.method  1     -none- character
noquote("pareja de nodos que se juntan")
## [1] pareja de nodos que se juntan
lin.clust$merge
##       [,1] [,2]
##  [1,]  -18  -19
##  [2,]   -1   -5
##  [3,]   -8  -11
##  [4,]  -12  -13
##  [5,]   -2   -6
##  [6,]  -15  -17
##  [7,]   -4    3
##  [8,]   -7    5
##  [9,]  -20    7
## [10,]  -16    4
## [11,]   -9    1
## [12,]   -3    8
## [13,]  -14    6
## [14,]    9   12
## [15,]   10   11
## [16,]    2   14
## [17,]   13   16
## [18,]  -10   15
## [19,]   17   18
noquote("índice de la jerarquía")
## [1] índice de la jerarquía
lin.clust$height
##  [1]   9.797959  11.747340  15.937377  20.371549  21.794495  22.226111  22.992752  34.073450  39.860172  50.342163  62.353829  67.394362  79.343137  85.195364  87.882497  94.710348 144.225589 217.155419 401.177895
noquote("orden horizontal de las hojas")
## [1] orden horizontal de las hojas
lin.clust$order
##  [1] 14 15 17  1  5 20  4  8 11  3  7  2  6 10 16 12 13  9 18 19
noquote("etiquetas de las unidades en orden original")
## [1] etiquetas de las unidades en orden original
lin.clust$labels
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "20"
noquote("método de aglomeración")
## [1] método de aglomeración
lin.clust$method
## [1] "ward.D2"
noquote("comando")
## [1] comando
lin.clust$call
## hclust(d = lin.dis, method = "ward.D2")
noquote("tipo de distancia empleado")
## [1] tipo de distancia empleado
lin.clust$dist.method
## [1] "euclidean"
noquote("hacemos todas las particiones")
## [1] hacemos todas las particiones
part               <- cutree(lin.clust, k = 1:(n-1))                            # particiones
part
##    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
## 1  1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  1
## 2  1 1 1 1 2 2 2 2 2  2  2  2  2  2  2  2  2  2  2
## 3  1 1 1 1 2 2 2 2 3  3  3  3  3  3  3  3  3  3  3
## 4  1 1 1 1 2 2 3 3 4  4  4  4  4  4  4  4  4  4  4
## 5  1 1 1 1 1 1 1 1 1  1  1  1  1  1  1  1  1  1  5
## 6  1 1 1 1 2 2 2 2 2  2  2  2  2  2  2  5  5  5  6
## 7  1 1 1 1 2 2 2 2 2  2  2  2  5  5  5  6  6  6  7
## 8  1 1 1 1 2 2 3 3 4  4  4  4  4  6  6  7  7  7  8
## 9  1 2 2 2 3 3 4 4 5  5  5  5  6  7  7  8  8  8  9
## 10 1 2 3 3 4 4 5 5 6  6  6  6  7  8  8  9  9  9 10
## 11 1 1 1 1 2 2 3 3 4  4  4  4  4  6  6  7  7 10 11
## 12 1 2 2 2 3 5 6 6 7  7  7  7  8  9  9 10 10 11 12
## 13 1 2 2 2 3 5 6 6 7  7  7  7  8  9  9 10 11 12 13
## 14 1 1 1 4 5 6 7 7 8  8  8  8  9 10 10 11 12 13 14
## 15 1 1 1 4 5 6 7 8 9  9  9  9 10 11 11 12 13 14 15
## 16 1 2 2 2 3 5 6 6 7  7 10 10 11 12 12 13 14 15 16
## 17 1 1 1 4 5 6 7 8 9  9  9  9 10 11 13 14 15 16 17
## 18 1 2 2 2 3 3 4 4 5 10 11 11 12 13 14 15 16 17 18
## 19 1 2 2 2 3 3 4 4 5 10 11 11 12 13 14 15 16 17 18
## 20 1 1 1 1 2 2 3 3 4  4  4 12 13 14 15 16 17 18 19

Each column gives, for each row (unit), the class it belongs to in the corresponding partition.
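
A single column of part should coincide with the corresponding direct call to cutree (a small check; the column names of part are the values of k as character strings):

identical(part[, "4"], cutree(lin.clust, k = 4))                                # expected TRUE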

Let us now organize this output more conveniently, building a table of values for each level of the hierarchy:

data.hcl           <- lin.clust                                                 # hclust output
maxn               <- n-1                                                       # tree size
index              <- lin.clust$height                                          # hierarchy index
Di                 <- c(index[2:maxn]-index[1:(maxn-1)],0)                      # first differences
Di2                <- c(Di[2:maxn]-Di[1:(maxn-1)],0)                            # second differences
index              <- cbind(c((maxn):1),lin.clust$merge,index,Di,Di2)           # table
rownames(index)    <- c(1:maxn)
colnames(index)    <- c("ngr","g1","g2","index","D","D2")
dindex             <- as.data.frame(cbind(index[,1:3],round(index[,4:6],4)))
dindex
##    ngr  g1  g2    index        D        D2
## 1   19 -18 -19   9.7980   1.9494    2.2407
## 2   18  -1  -5  11.7473   4.1900    0.2441
## 3   17  -8 -11  15.9374   4.4342   -3.0112
## 4   16 -12 -13  20.3715   1.4229   -0.9913
## 5   15  -2  -6  21.7945   0.4316    0.3350
## 6   14 -15 -17  22.2261   0.7666   10.3141
## 7   13  -4   3  22.9928  11.0807   -5.2940
## 8   12  -7   5  34.0735   5.7867    4.6953
## 9   11 -20   7  39.8602  10.4820    1.5297
## 10  10 -16   4  50.3422  12.0117   -6.9711
## 11   9  -9   1  62.3538   5.0405    6.9082
## 12   8  -3   8  67.3944  11.9488   -6.0965
## 13   7 -14   6  79.3431   5.8522   -3.1651
## 14   6   9  12  85.1954   2.6871    4.1407
## 15   5  10  11  87.8825   6.8279   42.6874
## 16   4   2  14  94.7103  49.5152   23.4146
## 17   3  13  16 144.2256  72.9298  111.0926
## 18   2 -10  15 217.1554 184.0225 -184.0225
## 19   1  17  18 401.1779   0.0000    0.0000
daplot             <- dindex[maxn:1,4:6]
methods            <- paste("Dissimilarity:",data.hcl$dist.method,
                            ", agglomeration:",data.hcl$method)
########
plot(1:maxn, daplot[,1], type="l", xlab="Number of Clusters",                   # plot
       ylim = range(daplot[,1:3]),                                              # keep all three series in view
       ylab="Index", main = "Statistics of hierarchy indices",sub=methods)
  lines(1:maxn, daplot[,2],col="blue")
  lines(1:maxn, daplot[,3],col="green")
  legend("topright",legend=c("Index","Differences",
         "Second differences"),
         col=c("black","blue","green"),lty=1,cex=0.75)

  #######

From this we can also make a function:

stat.hclust        <- function(hclust) {
  n                  <- length(hclust$labels)                                   # number of units
  maxn               <- n-1                                                     # tree size
  index              <- hclust$height                                           # hierarchy index
  Di                 <- c(index[2:maxn]-index[1:(maxn-1)],0)                    # first differences
  Di2                <- c(Di[2:maxn]-Di[1:(maxn-1)],0)                          # second differences
  index              <- cbind(c((maxn):1),hclust$merge,index,Di,Di2)            # table
  rownames(index)    <- c(1:maxn)                                               # labels
  colnames(index)    <- c("ngr","g1","g2","index","D","D2")
  dindex             <- as.data.frame(cbind(index[,1:3],round(index[,4:6],4)))  # output
  daplot             <- index[maxn:1,4:6]                                       # data for the plot
  methods            <- paste("Dissimilarity:",hclust$dist.method,              # plot titles
                              ", agglomeration:",hclust$method)
########
  plot(1:maxn, daplot[,1], type="l", xlab="Number of Clusters", 
       ylim = range(daplot[,1:3]),                                              # plot
       ylab="Index", main = "Statistics of hierarchy indices",sub=methods)
    lines(1:maxn, daplot[,2],col="blue")                                        # the two difference series
    lines(1:maxn, daplot[,3],col="green")
    legend("topright",legend=c("Index","Differences",                           # legend
           "Second differences"),col=c("black","blue","green"),lty=1,cex=0.75)
  #######
  dindex
}

The result is the following:

stat.hclust(lin.clust)

##    ngr  g1  g2    index        D        D2
## 1   19 -18 -19   9.7980   1.9494    2.2407
## 2   18  -1  -5  11.7473   4.1900    0.2441
## 3   17  -8 -11  15.9374   4.4342   -3.0112
## 4   16 -12 -13  20.3715   1.4229   -0.9913
## 5   15  -2  -6  21.7945   0.4316    0.3350
## 6   14 -15 -17  22.2261   0.7666   10.3141
## 7   13  -4   3  22.9928  11.0807   -5.2940
## 8   12  -7   5  34.0735   5.7867    4.6953
## 9   11 -20   7  39.8602  10.4820    1.5297
## 10  10 -16   4  50.3422  12.0117   -6.9711
## 11   9  -9   1  62.3538   5.0405    6.9082
## 12   8  -3   8  67.3944  11.9488   -6.0965
## 13   7 -14   6  79.3431   5.8522   -3.1651
## 14   6   9  12  85.1954   2.6871    4.1407
## 15   5  10  11  87.8825   6.8279   42.6874
## 16   4   2  14  94.7103  49.5152   23.4146
## 17   3  13  16 144.2256  72.9298  111.0926
## 18   2 -10  15 217.1554 184.0225 -184.0225
## 19   1  17  18 401.1779   0.0000    0.0000

With Euclidean distance, a treatment similar to the one used for k-means is possible.
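
This rests on Huygens’ decomposition of the total inertia: for any partition into \(k\) classes,

\[ TSS = WSS(k) + BSS(k), \]

where \(TSS\) is the total sum of squares and \(WSS(k)\), \(BSS(k)\) are the within- and between-class sums of squares. The bss <- tss - wss step below is exactly this identity.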

First, we introduce two functions to compute the inertias.

####### this function computes the centered sum of squares of a data.frame
css                <- function(data) {             # Centered sum of squares of a data.frame
  sum(scale(data, center = TRUE, scale = FALSE)^2)
} 
####### this function computes the within-class sum of squares of an hclust partition
wrap               <- function(i, hc, x) {         # i = # of clusters, hc = hclust, x = data
  cl               <- cutree(hc, i)                # gets the partition in i groups 
  spl              <- split(x, cl)                 # splits the data in classes
  wss              <- sum(sapply(spl, css))        # computes the wss
  wss 
}
################################################################################
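
A quick sanity check of the two helpers (a small sketch): with a single class, the within-class sum of squares must equal the total one.

all.equal(wrap(1, lin.clust, linnerud), css(linnerud))                          # expected TRUE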

Then a similar table is built, but with inertias.
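
For reference, the Calinski-Harabasz criterion for \(k\) classes of \(n\) units is usually defined as

\[ CH(k) = \frac{BSS(k)/(k-1)}{WSS(k)/(n-k)}. \]

The code below uses df2 = n - df1 + 1, that is \(n-k+2\), in the denominator rather than the textbook \(n-k\); the printed values correspond to this choice.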

  data             <- linnerud                                                  # example data
  hclust           <- lin.clust                                                 # hclust output
  n                <- length(hclust$labels)                                     # number of units
  maxn             <- n-1                                                       # tree size
  index            <- hclust$height                                             # hierarchy index
  ast              <- rep("",maxn)                                              # local maxima markers
  wss              <- rep(0,maxn)                                               # within ss
  wssd             <- wss                                                       # first differences
  wssd2            <- wss                                                       # second differences
  indx             <- hclust$height                                             # index (not used)
  tss              <- css(data)
  for (i in 1:maxn) {                                                             
    wss[i]         <- wrap((maxn-i+1),hclust,data)                              # fills wss
  }
  bss              <- tss - wss                                                 # between ss
  wssd             <- c(wss[2:maxn] - wss[1:(maxn-1)],0)                        # first differences
  wssd2            <- c(wssd[2:maxn] - wssd[1:(maxn-1)],0)                      # second differences
  df1              <- c(maxn:1)-1                                               # degrees of freedom
  df2              <- (n-df1+1)                                                 # note: the textbook CH uses n-k
  CH               <- (bss/df1)/(wss/df2)                                       # Calinski index
  CH[maxn]         <- 0                                                         # fix NaN
  for (i in 2:(maxn-1)) {                                                       # detect
    if ((CH[i] > CH[i-1]) & (CH[i] > CH[i+1])) {                                # local 
      ast[i]       <- "*"                                                       # maxima  
    }
  }
  index            <- cbind(index,bss,wss,wssd,wssd2,CH)                        # table
  dindex           <- data.frame(c(maxn:1),hclust$merge,round(index,4),ast)     # full table
  colnames(dindex) <- c("ngr","g1","g2","index","bss","wss",                    # labels
                        "Dwss","D2wss","CH","max")
  rownames(dindex) <- c(1:maxn)
  dindex
##    ngr  g1  g2    index       bss        wss       Dwss       D2wss       CH max
## 1   19 -18 -19   9.7980 137583.50     48.000    69.0000     58.0000 477.7205    
## 2   18  -1  -5  11.7473 137514.50    117.000   127.0000     80.5000 276.5500    
## 3   17  -8 -11  15.9374 137387.50    244.000   207.5000     30.0000 175.9574    
## 4   16 -12 -13  20.3715 137180.00    451.500   237.5000      9.5000 121.5327    
## 5   15  -2  -6  21.7945 136942.50    689.000   247.0000     17.3333  99.3777    
## 6   14 -15 -17  22.2261 136695.50    936.000   264.3333    316.1667  89.8721    
## 7   13  -4   3  22.9928 136431.17   1200.333   580.5000    213.9167  85.2458    
## 8   12  -7   5  34.0735 135850.67   1780.833   794.4167    472.7500  69.3499    
## 9   11 -20   7  39.8602 135056.25   2575.250  1267.1667    676.8333  57.6883    
## 10  10 -16   4  50.3422 133789.08   3842.417  1944.0000    327.0000  46.4253    
## 11   9  -9   1  62.3538 131845.08   5786.417  2271.0000    876.6667  37.0261    
## 12   8  -3   8  67.3944 129574.08   8057.417  3147.6667    481.4583  32.1627    
## 13   7 -14   6  79.3431 126426.42  11205.083  3629.1250    232.5417  28.2074    
## 14   6   9  12  85.1954 122797.29  14834.208  3861.6667    623.3583  26.4895    
## 15   5  10  11  87.8825 118935.62  18695.875  4485.0250   5915.4853  27.0368    
## 16   4   2  14  94.7103 114450.60  23180.900 10400.5103  13177.7278  29.6237   *
## 17   3  13  16 144.2256 104050.09  33581.410 23578.2381  56893.6136  29.4352    
## 18   2 -10  15 217.1554  80471.85  57159.648 80471.8516 -80471.8516  28.1569    
## 19   1  17  18 401.1779      0.00 137631.500     0.0000      0.0000   0.0000

To plot the inertias, the order of the vectors has to be reversed.

  daplot             <- dindex[maxn:1,c(5:10)]
  methods            <- paste("Dissimilarity:",hclust$dist.method,              # plot titles
                              ", agglomeration:",hclust$method)
std.par            <- par(mar=c(5, 4, 4, 4)+.1)                                 # widen the right margin
  plot(1:maxn, daplot[,1], type="l", col="violet", xlab="Number of Clusters", 
       ylim = range(daplot[,1:4]),
       ylab="Inertia", main = "Statistics on inertia variation",sub = methods)
    lines(1:maxn, daplot[,2],col="black")
    lines(1:maxn, daplot[,3],col="blue")
    lines(1:maxn, daplot[,4],col="green")
    legend("topright",legend=c("Between inertia","Within inertia","Differences",
           "Second differences","Calinski index"),
           col=c("violet","black","blue","green","red"),lty=1,cex=0.75)
    par(new = TRUE)
    plot(1:maxn, daplot[,5], type = "l", axes = FALSE, bty = "n", xlab = "", 
         ylab = "",col="red",ylim=c(range(daplot[,5])*1.5))
      text(1:maxn,daplot[,5], labels=daplot[,6],pos=3,col="red",cex=0.75,offset=0.01)
      axis(side=4, at = pretty(range(daplot[,5])))
      mtext("Calinski-Harabasz Index", side=4, line=3)

    par(std.par)

From this, too, we make a function:

statiner.hclust    <- function(data,hclust) {
  n                <- length(hclust$labels)                                     # number of units
  maxn             <- n-1                                                       # tree size
  index            <- hclust$height                                             # hierarchy index
  ast              <- rep("",maxn)                                              # local maxima markers
  wss              <- rep(0,maxn)                                               # within ss
  wssd             <- wss                                                       # first differences
  wssd2            <- wss                                                       # second differences
  indx             <- hclust$height                                             # index (not used)
  tss              <- css(data)
  for (i in 1:maxn) {                                                             
    wss[i]         <- wrap((maxn-i+1),hclust,data)                              # fills wss
  }
  bss              <- tss - wss                                                 # between ss
  wssd             <- c(wss[2:maxn] - wss[1:(maxn-1)],0)                        # first differences
  wssd2            <- c(wssd[2:maxn] - wssd[1:(maxn-1)],0)                      # second differences
  df1              <- c(maxn:1)-1                                               # degrees of freedom
  df2              <- (n-df1+1)                                                 # note: the textbook CH uses n-k
  CH               <- (bss/df1)/(wss/df2)                                       # Calinski index
  CH[maxn]         <- 0                                                         # fix NaN
  for (i in 2:(maxn-1)) {                                                       # detect
    if ((CH[i] > CH[i-1]) & (CH[i] > CH[i+1])) {                                # local 
      ast[i]       <- "*"                                                       # maxima  
    }
  }
  index            <- cbind(index,bss,wss,wssd,wssd2,CH)                        # table
  dindex           <- data.frame(c(maxn:1),hclust$merge,round(index,4),ast)     # full table
  colnames(dindex) <- c("ngr","g1","g2","index","bss","wss",                    # labels
                        "Dwss","D2wss","CH","max")
  rownames(dindex) <- c(1:maxn)
  ##########
  daplot             <- dindex[maxn:1,c(5:10)]                                  # data for the plot
  methods            <- paste("Dissimilarity:",hclust$dist.method,              # plot titles
                              ", agglomeration:",hclust$method)
  std.par            <- par(mar=c(5, 4, 4, 4)+.1)                               # widen the right margin
  plot(1:maxn, daplot[,1], type="l", col="violet", xlab="Number of Clusters", 
       ylim = range(daplot[,1:4]),
       ylab="Inertia", main = "Statistics on inertia variation",sub = methods)
    lines(1:maxn, daplot[,2],col="black")
    lines(1:maxn, daplot[,3],col="blue")
    lines(1:maxn, daplot[,4],col="green")
    legend("topright",legend=c("Between inertia","Within inertia","Differences",
           "Second differences","Calinski index"),
           col=c("violet","black","blue","green","red"),lty=1,cex=0.75)
    par(new = TRUE)
    plot(1:maxn, daplot[,5], type = "l", axes = FALSE, bty = "n", xlab = "", 
         ylab = "",col="red",ylim=c(range(daplot[,5])*1.5))
      text(1:maxn,daplot[,5], labels=daplot[,6],pos=3,col="red",cex=0.75,offset=0.01)
      axis(side=4, at = pretty(range(daplot[,5])))
      mtext("Calinski-Harabasz Index", side=4, line=3)
    par(std.par)
    ############
    dindex
}

With the following results:

statiner.hclust(linnerud,lin.clust)

##    ngr  g1  g2    index       bss        wss       Dwss       D2wss       CH max
## 1   19 -18 -19   9.7980 137583.50     48.000    69.0000     58.0000 477.7205    
## 2   18  -1  -5  11.7473 137514.50    117.000   127.0000     80.5000 276.5500    
## 3   17  -8 -11  15.9374 137387.50    244.000   207.5000     30.0000 175.9574    
## 4   16 -12 -13  20.3715 137180.00    451.500   237.5000      9.5000 121.5327    
## 5   15  -2  -6  21.7945 136942.50    689.000   247.0000     17.3333  99.3777    
## 6   14 -15 -17  22.2261 136695.50    936.000   264.3333    316.1667  89.8721    
## 7   13  -4   3  22.9928 136431.17   1200.333   580.5000    213.9167  85.2458    
## 8   12  -7   5  34.0735 135850.67   1780.833   794.4167    472.7500  69.3499    
## 9   11 -20   7  39.8602 135056.25   2575.250  1267.1667    676.8333  57.6883    
## 10  10 -16   4  50.3422 133789.08   3842.417  1944.0000    327.0000  46.4253    
## 11   9  -9   1  62.3538 131845.08   5786.417  2271.0000    876.6667  37.0261    
## 12   8  -3   8  67.3944 129574.08   8057.417  3147.6667    481.4583  32.1627    
## 13   7 -14   6  79.3431 126426.42  11205.083  3629.1250    232.5417  28.2074    
## 14   6   9  12  85.1954 122797.29  14834.208  3861.6667    623.3583  26.4895    
## 15   5  10  11  87.8825 118935.62  18695.875  4485.0250   5915.4853  27.0368    
## 16   4   2  14  94.7103 114450.60  23180.900 10400.5103  13177.7278  29.6237   *
## 17   3  13  16 144.2256 104050.09  33581.410 23578.2381  56893.6136  29.4352    
## 18   2 -10  15 217.1554  80471.85  57159.648 80471.8516 -80471.8516  28.1569    
## 19   1  17  18 401.1779      0.00 137631.500     0.0000      0.0000   0.0000

Finally, we plot the dendrogram with the classes marked:

  plot(lin.clust,hang=-1)
    rect.hclust(lin.clust,k=4)
    abline(h=120,col="red")

# the following is to get numbers close to the nodes
# hiera  <- series.hfc$hier
# lab    <- c("*06*","*07*","*08*","*09*")
# xcoor  <- c(1.5,2.25,4.5,3.375)
# plot(as.dendrogram(hiera), main = paste(namedata,"- HFC Dendrogram"),
#      ylab="Index", ylim=c(0,1.2), cex=0.7)
# text(xcoor, hiera$height, pos=3, labels=lab, cex=0.7, offset=0.2)
# abline(h=0.7, col="red")

Now we apply a hierarchical clustering to the Ellenberg data. We first include the functions for computing the similarity matrix.

matsim                  <- function(data,simil=NULL,rows=FALSE) {
  if(rows) {
    data                <- t(as.matrix(data))                                   # if by rows, transpose
  }
  nr                    <- dim(data)[1]                                         # number of rows
  nc                    <- dim(data)[2]                                         # number of columns
  namcols               <- colnames(data)                                       # column names
  ass                   <- matrix(0,nrow=nc,ncol=nc, dimnames = list(namcols,namcols)) # empty matrix
  for (i in (1:nc)) {                                                           # double loop over columns
    for (j in (1:i)) {                                                          # to fill half of the matrix
      a                   <- 0                                                  # initialization
      b                   <- 0
      c                   <- 0
      d                   <- 0
      for (k in 1:nr) {                                                         # loop over rows
        if (data[k,i] == 1){
          if (data[k,j] == 1){
            a             <- a+1                                                # a  (p/p) 
          } else {
            b             <- b+1                                                # b  (p/a) 
          }  
        } else {
          if (data[k,j] == 1){
            c             <- c+1                                                # c (a/p)
          } else {
            d             <- d+1                                                # d (a/a)
          }  
        }
      }
      ass[i,j]            <- similitud(a,b,c,d,simil)                           # store the value in ass
      if (i!=j) ass[j,i]  <- ass[i,j]                                           # mirror to the other half
    }
  }                       
  ass
}

The similitud function computes the chosen similarity coefficient.

similitud                 <- function(a,b,c,d,simil) {
  if        (simil=="sorensen") {
    sim                   <- 2*a/(2*a+b+c)                                      # Sorensen
  } else if (simil=="vandermaarel") {
    sim         <- (2*a-(b+c))/(2*a+(b+c))                                      # Van der Maarel
  } else {
    stop("unknown similarity: ", simil)                                         # guard against typos
  }
  sim
}
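
Note that the two coefficients are affinely related, so they induce the same ordering of the pairs: writing \(S\) for the Sorensen coefficient,

\[ \frac{2a-(b+c)}{2a+(b+c)} = \frac{4a-(2a+b+c)}{2a+b+c} = 2S - 1. \]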

Here we try complete linkage with the Sorensen index.

ell                <- read.csv("Ellenberg.csv",header=TRUE,row.names=1)
ell.sor            <- matsim(ell,simil="sorensen")
eld                <- as.dist(1-ell.sor)
ell.hcl            <- hclust(eld,method="complete")
st.cl              <- stat.hclust(ell.hcl)     

st.cl
##    ngr  g1  g2  index      D      D2
## 1   24  -5 -13 0.1594 0.0406 -0.0167
## 2   23  -1  -4 0.2000 0.0239 -0.0097
## 3   22 -17 -21 0.2239 0.0142 -0.0023
## 4   21 -22   1 0.2381 0.0119 -0.0119
## 5   20  -6   4 0.2500 0.0000  0.0167
## 6   19 -20   3 0.2500 0.0167 -0.0141
## 7   18 -15 -24 0.2667 0.0026 -0.0020
## 8   17 -11 -19 0.2692 0.0006  0.0410
## 9   16  -9 -10 0.2698 0.0416 -0.0406
## 10  15 -23   5 0.3115 0.0010  0.0198
## 11  14  -8 -12 0.3125 0.0208 -0.0104
## 12  13  -3 -16 0.3333 0.0104 -0.0104
## 13  12  -7  10 0.3438 0.0000  0.0122
## 14  11   6  11 0.3438 0.0122 -0.0014
## 15  10   7   9 0.3559 0.0107  0.0364
## 16   9  13  14 0.3667 0.0471 -0.0234
## 17   8 -25   8 0.4138 0.0237 -0.0237
## 18   7   2  12 0.4375 0.0000  0.0240
## 19   6 -14  16 0.4375 0.0240 -0.0041
## 20   5  -2 -18 0.4615 0.0199 -0.0105
## 21   4  15  18 0.4815 0.0094  0.0923
## 22   3  17  19 0.4909 0.1017  0.0148
## 23   2  20  21 0.5926 0.1165 -0.1165
## 24   1  22  23 0.7091 0.0000  0.0000
plot(ell.hcl,hang=-1)
  abline(h=.39,col="blue")
  abline(h=.55,col="red")

# the cut line should be drawn at the average of the node heights
ell.hcl            <- hclust(eld,method="single")
st.cl              <- stat.hclust(ell.hcl)     

st.cl
##    ngr  g1  g2  index      D      D2
## 1   24  -5 -13 0.1594 0.0290 -0.0174
## 2   23  -6   1 0.1884 0.0116 -0.0047
## 3   22  -1  -4 0.2000 0.0069 -0.0007
## 4   21 -22   2 0.2069 0.0062  0.0045
## 5   20 -23   4 0.2131 0.0108 -0.0013
## 6   19 -17 -21 0.2239 0.0095 -0.0047
## 7   18   5   6 0.2333 0.0048  0.0030
## 8   17 -20   7 0.2381 0.0078 -0.0037
## 9   16 -11   8 0.2459 0.0041  0.0074
## 10  15  -7   9 0.2500 0.0115 -0.0064
## 11  14  -8  10 0.2615 0.0051 -0.0026
## 12  13 -15 -24 0.2667 0.0026 -0.0020
## 13  12 -19  11 0.2692 0.0006 -0.0006
## 14  11 -12  13 0.2698 0.0000  0.0132
## 15  10  -9 -10 0.2698 0.0132  0.0153
## 16   9 -18  14 0.2830 0.0285 -0.0256
## 17   8   3  12 0.3115 0.0028  0.0043
## 18   7 -16  16 0.3143 0.0071 -0.0065
## 19   6  17  18 0.3214 0.0006  0.0107
## 20   5  15  19 0.3220 0.0113 -0.0113
## 21   4  -3  20 0.3333 0.0000  0.0109
## 22   3 -25  21 0.3333 0.0109  0.0266
## 23   2 -14  22 0.3443 0.0376 -0.0376
## 24   1  -2  23 0.3818 0.0000  0.0000
plot(ell.hcl,hang=-1)
  abline(h=.30,col="blue")
  abline(h=.328,col="red")

# the cut line should be drawn at the average of the node heights
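
One way to place that line automatically (a sketch with a hypothetical helper; for \(k \ge 2\) the cut falls between the last two merge heights):

cut.height         <- function(hc, k) {                                         # hypothetical helper
  n                <- length(hc$height) + 1                                     # number of units
  mean(hc$height[c(n - k, n - k + 1)])                                          # midpoint between the two merges
}
plot(ell.hcl, hang = -1)
abline(h = cut.height(ell.hcl, 2), col = "red")                                 # cut into 2 classes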