Data Reduction

UVA Biostatistics Discussion Board: Regression Modeling Strategies: Data Reduction
By Frank E Harrell Jr (Feh3k) on Tuesday, April 29, 2003 - 08:48 pm:

Variable clustering: Osman asked the following:

Can I use varclus() to measure and plot similarity for nominal variables (4 to 5 levels)? If yes, what should the similarity parameter be (is it Hoeffding?), and should I change the method (i.e., compact vs. average) used by hclust()? In other words, does varclus have something equivalent to percent agreement or a kappa statistic?

Answer: I would tend to use Spearman instead of Hoeffding for that case, although Spearman is not ideal. It may work well enough. I don't know of one measure that is equally good for continuous and categorical variables. If all of your variables were categorical, a new similarity measure could be added to varclus that would work a little better for them. For categorical variables, varclus generates all the dummy variables and calculates similarities separately for the dummies. It would be better to compute a single similarity (say, one related to the Pearson chi-square) for all the levels of the categorical variable combined.

As for which clustering algorithm works best for certain types of variables, I have not compared the various algorithms.
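
To make the chi-square suggestion above concrete, here is a minimal sketch of one possible single-number similarity for a pair of categorical variables: Cramér's V, which rescales the Pearson chi-square to the range 0 to 1. This is only an illustration in Python, outside R and Hmisc. It is not something varclus computes, and the choice of Cramér's V, the function name cramers_v, and the example data are all assumptions made here.

```python
from collections import Counter
from math import sqrt


def cramers_v(x, y):
    """Cramér's V for two categorical variables, based on the Pearson chi-square.

    Hypothetical helper for illustration; not part of varclus or Hmisc.
    Assumes x and y are equal-length sequences with no missing values.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    joint = Counter(zip(x, y))  # observed cell counts
    row = Counter(x)            # marginal counts for x
    col = Counter(y)            # marginal counts for y

    # Pearson chi-square over every cell of the cross-tabulation.
    chi2 = 0.0
    for a in row:
        for b in col:
            expected = row[a] * col[b] / n
            observed = joint.get((a, b), 0)
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(row), len(col)) - 1
    if k == 0:
        # A constant variable carries no association information;
        # returning 0 is a convention chosen for this sketch.
        return 0.0
    return sqrt(chi2 / (n * k))


# Made-up example data: y is a relabelling of x, z is unrelated to w.
x = ["a", "a", "b", "b", "c", "c"]
y = ["u", "u", "v", "v", "w", "w"]
w = ["a", "a", "b", "b"]
z = ["u", "v", "u", "v"]

print(cramers_v(x, y))  # perfect association, so close to 1
print(cramers_v(w, z))  # no association, so 0
```

A matrix of such pairwise values could in principle be handed to a hierarchical clustering routine as a similarity matrix, which treats each categorical variable as a whole rather than as separate dummy variables.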

By Frank E Harrell Jr (Feh3k) on Thursday, May 01, 2003 - 08:02 pm:

varclus for analysis of agreement of measurements: Osman Al-Radi asked:

300 patients had 1 to 4 different tests to measure myocardial contractility. Contractility is measured on a scale of 1 to 4 (4 being worst). I used varclus as a tool for data reduction in a multivariable model.

However, my question is:
How can I use varclus to also report on the degree of inter-test agreement (similarity)? The standard varclus output does not produce classical measures like percent agreement or the kappa statistic with p-values.

If varclus is not appropriate for this, are there functions I can use?

Answer:

The final output of varclus is too unstructured for that purpose, but you might use the command varclus(...., similarity='bothpos' or 'ccbothpos')$sim to print the whole similarity matrix for one of those two similarities, which are suitable for pairwise agreement of binary variables. 'ccbothpos' is not kappa but is a similar chance-corrected measure.
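
To show how the two similarities mentioned above relate to percent agreement and kappa, here is a minimal Python sketch for a single pair of binary variables. It assumes, based on my reading of the Hmisc documentation, that 'bothpos' is the proportion of observations in which both variables are positive and that 'ccbothpos' is that proportion minus the product of the two marginal proportions. The function names and the data are hypothetical, and the sketch assumes both tests were observed on every patient. A 1 to 4 score would first have to be dichotomized, which is not shown here.

```python
def both_positive(x, y):
    """Proportion of observations where both binary variables are positive."""
    return sum(1 for a, b in zip(x, y) if a and b) / len(x)


def cc_both_positive(x, y):
    """Chance-corrected version: both-positive proportion minus the
    product of the two marginal proportions (0 expected under independence)."""
    n = len(x)
    px = sum(1 for a in x if a) / n
    py = sum(1 for b in y if b) / n
    return both_positive(x, y) - px * py


def percent_agreement(x, y):
    """Proportion of observations on which the two variables agree."""
    return sum(1 for a, b in zip(x, y) if a == b) / len(x)


def cohen_kappa(x, y):
    """Cohen's kappa: agreement corrected for chance agreement."""
    n = len(x)
    p_obs = percent_agreement(x, y)
    categories = set(x) | set(y)
    p_exp = sum((x.count(c) / n) * (y.count(c) / n) for c in categories)
    if p_exp == 1:
        raise ValueError("kappa is undefined when both variables are constant")
    return (p_obs - p_exp) / (1 - p_exp)


# Made-up binary results (1 = abnormal) for two hypothetical tests.
scores_a = [1, 1, 1, 0, 0, 0, 1, 0]
scores_b = [1, 1, 0, 0, 0, 1, 1, 0]

print(both_positive(scores_a, scores_b))      # 0.375
print(cc_both_positive(scores_a, scores_b))   # 0.125
print(percent_agreement(scores_a, scores_b))  # 0.75
print(cohen_kappa(scores_a, scores_b))        # 0.5
```

Both the chance-corrected both-positive measure and kappa subtract what would be expected under independence, but kappa also counts agreement on negatives and rescales by the maximum possible excess agreement, so the two numbers are not interchangeable.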