Re: To statisticians : Cluster Analysis
job.alerte@GMAIL.COM wrote back:
>Dear experts !
>Here are a few more details on my data and the aim of the study
>(it's a little more complex):
Yes, they're always a *little* more complex. Where 'little' is somewhere
between 'this will add a month to the job' and 'this is going to take the
ghost of Paul Erdos to work out the theory for it'. :-)
>- I defined two responsiveness criteria to treatment, say A and B, both
>binary Yes / No: one derived from a continuous measurement, the other
>one derived from a quality of life scale. Then, A and B are known for
>all of the patients.
I think that you might be better off if you did NOT turn these into
categorical variables here. Use as much information as possible, and
use continuous variables when you're doing cluster analysis.
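
To make the point concrete, here is a rough sketch in Python (not SAS) of what a Yes/No cut does to distances between patients. The scores and the cutoff of 5.0 are made-up values for illustration, not anything from the trial.

```python
# Hypothetical continuous responses for five patients (made-up values).
scores = [0.2, 4.9, 5.1, 9.8, 5.0]
CUTOFF = 5.0  # assumed threshold, not taken from the trial protocol


def dichotomize(x, cutoff=CUTOFF):
    """Return 1 ('Yes') if x is at or above the cutoff, else 0 ('No')."""
    return 1 if x >= cutoff else 0


def distance(a, b):
    """Absolute difference, i.e. Euclidean distance in one dimension."""
    return abs(a - b)


binary = [dichotomize(x) for x in scores]

# Patients 1 and 2 (4.9 vs 5.1) are nearly identical on the raw scale,
# but they land in different categories after the cut.
raw_close = distance(scores[1], scores[2])
bin_close = distance(binary[1], binary[2])

# Patients 0 and 1 (0.2 vs 4.9) are far apart on the raw scale,
# yet they share a category after the cut.
raw_far = distance(scores[0], scores[1])
bin_far = distance(binary[0], binary[1])

print(binary, raw_close, bin_close, raw_far, bin_far)
```

After the cut, the near-identical pair looks maximally different and the distant pair looks identical. A clustering procedure only sees the distances it is given, so the ordering it works from has been reversed.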
>- I've got about 150 patients, on which X1, X2, .... , X15 were
>measured, all of them either categorical or categorized to facilitate
>odds-ratios estimations and further classification.
Okay, categorizing can make sense when you need to interpret
ORs, and particularly when you need to explain ORs to other people.
For the cluster analysis, I would go back to the uncategorized
variables whenever possible.
>- I've already planned 2 logistic regression in order to determine
>which of these factors improve responsiveness, which obstruct it and
>which have no impact: one model for each responsiveness criterion.
>- This drug had already shown efficacy for A, but not for B in
>- The aim of the cluster analysis is to identify a subgroup of patients
>who are most likely to show correct responsiveness for both A and B,
>then to describe them. I agree that, with only one responsiveness
>criterion, a cluster analysis would not be of interest because groups
>were already formed since I defined a "diagnostic" variable "Responsive
>/ Not responsive". But here, I've got two criteria and I think it makes sense.
>I put in the cluster analysis the predictors I put in the regression
>models, regardless of the significance of effect.
>I'm somewhat restricted because these analyses were required by the
>protocol of the trial and we're too close to the deadline to amend it
>now. But if I had to decide on my own, I would have planned a multivariate
>regression, and component / correspondence analyses, ...
>What is your opinion ?
Well, a cluster analysis might help you here. A factor analysis or
component analysis might help also. If you want a single descriptor of
'correct A and B' vs. not, a principal component may be a lot more
useful than the cluster analysis. Or else you're going to end up classifying your
points as 'correct A and B' vs. not, and then performing another analysis
to describe that behavior.
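
As a rough illustration of the single-descriptor idea, here is a minimal Python sketch (not SAS) that computes first-principal-component scores for two continuous responses. It assumes A and B are both kept continuous and both standardized, in which case the 2x2 correlation matrix has a closed-form eigen-decomposition. The function names and the data are hypothetical.

```python
import math
from statistics import mean, pstdev


def standardize(values):
    """Center to mean 0 and scale to (population) standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def first_pc_scores(a, b):
    """First principal component of two standardized variables.

    For the correlation matrix [[1, r], [r, 1]] the eigenvalues are
    1 + r and 1 - r, with eigenvectors (1, 1)/sqrt(2) and (1, -1)/sqrt(2).
    The leading component therefore has weights (1, sign(r))/sqrt(2) and
    explains (1 + |r|) / 2 of the total variance.
    """
    za, zb = standardize(a), standardize(b)
    r = mean([x * y for x, y in zip(za, zb)])  # Pearson correlation
    sign = 1.0 if r >= 0 else -1.0
    w = 1.0 / math.sqrt(2.0)
    scores = [w * x + sign * w * y for x, y in zip(za, zb)]
    explained = (1.0 + abs(r)) / 2.0
    return scores, explained


# Made-up continuous responses for five patients.
resp_a = [1.0, 2.0, 3.0, 4.0, 5.0]
resp_b = [2.0, 1.0, 4.0, 3.0, 5.0]

pc_scores, share = first_pc_scores(resp_a, resp_b)
print(pc_scores, share)
```

When A and B are positively correlated, a high score means a patient is high on both, so the score can be carried into a further analysis as one continuous descriptor. If the correlation is weak, the first component explains little more than half the variance and is not much of a summary.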
You might want to start out with a simple plot, with A and B on your axes,
and the patients plotted out. You should be able to see which patients
are ending up in the right quadrant of your graph. Then you can check that
your analysis is giving you meaningful results. And if there is nothing
in this graph, then you may not be able to get useful information out of
an analysis. One of my mottos is:
"If you can see it in a graph, you should be able to find it in an analysis.
If you can find it in an analysis, you should be able to see it in a graph."
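
A numerical companion to that plot is a simple count of patients per quadrant. The sketch below is Python (not SAS), and the patient values and both cutoffs are invented for illustration; the real thresholds would come from your responsiveness definitions.

```python
from collections import Counter


def quadrant(a, b, a_cut, b_cut):
    """Label a patient by which side of each cutoff they fall on."""
    a_side = "A+" if a >= a_cut else "A-"
    b_side = "B+" if b >= b_cut else "B-"
    return f"{a_side}/{b_side}"


# Hypothetical (A, B) measurements for five patients.
patients = [(6.1, 70), (3.2, 40), (7.5, 35), (2.8, 80), (8.0, 90)]
A_CUT, B_CUT = 5.0, 60  # assumed cutoffs, not from the protocol

counts = Counter(quadrant(a, b, A_CUT, B_CUT) for a, b in patients)
for label in ("A+/B+", "A+/B-", "A-/B+", "A-/B-"):
    print(label, counts[label])
```

If the "A+/B+" cell is nearly empty, or the four cells look like what independence would give you, that is a warning that a later analysis may have little to find.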
>Does Ward's method perform well on a set of categorical descriptors ?
>What other distance would be suitable ? I was very surprised to not find the
>Chi2 distance in the methods proposed by Proc Cluster (option method =
>) whereas it seems a quite simple and natural similarity measurement...
Ward's is not designed for categorical descriptors. It basically
assumes multivariate normality. I don't recommend it.
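
To show why, here is a minimal Python sketch (not SAS, and not how Proc Cluster is implemented) of the quantity Ward's method minimizes at each step: the increase in within-cluster sum of squares caused by a merge. The helper names and the toy clusters are hypothetical.

```python
def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def within_ss(points):
    """Sum of squared Euclidean distances from each point to the centroid."""
    c = centroid(points)
    return sum(sum((x - ci) ** 2 for x, ci in zip(p, c)) for p in points)


def ward_cost(cluster_a, cluster_b):
    """Increase in total within-cluster SS if the two clusters are merged."""
    merged = cluster_a + cluster_b
    return within_ss(merged) - within_ss(cluster_a) - within_ss(cluster_b)


def ward_cost_shortcut(cluster_a, cluster_b):
    """Same quantity: n_a * n_b / (n_a + n_b) * ||c_a - c_b||^2."""
    na, nb = len(cluster_a), len(cluster_b)
    ca, cb = centroid(cluster_a), centroid(cluster_b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return na * nb / (na + nb) * sq_dist


# Toy clusters in two continuous dimensions.
cluster_a = [[0.0, 0.0], [2.0, 0.0]]
cluster_b = [[4.0, 4.0]]

cost = ward_cost(cluster_a, cluster_b)
shortcut = ward_cost_shortcut(cluster_a, cluster_b)
print(cost, shortcut)
```

Everything here is a mean or a squared Euclidean distance. Those are meaningful for continuous measurements, but the mean of arbitrary category codes (1 = "mild", 2 = "moderate", 3 = "other") is not, which is the practical problem with feeding categorical descriptors to Ward's.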
What do you mean by the 'Chi2' distance? Do you mean simple Euclidean distance?
>Other questions :
>1 / What about when the descriptors include both categorical and continuous variables ?
>Do we have to define a distance for each continuous - continuous /
>continuous - categorical / categorical -categorical type of combination
>and carry out the cluster analysis from a table of distances ?
>2 / What about the QoL scales ? Should they be analysed as continuous variables ?
>Thanks in advance for your insights !
You can use a mixture of continuous and discrete variables, but you
will be happier if you step back to as many continuous variables as possible.
If you have a mixture of these, then there is no simple way to define
one distance for one class of variables and another distance for a second
class of variables. Cluster analysis doesn't work that way.
David L. Cassell
3115 NW Norwood Pl.
Corvallis OR 97330