R: Plot the distribution of predictions for each class

plotd {earth}

R Documentation

Description

Draw a plot of the distribution of the predicted values for each class. This function can be used for earth models, and also for models built by lm, glm, lda, etc.

Usage

plotd(object, hist = FALSE, type = NULL, nresponse = NULL, dichot = FALSE,
      trace = FALSE, xlim = NULL, ylim = NULL, jitter = FALSE, main=NULL,
      xlab = "Predicted Value", ylab = if(hist) "Count" else "Density",
      lty = 1, col = c("gray70", 1, "lightblue", "brown", "pink", 2, 3, 4),
      fill = if(hist) col[1] else 0,
      breaks = "Sturges", labels = FALSE,
      kernel = "gaussian", adjust = 1, zero.line = FALSE,
      legend = TRUE, legend.names = NULL, legend.pos = NULL,
      cex.legend = .8, legend.bg = "white", legend.extra = FALSE,
      vline.col = 0, vline.thresh = .5, vline.lty = 1, vline.lwd = 1,
      err.thresh = vline.thresh, err.col = 0, err.border = 0, err.lwd = 1,
      xaxt = "s", yaxt = "s", xaxis.cex = 1, sd.thresh = 0.01, ...)

Arguments

To start off, look at the arguments object, hist, and type.
For predict methods with multiple column responses, see the nresponse argument.
For factor responses with more than two levels, see the dichot argument.

`object`	Model object. Typically a model which predicts a class or a class discriminant.
`hist`	`FALSE` (default) to call `density` internally. `TRUE` to call `hist` internally.
`type`	Type parameter passed to `predict`. For allowed values see the `predict` method for your `object` (such as `predict.earth`). By default, `plotd` tries to automatically select a suitable value for the model in question. (This is `"response"` for all objects except `rpart` models, where `"vector"` is used. The choices will often be inappropriate.) Typically you would set `hist=TRUE` when `type="class"`.
`nresponse`	Which column to use when `predict` returns multiple columns. This can be a column index or a column name (which may be abbreviated; partial matching is used). The default is `NULL`, meaning use all columns of the predicted response.
`dichot`	Dichotomise the predicted response. This argument is ignored except for models where the observed response is a factor with more than two levels and the predicted response is a numeric vector. The default `FALSE` separates the response into a group for each factor level. With `dichot=TRUE` the response is separated into just two groups: the first level of the factor versus the remaining levels.
`trace`	Default `FALSE`. Use `TRUE` or `1` to trace `plotd`, which is useful for seeing how `plotd` partitions the predicted response into classes. Use `2` for more details.
`xlim`	Limits of the x axis. The default `NULL` means determine these limits automatically, else specify `c(xmin,xmax)`.
`ylim`	Limits of the y axis. The default `NULL` means determine these limits automatically, else specify `c(ymin,ymax)`.
`jitter`	Jitter the histograms or densities horizontally to minimize overplotting. Default `FALSE`. Specify `TRUE` to automatically calculate the jitter, else specify a numeric jitter value.
`main`	Main title. Values: `"string"` uses that string as the title; `""` gives no title; `NULL` (default) generates a title from the call.
`xlab`	x axis label. Default is `"Predicted Value"`.
`ylab`	y axis label. Default is `if(hist) "Count" else "Density"`.
`lty`	Per class line types for the plotted lines. Default is 1 (which gets recycled for all lines).
`col`	Per class line colors. The first few colors of the default are intended to be easily distinguishable on both color displays and monochrome printers.
`fill`	Fill color for the plot for the first class. For `hist=FALSE`, the default is 0, i.e., no fill. For `hist=TRUE`, the default is the first element in the `col` argument.
`breaks`	Passed to `hist`. Only used if `hist=TRUE`. Default is `"Sturges"`. When `type="class"`, setting `breaks` to a low number can be used to widen the histogram bars.
`labels`	`TRUE` to draw counts on the `hist` plot. Only used if `hist=TRUE`. Default is `FALSE`.
`kernel`	Passed to `density`. Only used if `hist=FALSE`. Default is `"gaussian"`.
`adjust`	Passed to `density`. Only used if `hist=FALSE`. Default is `1`.
`zero.line`	Passed to `plot.density`. Only used if `hist=FALSE`. Default is `FALSE`.
`legend`	`TRUE` (default) to draw a legend, else `FALSE`.
`legend.names`	Class names in legend. The default `NULL` means determine these automatically.
`legend.pos`	Position of the legend. The default `NULL` means position the legend automatically, else specify `c(x,y)`.
`cex.legend`	`cex` for `legend`. Default is `.8`.
`legend.bg`	`bg` color for `legend`. Default is `"white"`.
`legend.extra`	Show (in the legend) the number of occurrences of each class. Default is `FALSE`.
`vline.thresh`	Horizontal position of optional vertical line. Default is `0.5`. The vertical line is intended to indicate class separation. If you use this, don't forget to set `vline.col`.
`vline.col`	Color of vertical line. Default is 0, meaning no vertical line.
`vline.lty`	Line type of vertical line. Default is `1`.
`vline.lwd`	Line width of vertical line. Default is `1`.
`err.thresh`	x axis value specifying the error shading threshold. See `err.col`. Default is `vline.thresh`.
`err.col`	Specify up to three colors to shade the "error areas" of the density plot. The default is `0`, meaning no error shading. This argument is ignored unless `hist=FALSE`. If there are more than two classes, `err.col` uses only the first two. This argument is best explained by running an example:

      data(etitanic)
      earth.mod <- earth(survived ~ ., data=etitanic)
      plotd(earth.mod, vline.col=1, err.col=c(2,3,4))

The three areas are (i) the error area to the left of the threshold, (ii) the error area to the right of the threshold, and (iii) the reducible error area. If fewer than three values are specified, `plotd` re-uses values in a sensible manner. Use values of `0` to skip areas. Disjoint regions are not handled well by the current implementation.
`err.border`	Borders around the error shading. Default is `0`, meaning no borders, else specify up to three colors.
`err.lwd`	Line widths of borders of the error shading. Default is `1`, else specify up to three line widths.
`xaxt`	Default is `"s"`. Use `xaxt="n"` for no x axis.
`yaxt`	Default is `"s"`. Use `yaxt="n"` for no y axis.
`xaxis.cex`	Only used if `hist=TRUE` and `type="class"`. Specify size of class labels drawn on the x axis. Default is 1.
`sd.thresh`	Minimum acceptable standard deviation for a density. Default is `0.01`. Densities with a standard deviation less than `sd.thresh` will not be plotted (a warning will be issued and the legend will say `"not plotted"`).
`...`	Extra arguments passed to the predict method for the object.
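
As an illustration of the `nresponse` argument described above, the following sketch selects one column of a multiple-column prediction by index or by an abbreviated name. It is only a sketch of the idea: the matrix `yhat` is toy data, `pick.column` is a hypothetical helper that is not part of earth, and the use of `pmatch` is an assumption about how partial matching could be done, not a description of plotd's internals.

```r
# Sketch of selecting one column of a multi-column prediction by index or by
# an abbreviated name. Assumption: a toy matrix stands in for the predictions.
yhat <- cbind(setosa     = c(0.9, 0.1),
              versicolor = c(0.1, 0.6),
              virginica  = c(0.0, 0.3))

# Hypothetical helper, not part of earth.
pick.column <- function(yhat, nresponse) {
    if (is.character(nresponse)) {
        # pmatch returns NA if the abbreviation is ambiguous or unmatched
        index <- pmatch(nresponse, colnames(yhat))
        if (is.na(index))
            stop("nresponse does not uniquely match a column name")
        nresponse <- index
    }
    yhat[, nresponse]
}

pick.column(yhat, "vers")   # same column as pick.column(yhat, 2)
```

An abbreviation such as `"v"` is ambiguous here (it matches two column names), so the helper stops with an error instead of guessing.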

Note

This function calls predict with the data originally used to build the model, and with the type specified above. It then separates the predicted values into classes, where the class for each predicted value is determined by the class of the observed response. Finally, it calls density (or hist if hist=TRUE) for each class-specific set of values, and plots the results.
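
The steps just described can be sketched in base R. This is a minimal illustration of the idea and not plotd's actual code: the glm model, the built-in mtcars data, and all variable names are assumptions chosen for the example, and no legend, jitter, or error shading is drawn.

```r
# Sketch of the predict / split-by-class / density steps (not plotd's code).
# Assumption: a binomial glm on the built-in mtcars data stands in for the model.
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# 1. Predict on the data originally used to build the model.
yhat <- predict(fit, type = "response")

# 2. Partition the predictions by the class of the observed response.
observed <- factor(mtcars$am, labels = c("automatic", "manual"))
by.class <- split(yhat, observed)

# 3. Estimate one density per class.
dens <- lapply(by.class, density)

# 4. Plot the densities on common axes.
xlim <- range(sapply(dens, function(d) range(d$x)))
ylim <- c(0, max(sapply(dens, function(d) max(d$y))))
plot(dens[[1]], xlim = xlim, ylim = ylim, main = "", xlab = "Predicted Value")
lines(dens[[2]], col = "gray70")
```

Replacing `density` with `hist` in step 3 gives the histogram variant that `hist=TRUE` refers to.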

This function estimates distributions with the density and hist functions, and also calls plot.density and plot.histogram. For an overview see Venables and Ripley MASS section 5.6.

Partitioning the response into classes

Considerable effort is made to partition the predicted response into classes in a sensible way. This is not always possible for multiple column responses and the nresponse argument should be used where necessary. The partitioning details depend on the types and numbers of columns in the observed and predicted responses. These in turn depend on the model object and the type argument.

Use the trace argument to see how plotd partitions the response for your model.

Degenerate densities

A message such as
Warning: standard deviation of "male" density is 0, density is degenerate?
means that the density for that class will not be plotted (the legend will say "not plotted").

Set sd.thresh=0 to get rid of this check, but be aware that histograms (and sometimes x axis labels) for degenerate densities will be misleading.
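
The kind of check that sd.thresh controls can be sketched as follows. This is an assumption about the general approach, not plotd's code: the predictions are toy values, and the class "male" is deliberately constant so that its standard deviation is zero.

```r
# Sketch of a standard-deviation check like the one sd.thresh describes.
# Assumption: toy predictions; class "male" is deliberately constant.
sd.thresh <- 0.01
by.class <- list(female = c(0.61, 0.72, 0.80, 0.93),
                 male   = c(0.20, 0.20, 0.20, 0.20))

# Standard deviation of the predicted values in each class.
sds <- sapply(by.class, sd)
degenerate <- sds < sd.thresh

# Report the classes that would be skipped.
for (name in names(by.class)[degenerate])
    message("standard deviation of \"", name, "\" is ", sds[[name]],
            ", density not plotted")

# Estimate densities only for the remaining classes.
dens <- lapply(by.class[!degenerate], density)
```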

Using plotd for various models

This function is included in the earth package but can also be used with other models.

Example with glm:

      library(earth); data(etitanic)
      glm.model <- glm(sex ~ ., data=etitanic, family=binomial)
      plotd(glm.model)

Example with lm:

      library(earth); data(etitanic)
      lm.model <- lm(as.numeric(sex) ~ ., data=etitanic)
      plotd(lm.model)

Using plotd with lda or qda

The plotd function has special handling for lda (and qda) objects. For such objects, the type argument can take one of the following values:

"response" (default) linear discriminant
"ld" same as "response"
"class" predicted classes
"posterior" posterior probabilities

Example:

    library(MASS); library(earth); data(etitanic)
    lda.model <- lda(sex ~ ., data=etitanic)
    plotd(lda.model) # linear discriminant by default
    plotd(lda.model, type="class", hist=TRUE, labels=TRUE)

The type argument is processed internally by plotd and is not passed to predict.lda; it is used merely to select fields in the list returned by predict.lda. The type names can be abbreviated down to a single character.
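
Selecting a field of the list returned by predict.lda from a possibly abbreviated type can be sketched as below. The helper `select.field` is hypothetical and is not plotd's implementation; the two-class subset of the built-in iris data is likewise an assumption made to keep the example self-contained.

```r
library(MASS)

# Assumption: a two-class subset of iris stands in for the model data.
two.class <- droplevels(subset(iris, Species != "setosa"))
lda.model <- lda(Species ~ ., data = two.class)
pred <- predict(lda.model)   # a list with fields class, posterior, and x

# Hypothetical helper: map a (possibly abbreviated) type to a field of pred.
select.field <- function(pred,
                         type = c("response", "ld", "class", "posterior")) {
    type <- match.arg(type)  # allows abbreviation down to one character
    switch(type,
           response  = ,     # falls through to "ld"
           ld        = pred$x,
           class     = pred$class,
           posterior = pred$posterior)
}

head(select.field(pred, "p"))   # posterior probabilities
head(select.field(pred, "c"))   # predicted classes
```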

For objects created with lda.matrix (as opposed to lda.formula), plotd blindly assumes that the grouping argument was the second argument.

plotd does not yet support objects created with lda.data.frame.

For lda responses with more than two factor levels, use the nresponse argument to select a column in the predicted response. Thus with the default type=NULL (which plotd automatically converts to type="response"), use nresponse=1 to select just the first linear discriminant. The default nresponse=NULL selects all columns, which is typically not what you want for lda models. Example:

    library(MASS); library(earth)
    set.seed(1)      # optional, for reproducibility
    example(lda)     # creates a model called "z"
    plot(z, dimen=1) # invokes plot.lda from the MASS package
    plotd(z, nresponse=1, hist=TRUE) # equivalent using plotd
                                     # nresponse=1 selects first linear discr.

The dichot=TRUE argument is also useful for lda responses with more than two factor levels.
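
What dichotomising a multi-level factor means can be sketched in base R. This only illustrates the grouping (first level versus the remaining levels) and is not plotd's code; the iris data and the label "not setosa" are assumptions made for the example.

```r
# Sketch of dichotomising a multi-level factor: first level versus the rest.
# Assumption: the built-in iris data, whose Species factor has three levels.
observed <- iris$Species
first <- levels(observed)[1]
rest  <- paste("not", first)   # hypothetical label for the remaining levels

two.groups <- factor(ifelse(observed == first, first, rest),
                     levels = c(first, rest))
table(two.groups)
```

The predicted values would then be split by `two.groups` rather than by the original three-level factor.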

TODO

Handle degenerate densities in a more useful way.
Add freq argument for hist.

Examples

if (require(earth)) {
    old.par <- par(no.readonly=TRUE)  # save graphics settings for later restore
    par(mfrow=c(2,2), mar=c(4, 3, 1.7, 0.5), mgp=c(1.6, 0.6, 0), cex=0.8)
    data(etitanic)
    mod <- earth(survived ~ ., data=etitanic, degree=2, glm=list(family=binomial))

    plotd(mod)

    plotd(mod, hist=TRUE, legend.pos=c(.25,220))

    plotd(mod, hist=TRUE, type="class", labels=TRUE, xlab="", xaxis.cex=.8)

    par(old.par)
}