Creating categorical variables out of continuous data — envDataCategorize • HaplinMethyl

This function prepares the environmental data to be used in stratification when calling haplinStrat.

Usage

envDataCategorize(
  env.data = stop("You didn't provide the environmental data!", call. = FALSE),
  summary.method = "sum",
  breaks,
  file.out = "env_data_cat",
  dir.out = ".",
  overwrite = NULL
)

Arguments

env.data: The environmental data, read in by the envDataRead function.
summary.method: If there is more than one probe (column), this method is used to summarize the continuous data across columns, giving one number per row (sample), which is then used to categorize the data.
breaks: Numerical vector indicating how to divide the continuous values into categories (see Details).
file.out: The core name of the files that will contain the categorized data (character string); the saved data can be loaded later with the envDataLoad function; default: "env_data_cat".
dir.out: The directory that will contain the saved data; defaults to current working directory.
overwrite: Whether to overwrite the output files: if NULL (default), the user is prompted for an answer; if TRUE, any existing files are overwritten automatically; if FALSE, the function stops if the output files exist.

Value

A list of ff matrices containing the categorized data (factors). The function also saves the data in two files, with the extensions .RData and .ffData.

Details

The `env.data` given here is assumed to be a set of related measurements; e.g., if the data are DNA methylation measurements for various CpGs, the CpGs might come from one region around a given SNP.

The `summary.method` argument takes a value from the following list:

`sum` - arithmetical sum of the values (default);
`average` - average value;
`NULL` - no summary; **NOTE:** if the data contains more than one column, be sure to check that the breaks give a correct division for each column!
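
To illustrate what the summary options amount to, here is a minimal base R sketch on a made-up matrix with samples in rows and probes in columns. The object names (`toy`, `row_sum`, `row_avg`, `col_ranges`) and the values are hypothetical, and the code only mimics the row-wise summaries described above; it is not taken from the package's implementation.

```r
# Toy data (made up): 4 samples (rows) x 3 probes (columns)
toy <- matrix(
  c(0.1, 0.2, 0.3,
    0.4, 0.5, 0.6,
    0.7, 0.8, 0.9,
    0.2, 0.4, 0.6),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("sample", 1:4), paste0("cpg", 1:3))
)

# Analogue of summary.method = "sum": one number per sample
row_sum <- rowSums(toy)

# Analogue of summary.method = "average": one number per sample
row_avg <- rowMeans(toy)

# With no summary, each column is categorized on its own scale,
# so it is worth inspecting the range of every column first
col_ranges <- apply(toy, 2, range)
```

Comparing `col_ranges` across columns is one way to check whether a single set of breaks would divide every column sensibly.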

When `breaks` is a single number, it defines the number of categories that the range of values will be divided into. The categories will be equal in size, based on the appropriate quantiles calculated from the summarized values.
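
The quantile-based division can be sketched in base R as follows. This is an illustration under assumptions, not the package's code: `summarized` stands in for the per-sample summary values, `n_cat` plays the role of a single-number `breaks`, and the use of `quantile` with `cut` is one plausible way to obtain equally sized categories.

```r
set.seed(1)
summarized <- rnorm(90)  # hypothetical per-sample summary values
n_cat <- 3               # plays the role of a single-number `breaks`

# Cut points at the 0, 1/3, 2/3 and 1 quantiles of the summarized values
cut_points <- quantile(summarized, probs = seq(0, 1, length.out = n_cat + 1))

# Assign each sample to a category; include.lowest keeps the minimum value
categories <- cut(summarized, breaks = cut_points, include.lowest = TRUE)

# Number of samples that fall into each category
table(categories)
```

Because the cut points are quantiles rather than equally spaced values, the categories hold the same number of samples even when the values are unevenly spread.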