Introduction
This vignette shows how to prepare a DNA methylation dataset that you
have already read into memory, as described in the Read data vignette.
This preparation is necessary before the methylation data can be used
together with genetic data in interaction analyses with the Haplin package.
library(HaplinMethyl)
#> Loading required package: Haplin
ex_path <- system.file("extdata", package = "HaplinMethyl")
ex_file <- "env_data_test.dat"
ex_out_file <- "dnam_ex"
dnam_ex <- envDataRead(
  file.in = ex_file,
  dir.in = ex_path,
  file.out = ex_out_file,
  sep = " ",
  overwrite = TRUE
)
#> Reading the data in chunks...
#> -- chunk 1--
#> -- chunk 2--
#> ... done reading.
#> Preparing data...
#> ... done preparing
#> Saving data...
#> ... saved to file: ./dnam_ex_env.ffData
Subsetting
Let’s check again what the dnam_ex object contains:
dnam_ex
#> This is continuous environmental data read in by 'envDataRead'
#> with 400 columns
#> and 200 rows.
summary(dnam_ex)
#> List of 5
#> $ class : chr [1:2] "env.cont" "env.data"
#> $ nrow : int 200
#> $ ncol : int 400
#> $ rownames: chr [1:200] "id1" "id2" "id3" "id4" ...
#> $ colnames: chr [1:400] "cg1" "cg2" "cg3" "cg4" ...
When you don’t want to use the entire dataset, you can use the
envDataSubset function to specify which rows and columns to keep.
You can subset using row names, row numbers, column names, or column numbers. This is useful when you want to, e.g., extract only the measurements from one subgroup of samples or focus on specific CpGs.
See the Finding CpGs vignette to learn how to use the functions that find CpGs within a defined region.
dnam_ex_3cpgs <- envDataSubset(
  env.data = dnam_ex,
  col.names = c("cg5", "cg7", "cg10"),
  file.out = "dnam_ex_3cpgs",
  overwrite = TRUE
)
#> Will select 3 columns.
#> opening ff /tmp/RtmpVechPi/ff/ff52eec5b90e4ba.ff
#> Saving data...
#> ... saved to files: ./dnam_ex_3cpgs_env.ffData, ./dnam_ex_3cpgs_env.RData
This produces two new files, dnam_ex_3cpgs_env.ffData and
dnam_ex_3cpgs_env.RData, which can be used to load the data faster
in a later R session.
The returned object is of the same class as the original one, but contains only the three chosen CpGs:
dnam_ex_3cpgs
#> This is continuous environmental data read in by 'envDataRead'
#> with 3 columns
#> and 200 rows.
summary(dnam_ex_3cpgs)
#> List of 5
#> $ class : chr [1:2] "env.cont" "env.data"
#> $ nrow : int 200
#> $ ncol : int 3
#> $ rownames: chr [1:200] "id1" "id2" "id3" "id4" ...
#> $ colnames: chr [1:3] "cg5" "cg7" "cg10"
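To make the difference between selecting by name and selecting by number concrete, here is a minimal sketch on a plain base-R matrix. It does not call envDataSubset and does not reflect how that function works internally; the object toy and its values are made up purely for illustration.

```r
# Toy stand-in for a methylation matrix: 5 samples x 4 CpGs (made-up values).
set.seed(1)
toy <- matrix(
  runif(20),
  nrow = 5,
  dimnames = list(paste0("id", 1:5), paste0("cg", 1:4))
)

# Select columns (CpGs) by name ...
by_col_name <- toy[, c("cg2", "cg4"), drop = FALSE]
# ... or by position; both pick the same two CpGs here.
by_col_number <- toy[, c(2, 4), drop = FALSE]
identical(by_col_name, by_col_number)

# Rows (samples) work the same way, by name or by number.
by_row_name <- toy[c("id1", "id3"), , drop = FALSE]
by_row_number <- toy[c(1, 3), , drop = FALSE]
identical(by_row_name, by_row_number)
```

Selecting by name is usually the safer choice, because it keeps working if the order of samples or CpGs in the file changes.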
Categorizing
When the measurements are continuous (as is usually the case with array
data), we first need to categorize them before using them in Haplin.
The level of DNA methylation at a single CpG or across a CpG region then
determines the stratum membership of each sample. In the call that follows,
breaks = 3 yields three categories, labelled 1, 2, and 3.
dnam_ex_3cpgs_cat <- envDataCategorize(
  env.data = dnam_ex_3cpgs,
  breaks = 3,
  file.out = "dnam_ex_3cpg_cat",
  overwrite = TRUE
)
#> opening ff /tmp/RtmpVechPi/ff/ff52eec1dda0819.ff
#> Creating categories: 1,2,3
#> Saving data...
#> ... saved to file: ./dnam_ex_3cpg_cat_gen.ffData
dnam_ex_3cpgs_cat
#> This is categorical environmental data read in by 'envDataRead'
#> with 1 columns
#> and 200 rows.
class(dnam_ex_3cpgs_cat)
#> [1] "env.cat" "env.data"
showRaw(dnam_ex_3cpgs_cat)
#> opening ff /tmp/RtmpVechPi/ff/ff52eeccd64809.ff
#> [,1]
#> id1 2
#> id2 2
#> id3 3
#> id4 2
#> id5 1
#> attr(,"vmode")
#> [1] byte
#> Levels: 1 2 3
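As a rough illustration of what turning continuous methylation levels into three strata can look like, the sketch below uses base R only. It assumes quantile-based cut points, which is an assumption made for this example; this vignette does not state how envDataCategorize chooses its cut points, so the sketch should not be read as that function's implementation. The values in beta are made up.

```r
# Made-up methylation levels (values between 0 and 1) for 9 samples.
beta <- c(id1 = 0.12, id2 = 0.48, id3 = 0.91, id4 = 0.33, id5 = 0.75,
          id6 = 0.05, id7 = 0.60, id8 = 0.27, id9 = 0.84)

n_cat <- 3
# Cut points at equally spaced quantiles (an assumption for this sketch).
cut_points <- quantile(beta, probs = seq(0, 1, length.out = n_cat + 1))

# Assign each sample to a category 1..n_cat;
# include.lowest = TRUE keeps the smallest value in category 1.
category <- cut(beta, breaks = cut_points, labels = seq_len(n_cat),
                include.lowest = TRUE)
names(category) <- names(beta)

# Number of samples per category.
table(category)
```

Each sample ends up with a single category label, which is the kind of stratum membership an interaction analysis needs.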