Prepare data

Introduction

This vignette shows how to prepare a DNA methylation dataset that has already been read into memory, as described in the Read data vignette. This preparation is required before the methylation data can be used together with genetic data in interaction analyses in the Haplin package.

library(HaplinMethyl)
#> Loading required package: Haplin

ex_path <- system.file("extdata", package = "HaplinMethyl")
ex_file <- "env_data_test.dat"
ex_out_file <- "dnam_ex"

dnam_ex <- envDataRead(
  file.in = ex_file,
  dir.in = ex_path,
  file.out = ex_out_file,
  sep = " ",
  overwrite = TRUE
)
#> Reading the data in chunks...
#>  -- chunk 1--
#>  -- chunk 2--
#> ... done reading.
#> Preparing data...
#> ... done preparing
#> Saving data...
#> ... saved to file: ./dnam_ex_env.ffData

Subsetting

Let’s check again what the dnam_ex object contains:

dnam_ex
#> This is continuous environmental data read in by 'envDataRead'
#> with 400 columns
#> and 200 rows.
summary(dnam_ex)
#> List of 5
#>  $ class   : chr [1:2] "env.cont" "env.data"
#>  $ nrow    : int 200
#>  $ ncol    : int 400
#>  $ rownames: chr [1:200] "id1" "id2" "id3" "id4" ...
#>  $ colnames: chr [1:400] "cg1" "cg2" "cg3" "cg4" ...

When you don’t want to use the entire dataset, you can use the envDataSubset function to specify which filters to apply.

You can subset using row names, row numbers, column names, or column numbers. This is useful when you want to, e.g., extract only the measurements from one subgroup of samples or focus on specific CpGs.

Check the Finding CpGs vignette to learn how to use the functions that find CpGs within a defined region!

dnam_ex_3cpgs <- envDataSubset(
  env.data = dnam_ex,
  col.names = c("cg5", "cg7", "cg10"),
  file.out = "dnam_ex_3cpgs",
  overwrite = TRUE
)
#> Will select 3 columns.
#> opening ff /tmp/RtmpVechPi/ff/ff52eec5b90e4ba.ff
#> Saving data... 
#> ... saved to files: ./dnam_ex_3cpgs_env.ffData, ./dnam_ex_3cpgs_env.RData

This produces two new files, dnam_ex_3cpgs_env.ffData and dnam_ex_3cpgs_env.RData, which can be used to load the data faster in the next R session.

The returned object is of the same class as the original one, but contains only the three chosen CpGs:

dnam_ex_3cpgs
#> This is continuous environmental data read in by 'envDataRead'
#> with 3 columns
#> and 200 rows.
summary(dnam_ex_3cpgs)
#> List of 5
#>  $ class   : chr [1:2] "env.cont" "env.data"
#>  $ nrow    : int 200
#>  $ ncol    : int 3
#>  $ rownames: chr [1:200] "id1" "id2" "id3" "id4" ...
#>  $ colnames: chr [1:3] "cg5" "cg7" "cg10"
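
The four ways of selecting data mentioned above (row names, row numbers, column names, column numbers) can be pictured with a plain base R matrix. This is only a sketch of the selection logic: the toy matrix and all object names below are hypothetical, the values are random, and the code does not use envDataSubset or the ff-backed storage at all.

```r
# Toy stand-in for a methylation matrix: 200 samples x 400 CpGs.
# The values are random and purely illustrative (an assumption,
# not the example data shipped with the package).
set.seed(1)
toy <- matrix(
  runif(200 * 400),
  nrow = 200,
  dimnames = list(paste0("id", 1:200), paste0("cg", 1:400))
)

# Select by column names (specific CpGs) ...
by_col_name <- toy[, c("cg5", "cg7", "cg10"), drop = FALSE]
# ... or by column numbers (the same three CpGs).
by_col_num <- toy[, c(5, 7, 10), drop = FALSE]

# Select by row names (specific samples) ...
by_row_name <- toy[c("id1", "id2", "id3"), , drop = FALSE]
# ... or by row numbers (the first three samples).
by_row_num <- toy[1:3, , drop = FALSE]

# Both routes pick out the same data.
identical(by_col_name, by_col_num)
identical(by_row_name, by_row_num)
```

Selecting by name is usually the safer choice when the order of samples or CpGs might differ between files, while selecting by number is convenient for quick checks.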

Categorizing

When the measurements are continuous (as is usually the case with data from arrays), we first need to categorize them before using them in Haplin. The level of DNA methylation at a single CpG or a CpG region then dictates the stratum membership of each sample.

dnam_ex_3cpgs_cat <- envDataCategorize(
  env.data = dnam_ex_3cpgs,
  breaks = 3,
  file.out = "dnam_ex_3cpg_cat",
  overwrite = TRUE
)
#> opening ff /tmp/RtmpVechPi/ff/ff52eec1dda0819.ff
#> Creating categories: 1,2,3
#> Saving data...
#> ... saved to file: ./dnam_ex_3cpg_cat_gen.ffData
dnam_ex_3cpgs_cat
#> This is categorical environmental data read in by 'envDataRead'
#> with 1 columns
#> and 200 rows.
class(dnam_ex_3cpgs_cat)
#> [1] "env.cat"  "env.data"
showRaw(dnam_ex_3cpgs_cat)
#> opening ff /tmp/RtmpVechPi/ff/ff52eeccd64809.ff
#>     [,1]
#> id1 2   
#> id2 2   
#> id3 3   
#> id4 2   
#> id5 1   
#> attr(,"vmode")
#> [1] byte
#> Levels: 1 2 3
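
To build intuition for what turning continuous measurements into three categories means, here is a minimal base R sketch using cut(). The methylation values are made up, and the two binning rules shown (equal-width intervals and tertiles) are assumptions chosen for illustration; they may not match how envDataCategorize derives its categories internally.

```r
# Hypothetical continuous methylation values (between 0 and 1)
# for six samples at one CpG; these numbers are invented.
beta <- c(id1 = 0.10, id2 = 0.45, id3 = 0.90,
          id4 = 0.50, id5 = 0.05, id6 = 0.70)

# Option 1: split the observed range into three equal-width intervals
# and return the interval number (1, 2, or 3) for each sample.
equal_width <- cut(beta, breaks = 3, labels = FALSE)

# Option 2: use tertiles as cut points, so that each category
# holds roughly the same number of samples.
tertiles <- quantile(beta, probs = c(0, 1/3, 2/3, 1))
equal_count <- cut(beta, breaks = tertiles, labels = FALSE,
                   include.lowest = TRUE)

# Compare the two assignments side by side.
data.frame(beta, equal_width, equal_count)
```

Whichever rule is used, each sample ends up in exactly one stratum, which is what a stratified analysis needs.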

Julia Romanowska

2024-01-17
