Exercises

Author

Julia Romanowska

1 Introduction

What will we learn today?

  1. export data to a text file

  2. create a table in R/STATA

  3. describe your data

  4. put on GitHub/Dataverse

2 What is a text file?

Files produced by default by Word or Excel are binary files, not text files!

| text file | not a text file |
|-----------|-----------------|
| .txt      | .doc/.docx      |
| .csv      | .xls/.xlsx      |
| .tab      | .sav            |

A text file contains only the characters we type explicitly, so it can be opened and read in any plain-text editor, e.g., Notepad.

NB: When exporting a table from Excel to CSV, make sure that you have not applied any formatting changes (numbers are still numbers) and that each sheet holds exactly one table.
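A quick way to convince yourself that an exported file really is a text file is to read it back line by line. The following is a minimal base R sketch; the data frame `demo`, its column names, and the temporary file are made up for illustration.

```r
# Hypothetical example data; replace with your own table
demo <- data.frame(
  id = 1:3,
  weight_kg = c(70.5, 82.1, 64.0)
)

# Write to a temporary CSV file (a text file)
csv_file <- tempfile(fileext = ".csv")
write.csv(demo, file = csv_file, row.names = FALSE)

# A text file can be read as plain lines of characters
csv_lines <- readLines(csv_file)
print(csv_lines)

# Reading the file back as a table gives numeric columns, not text
demo_back <- read.csv(csv_file)
str(demo_back)
```

If `str()` reports a column as `chr` where you expected numbers, something in that column (a unit, a comma, a stray symbol) is probably being stored as text.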

3 Create a table

3.1 Table 1

You want this table to be visually appealing, because it goes into the submitted paper. There are several ways to get from the raw data to the “Table 1” format while keeping the code you used to generate it.

DISCLAIMER: I don’t know STATA well, so please consult your local STATA expert when you need help there.

3.1.1 In STATA

The newest version, STATA 18, has a built-in command you can use to create a nice “Table 1”. Check the documentation here.

In older STATA versions, you can use some extra packages, e.g., table1_mc.

3.1.2 In R

One of the packages I use is {gtsummary}. The webpage has nice tutorials; in short, the package summarizes the dataset in one go.

Code
library(gtsummary)

table1 <- tbl_summary(
    trial,
    include = c(age, grade, response),
    by = trt, # split table by group
    missing = "no" # don't list missing data separately
  ) %>%
  add_n() %>% # add column with total number of non-missing observations
  add_p() %>% # test for a difference between groups
  modify_header(label = "**Variable**") %>% # update the column header
  bold_labels()
table1
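| Variable       | N   | Drug A, N = 98 [1] | Drug B, N = 102 [1] | p-value [2] |
|----------------|-----|--------------------|---------------------|-------------|
| Age            | 189 | 46 (37, 59)        | 48 (39, 56)         | 0.7         |
| Grade          | 200 |                    |                     | 0.9         |
|     I          |     | 35 (36%)           | 33 (32%)            |             |
|     II         |     | 32 (33%)           | 36 (35%)            |             |
|     III        |     | 31 (32%)           | 33 (32%)            |             |
| Tumor Response | 193 | 28 (29%)           | 33 (34%)            | 0.5         |

[1] Median (IQR); n (%)
[2] Wilcoxon rank sum test; Pearson’s Chi-squared test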

You can use the output as is or style it with, e.g., the {gt} package. When you’re done, you can export it to many formats: .docx, .png, .html, .rtf, .tex, .ltx.

Code
table1 %>%
  as_gt() %>%
  gt::gtsave(filename = "tab01_study_pop_description.docx")

3.2 Any table with results

Results, on the other hand, should be saved in a file format that anyone can read easily, even without access to Word or Excel. Likewise, when you want to import a dataset to analyze it, a text file is much easier to import (see Section 2).

Importantly, the exported results should be raw, i.e.:

  • numbers are numbers (e.g., “1×10^-5” or “0.05*” in a p-value column is text, not a number; plain “0.00001” or “1e-05” can be read as a number),
  • numbers are not rounded or truncated (e.g., give the actual p-value instead of “<0.05”),
  • dates are in ISO format (“YYYY-MM-DD”).
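The points above can be checked in R before exporting. The following is a minimal sketch in base R; the `results` data frame, its column names, and all values are made up for illustration.

```r
# Hypothetical results table; all names and values are made up
results <- data.frame(
  term = c("age", "bmi"),
  estimate = c(0.0123456, -1.9876543),
  p_value = c(0.000734, 0.0412),
  analysis_date = as.Date(c("2024-03-01", "2024-03-01"))
)

# Check that the columns have the expected types before exporting
stopifnot(
  is.numeric(results$estimate),
  is.numeric(results$p_value),
  inherits(results$analysis_date, "Date")
)

# Columns of class Date are written as YYYY-MM-DD
out_file <- tempfile(fileext = ".csv")
write.csv(results, file = out_file, row.names = FALSE)

# Round or abbreviate only for display (e.g., in the paper),
# never in the exported file
formatted <- format.pval(results$p_value, digits = 2, eps = 0.001)
```

Keeping the formatted version in a separate object means the exported file always holds the full-precision values.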

When sharing your results as supplementary material to a publication, make sure you’re sharing raw results in a text file.

3.2.1 In STATA

After collecting the results from your analysis (with the collect command), use the command outsheet (newer STATA versions document export delimited as its replacement). You can read about this command in the official documentation and in a short tutorial.

3.2.2 In R

I recommend using the {tidyverse} and {here} packages.

The {here} package helps you define file paths relative to your current project “home” location, so that you don’t have to worry whether the files are saved on C:/My strange location or maybe S:/My other folder.

{tidyverse} is a group of packages, one of which, {readr}, handles reading and writing files very nicely. To save a .csv file, write:

Code
library(readr) # or: library(tidyverse)
library(here)  # provides here() for project-relative paths

write_csv(
  object,
  file = here("RESULTS", "01_linreg_no_adj_results.csv")
)

This will write your object to the file under the directory “RESULTS”. If you prefer a tab-delimited file instead, use the write_tsv() function, or write_delim() with delim = "\t" (the default delimiter of write_delim() is a space).

If using base R, just write:

Code
write.csv(
  object,
  file = here::here("RESULTS", "01_linreg_no_adj_results.csv"), # here:: works without library(here)
  row.names = FALSE
)
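For a tab-delimited file in base R, write.table() with sep = "\t" is one option. The following is a minimal sketch; the example data frame is made up, and a temporary folder stands in for the project directory. It also creates the “RESULTS” folder first, because the write functions do not create missing directories.

```r
# Hypothetical results; replace with your own object
object <- data.frame(
  term = c("age", "bmi"),
  estimate = c(0.012, -1.988)
)

# Temporary folder standing in for the project "home" (assumption for this sketch)
results_dir <- file.path(tempdir(), "RESULTS")
dir.create(results_dir, showWarnings = FALSE, recursive = TRUE)

tab_file <- file.path(results_dir, "01_linreg_no_adj_results.tab")
write.table(
  object,
  file = tab_file,
  sep = "\t",        # tab as the column separator
  quote = FALSE,     # no quotation marks around text values
  row.names = FALSE  # do not write row numbers as an extra column
)

# Read the file back to confirm that the structure survived
object_back <- read.delim(tab_file)
```

In a real project you would replace `results_dir` with a path built by here(), as in the examples above.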

4 Describe your data

Now that you have exported your results into a simple text file, don’t stop there! The data is not yet ready to be published: there may be column names that are hard to understand, categorical data coded as numbers, or special issues with the data that need explaining before anyone can proceed with their own analysis.

Therefore… metadata is as important as the raw data!

Create a short document (in Word, in Markdown, in HTML - whatever you prefer) where you explain everything that you know about this data.
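One possible form for part of such a document is a data dictionary: a small table with one row per column of your dataset. The following is a minimal base R sketch; all variable names, descriptions, and codes are invented for illustration and are not taken from any real dataset.

```r
# Hypothetical data dictionary (codebook); all names and codes are made up
data_dictionary <- data.frame(
  variable = c("id", "sex", "bmi", "visit_date"),
  description = c(
    "Participant identifier",
    "Sex of the participant",
    "Body mass index",
    "Date of the clinical visit"
  ),
  type = c("integer", "categorical", "numeric", "date"),
  units_or_codes = c("", "1 = female; 2 = male", "kg/m^2", "YYYY-MM-DD"),
  stringsAsFactors = FALSE
)

# Save the dictionary as a text file next to the data it describes
# (a temporary file is used here only to keep the sketch self-contained)
dict_file <- tempfile(fileext = ".csv")
write.csv(data_dictionary, file = dict_file, row.names = FALSE)
```

A table like this covers column names and coding; known problems with the data still need a few sentences of free text.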

For good examples, check out the {medicaldata} package. In R, all the available datasets have an explanation, which you can view by using the ? command, e.g.,

Code
?medicaldata::covid_testing

5 (*) Put on GitHub/Dataverse

When you’re done with all that, you can add the produced files to the supplementary material of the paper you publish. You can also make them more discoverable by publishing them in a public repository.

GitHub and GitLab are services where you can store files and their versions, together with the history of changes. It’s therefore useful to start a repository when you start your analysis (it can be a private repository), so that you know what has been done.

Dataverse is a database where you can store and share data/results and get a citable DOI.