Files produced by default by Word or Excel are binary files, not text files!
| text file | not a text file |
|-----------|-----------------|
| .txt      | .doc/.docx      |
| .csv      | .xls/.xlsx     |
| .tab      | .sav            |
A text file contains only the characters we type explicitly, i.e., it can be opened and read in a plain editor such as Notepad.
NB: When exporting a table from Excel to CSV, make sure you have not applied any formatting (numbers are still numbers) and that there is only one table per sheet.
3 Create a table
3.1 Table 1
You want this table to be visually appealing, ready to put in the submitted paper. There are several ways to get from the raw data to the “Table 1” format while keeping the code you used to generate it.
DISCLAIMER: I don’t know STATA well, so please consult your local STATA expert when you need help there.
3.1.1 In STATA
The newest release, STATA 18, has a built-in command you can use to create a nice “Table 1”. Check the official documentation.
In older STATA versions, you can use some extra packages, e.g., table1_mc.
3.1.2 In R
One of the packages I use is {gtsummary}. The webpage has nice tutorials, but in short, it summarizes the dataset in one go.
Code
library(gtsummary)

table1 <- tbl_summary(
  trial,
  include = c(age, grade, response),
  by = trt,       # split table by group
  missing = "no"  # don't list missing data separately
) %>%
  add_n() %>%     # add column with total number of non-missing observations
  add_p() %>%     # test for a difference between groups
  modify_header(label = "**Variable**") %>%  # update the column header
  bold_labels()

table1
| Variable       | N   | Drug A, N = 98¹ | Drug B, N = 102¹ | p-value² |
|----------------|-----|-----------------|------------------|----------|
| Age            | 189 | 46 (37, 59)     | 48 (39, 56)      | 0.7      |
| Grade          | 200 |                 |                  | 0.9      |
| I              |     | 35 (36%)        | 33 (32%)         |          |
| II             |     | 32 (33%)        | 36 (35%)         |          |
| III            |     | 31 (32%)        | 33 (32%)         |          |
| Tumor Response | 193 | 28 (29%)        | 33 (34%)         | 0.5      |

¹ Median (IQR); n (%)
² Wilcoxon rank sum test; Pearson’s Chi-squared test
You can use the output as is or stylize it with, e.g., {gt} package. When you’re done, you can export it to many formats: .docx, .png, .html, .rtf, .tex, .ltx.
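For example, a sketch of the {gt} route: gtsummary tables convert to {gt} objects with as_gt(), and gtsave() infers the output format from the file extension. The title and file name here are illustrative.

```r
library(gtsummary)
library(gt)

# Build a basic "Table 1" from the trial dataset shipped with {gtsummary},
# style it with {gt}, and export it. The file name is illustrative.
tbl_summary(trial, include = c(age, grade, response), by = trt) %>%
  as_gt() %>%                        # convert to a {gt} object for styling
  tab_header(title = "Table 1") %>%  # add a title (illustrative)
  gtsave("table1.html")              # .docx, .png, .rtf, .tex also work
```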
3.2 Export results
On the other hand, when you have some results, you want to save them in a file format that can be easily read by anyone, even without access to Word or Excel. Moreover, when you want to import a dataset to analyze it, it is always much easier to import a text file (see Section 2).
Importantly, the exported results should be raw, i.e.:
numbers are numbers (e.g., “1e-05” in a p-value column is not a number),
numbers are not rounded (e.g., “<0.05” in a p-value column),
dates are in ISO format (“YYYY-MM-DD”).
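The difference can be sketched in R: values formatted for display become strings, which break downstream parsing, while raw values stay numeric. The dates below are illustrative.

```r
# Formatted values become strings -- fine for display, wrong for export.
p_raw <- 1e-05                              # a number
p_fmt <- format(p_raw, scientific = TRUE)   # "1e-05", a character string

is.numeric(p_raw)  # TRUE
is.numeric(p_fmt)  # FALSE

# Dates: keep ISO 8601 ("YYYY-MM-DD") so any tool can parse them.
iso <- format(as.Date("2024-07-01"), "%Y-%m-%d")
iso  # "2024-07-01"
```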
When sharing your results as a supplementary material to a publication, make sure you’re sharing raw results in a text file.
3.2.1 In STATA
After collecting the results from your analysis (with the collect command), use the command outsheet. You can read about this command in the official documentation and in a short tutorial.
3.2.2 In R
The {here} package will help you define paths to files relative to your current project’s “home” location, so that you don’t have to think about whether the files are saved on C:/My strange location or maybe S:/My other folder.
{tidyverse} is a collection of packages; one of them, {readr}, handles reading and writing files very nicely. To save a .csv file, write:
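A minimal sketch combining {here} and {readr} (the results data frame and file name are illustrative):

```r
library(here)
library(readr)

# Illustrative results; in practice this comes from your analysis.
results <- data.frame(
  term     = c("age", "grade"),
  estimate = c(0.012, -0.340),
  p_value  = c(0.71, 0.09)
)

# here() resolves the path relative to the project root,
# so the same code works on any machine.
write_csv(results, here("model_results.csv"))
```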
4 Add metadata
Now that you have exported your results into a simple text file, don’t stop there! The data is not yet ready to be published: there might be column names that are not understandable, categorical data represented as numbers, or special issues with the data that need explanation before one can proceed with their analysis.
Therefore… metadata is as important as the raw data!
Create a short document (in Word, in Markdown, in HTML - whatever you prefer) where you explain everything that you know about this data.
For good examples, check out the {medicaldata} package. In R, all the available datasets have an explanation which you can view using the ? command, e.g.,
Code
?medicaldata::covid_testing
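Beyond prose descriptions, you can generate a starting point for such a codebook directly from the data. A minimal base-R sketch (the data frame and column names are illustrative):

```r
# Illustrative dataset; in practice this is your exported table.
df <- data.frame(
  pat_id = c(1L, 2L),
  trt    = c("Drug A", "Drug B"),
  age    = c(46, 48)
)

# List each column with its type and an example value,
# then save the skeleton alongside the data to fill in descriptions.
codebook <- data.frame(
  variable = names(df),
  type     = vapply(df, function(x) class(x)[1], character(1)),
  example  = vapply(df, function(x) as.character(x[1]), character(1))
)

write.csv(codebook, "codebook.csv", row.names = FALSE)
```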
5 (*) Put on GitHub/Dataverse
When you’re done with all that, you can add the produced files to the supplementary material where you publish the paper. However, you can always make your work more discoverable by publishing it in a public repository.
GitHub/GitLab is a service where you can store files and their versions, with a history of changes. It’s therefore useful to start a repository when you start your analysis (it can be a private repository), so that you know what has been done.
Dataverse is a database where you can store and share data/results and get a citable DOI.