Tag: open science

Working your flow.

Grad school is essentially a juggling act: you not only have to perform in your classes, but you also have to work as a TA or RA, do your own research, and make sure you nurture your social life so that you don’t go insane. One way to make this task much more manageable can be summed up in the immortal words of Scrooge McDuck: work smarter, not harder.

One way to work smarter, not harder, is to make sure you’re using the right technological tools. For me, in particular, two issues stuck out while writing my master’s thesis:

  1. Finding a way to easily collaborate with my committee members
  2. Finding a way to deal with numerous drafts

These issues were interrelated, too. At times, I had already written several new drafts by the time I got comments back on a draft I had e-mailed to a committee member, which could lead to confusion, or simply waste time when I had already spotted and corrected an issue a comment pointed out. These issues really apply to all research, though, as all research can (and to some degree probably should) involve collaboration and numerous rewrites.

My solution to these two issues has been to develop a workflow that treats writing a paper more like writing software. In a way, I mean this literally, as my current workflow involves programming rather than using Word to write papers, Excel to manage data, or something like SPSS to perform statistical analyses. Those programs are nice in that they’re pretty easy to learn, but they don’t integrate smoothly and transparently with each other, they don’t play well with the internet and cloud systems, and they create all sorts of compatibility issues if your collaborators aren’t using the exact same tools. The alternative is using tools that involve only standardized, text-based file types. This means learning how to code. I know that sounds scary to many people, and the learning curve is certainly steeper than figuring out where the table button is in Word, but that curve is, in my opinion, often overstated, and the payoff of overcoming it is pretty great. Remember, work smarter, not harder.

The first component of my current workflow is the website ShareLaTeX.1 This site allows you to produce .pdf documents using a markup language called LaTeX. A markup language is a very simple kind of language that lets you format plain text by occasionally inserting little tags when you need to make something bold, create a header, or whatever. You write your text, tags and all, in an input file, in this case a .tex file, and compiling that file spits out a .pdf document. For instance, if I wrote the following in my .tex file:

This is my super cool first section where I talk about DuckTales.
\section{Get to the point}
This is my next section where I get to the point and say something useful.

I would get a .pdf that looks something like this:

This is my super cool first section where I talk about DuckTales.

Get to the point

This is my next section where I get to the point and say something useful.

That’s more or less it. You can do quite a bit with LaTeX (there are numerous independently developed packages that extend its capabilities well beyond what the base system can do), but, for many researchers, a few days of running through tutorials and hunting down the tags you need, e.g. for footnotes, tables, citations, or perhaps syntax trees for linguists, will teach you almost everything you’ll ever use. There are many offline editors that allow you to write and compile .tex files, but ShareLaTeX itself is an online editor, so you can avoid figuring out how to install LaTeX and an editor on your laptop by using the site. An added bonus is that if the phrase “compiling a .tex file” sounds intimidating, ShareLaTeX simply gives you a “Compile” button that does it all for you and shows you the resulting .pdf document.
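
Just to give a flavor of what those tags look like, here’s the kind of thing you might end up hunting down (the text itself is made up, but the tags are standard LaTeX):

Tokens were coded by two research assistants.\footnote{Thanks to both of them!}

\begin{itemize}
  \item \textbf{lol} written entirely in lowercase
  \item \textit{any other realization}
\end{itemize}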

ShareLaTeX has many other bonuses, though, because it’s really a collaboration tool. You create a project and invite collaborators or advisors, who then have real-time access to everything in that project and to any changes as they’re made. In my case, a project might include a .tex file, a .bib file2 containing my list of references (a standardized, human-readable text file format for automatically handling citations in .tex documents), .csv files for data (again, a standardized, human-readable text file format), and .R scripts that perform statistical analyses and produce figures and tables (again, non-proprietary, human-readable text files). Collaborators can comment on the text, check out the data that text was based on, and see exactly how you analyzed it, all in one place. ShareLaTeX even has a form of version control, so you can get back to an earlier draft of your paper if necessary, and collaborators can see how each paragraph has changed. It’s basically a super-powered Google Docs, and ultimately far more efficient than trying to build your own version control system out of a bunch of Word documents that you have to e-mail back and forth to each collaborator separately.
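
To make the .bib part concrete, a .bib file is nothing more than a list of entries like the following (this one is made up, in keeping with the DuckTales theme):

@article{mcduck1987,
  author  = {McDuck, Scrooge},
  title   = {Work Smarter, Not Harder},
  journal = {Journal of Duckburg Economics},
  year    = {1987},
  volume  = {3},
  pages   = {1--15}
}

In the .tex file, you then write \cite{mcduck1987} wherever the citation should appear and add \bibliographystyle{apalike} and \bibliography{references} (assuming the file is named references.bib, and swapping in the biblatex equivalents if that’s what you use) where the bibliography should go; the in-text citation and the reference list are generated for you.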

Another big advantage of writing your papers in LaTeX is that you can add R code directly into a LaTeX document3 via an R package called knitr. What this means is that when your analyses or your data change, your write-up is automatically updated as well. No longer do you need to tediously figure out how you generated a figure or a number, go back to another program, change your analysis, regenerate the figure or number, export an image, swap the old image for the new one in your document, and then hope you didn’t miss a reference to it somewhere else in your paper. Instead, you find the relevant chunk of code in your .tex file, change what needs changing, press Compile, and you’re done. Remember, work smarter, not harder.
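
Here’s a minimal sketch of what that looks like, using the usual knitr chunk syntax (the file name and column name are hypothetical, and the exact chunk delimiters can vary a bit depending on how your editor runs knitr):

\documentclass{article}
\begin{document}

<<variant-counts, echo=FALSE>>=
# read the project's data file (hypothetical name)
tokens <- read.csv("tokens.csv")
@

We collected \Sexpr{nrow(tokens)} tokens in total.

<<variant-figure, echo=FALSE, fig.height=4>>=
# this figure is regenerated every time the document is compiled
barplot(table(tokens$variant))
@

\end{document}

Change the data or the analysis, press Compile, and the token count and the figure update themselves.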

What this also means is that you don’t want to learn just LaTeX, you want to learn R, too. R is another programming language, one specifically designed for doing statistics. There’s more of a learning curve for R than for LaTeX, but R is also extremely popular. There’s a very good chance that your university offers seminars for learning it or that you have a colleague who knows it already. In the unlikely event that neither of those things is true, there are also a huge number of online tutorials and free courses for learning it.4 As with LaTeX, R can do far more than you’ll ever need it to do, so the trick is to learn some basics and then focus on what you need for whatever project you’re doing. In my case, I taught myself enough R to analyze all the data for my thesis in about two weeks.
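
For a sense of what “enough R” can mean, here are a few lines of base R (the file and column names are hypothetical):

# read a spreadsheet that was exported as a .csv file
tokens <- read.csv("thesis_data.csv")

# what columns are there, and what kind of data do they contain?
str(tokens)

# how often does each realization of the variable occur?
table(tokens$variant)

# cross-tabulate the variable against another factor, say age group
table(tokens$variant, tokens$age_group)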

So ShareLaTeX provides an online environment to store your write-up, your references, your data, and your analyses. It also provides exceptional tools for collaborating. What it’s not so great for is sharing your work with the world. I don’t just mean publishing a paper, but making your data and analyses available to the public and other researchers for free, i.e. partaking in open science. To accomplish this, we need a second component for our workflow, a website called GitHub. GitHub technically exists for developing open source software, so why use it for science? Because this is the future, and in the future we do open science, and we write papers as if we’re writing software.

Another reason is that we can automatically sync our projects to GitHub from ShareLaTeX, and GitHub can then sync them to numerous other sites. You can even sync a GitHub project with a local version on your laptop using Git, which is simply the version control software that GitHub itself is built on. In this way, you can work without internet access but still maintain a consistent system of drafts and rewrites that doesn’t get confusing. In fact, that’s really the whole purpose of Git and GitHub: they keep track of changes to text files (e.g. .tex files or .bib files or .csv files or .R scripts) so that you don’t have to. This, combined with GitHub’s popularity, makes it the perfect tool to act as the hub for your workflow.
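
If you’ve never touched Git, working offline really comes down to a handful of commands (the URL is a placeholder for your own project):

git clone https://github.com/yourname/yourproject.git
# ... edit your .tex, .bib, .csv, and .R files offline ...
git add .
git commit -m "Rewrite the discussion section"
git push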

But GitHub is also very social. Once you have a project on the site, anyone can make their own copy of it and do what they want with it. Perhaps they had a better idea for how you could have done a statistical analysis: they can literally redo it themselves and then merge their changes back into your project, with your approval of course. Perhaps they want to replicate your study or apply new analyses to your data: this is all perfectly simple once the project is on GitHub. This is how open source software development has worked for a while, and scientific research can just as easily benefit from this sort of workflow and openness.

Still, GitHub is not a science-centric site. This means it’s missing two important elements: 1) it doesn’t make your project super-visible to other researchers, and 2) it doesn’t facilitate open science processes like preregistrations and publishing preprints.5 Luckily, GitHub allows you to sync your project to the Open Science Framework (OSF), the third component of our workflow, which can handle all of the above. The OSF is not so great for collaboration, even though it has a rudimentary commenting system and a rudimentary version control system, but it’s an ideal place to create preregistrations, increasing the validity and transparency of your research. These can then be linked to preprint versions of your work, which can be uploaded to preprint servers like arXiv or, if you’re in the social sciences like me, SocArXiv (which is itself built on the OSF). In fact, ShareLaTeX, once merged with Overleaf, will most likely support direct submission to preprint servers, formatting and all, since Overleaf already has this feature.

So, to summarize, the workflow described here has four components:

  1. ShareLaTeX (your main work area for you and your named collaborators)
  2. GitHub (a central hub that makes your work public and allows for anonymous collaboration)
  3. The OSF (fulfills all your open science and study design validity needs)
  4. Git (your offline tool for working without internet access)

Using these tools involves more of a learning curve than the old-fashioned Word + e-mail methods (you’ll need to learn some LaTeX, some R, and how Git and GitHub work, though these latter two can really be learned in just a couple hours), but once you get over that curve, your life will be significantly easier. You’ll be able to spend your time thinking about your research instead of thinking about how to manage your research and how to keep everyone in the loop. This is the essence of working smarter, not harder: if you put a little more effort in on the front end, you’ll catch up and get far ahead on the back end.


  1. At the time of this writing, ShareLaTeX is in the process of merging with a similar site called Overleaf, which will eventually yield Overleaf v2. I’ve been beta testing v2, however, and it appears, thankfully, to be almost identical to ShareLaTeX.
  2. This is a really important concept for working smarter, not harder, as well: do not ever deal with references and citations by hand. Personally, I recommend using Zotero for reference management. Zotero allows you to import sources, complete with .pdfs, with the push of a button in your web browser, and it syncs these sources to its website as well as to any other computers running the application. You can then create .bib files directly from Zotero, which lets you create in-text citations and automatically generate bibliographies in your .tex document.
  3. On ShareLaTeX, you rename your file from .tex to .Rtex to accomplish this. Otherwise, nothing changes.
  4. To maybe a lesser extent, this is all true of LaTeX, as well. LaTeX has been around for decades and has been the method of choice for writing papers in many math-heavy fields.
  5. I won’t get into the benefits of using registrations and preprints here, as I’m just trying to outline an effective workflow, but I highly recommend looking into them.

Interpreting uninterpretable P-values.

Lately, I’ve been trying to learn more about open science and how it relates to research I’ve done, research I’d like to do, and sociolinguistics in general. One topic that comes up regularly when talking about open science is pre-registration. For those who aren’t familiar with the process, pre-registration refers to publishing a detailed, time-stamped description of your research methods and analyses on some repository before ever actually looking at your data. Doing so increases transparency and helps the researcher avoid P-hacking, aka data fishing1. There are apparently some arguments against pre-registering research, but I’ve yet to see any that don’t mischaracterize what pre-registration actually is, so it seems like a no-brainer to do it.

But in looking into the actual mechanics of producing a pre-registration, I ended up watching a webinar from the Center for Open Science (COS) about using their Open Science Framework (OSF) to publish pre-registrations, which included a curious description of how to interpret P-values in different kinds of research2.

Basically, the claim is that pre-registration makes it clear which analyses are confirmatory3 and which are exploratory, which is great, but the other part of the claim is that P-values are uninterpretable in exploratory research. In other words, any P-values that are generated through analyses that weren’t pre-registered, i.e. through data fishing, are meaningless.

I can understand why this point is made, but I think it’s a bad one. Pre-registration does seem to create another level in the hierarchy of types of research — i.e. exploratory (observational, not pre-registered) > confirmatory (observational, pre-registered) > causal (experimental) — but I see no reason why P-values should be uninterpretable at the exploratory level. It would seem that P-values are perfectly valid at all levels, and all that changes is how they should be interpreted, not whether they can be interpreted at all. To me, in experimental research, a P-value helps one argue for a causal relationship. In confirmatory observational studies, it helps one argue that some relationship exists, though not necessarily a causal one. And in exploratory observational research, it simply suggests that there might be a relationship, and that this potential relationship should be explored further in future research.

In the case of my thesis, I did employ P-values via Fisher’s exact test of independence, but I didn’t pre-register my analyses. That’s not to say that all my analyses were exploratory, just that I have no proof that I wasn’t data fishing. Indeed, I included variables that didn’t make any sense to include at all4 but still somehow turned out to be statistically significant, such as whether there was a relationship between the person who coded each token of my linguistic variable, (lol), and how that variable was realized. The webinar initially made me panic a bit, asking myself whether it was irresponsible to have included P-values in my analyses, but after further reflection, I think it was completely justified. Most of my analyses were confirmatory anyway, even though I can’t prove it, and those that were arguably exploratory were still more useful to report with P-values, as long as an explanation of how to interpret those P-values was included as well, which is perhaps the one place where I could’ve done better.
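
For what it’s worth, that kind of check is only a couple of lines of R. Something like the following, with made-up counts rather than my actual thesis data:

# hypothetical 2x2 table: which coder coded each token vs. how (lol) was realized
counts <- matrix(c(120, 35,
                    98, 61),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(coder = c("Coder A", "Coder B"),
                                 variant = c("lol", "other")))

# Fisher's exact test of independence
fisher.test(counts)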

Ultimately, while I can understand why there’s so much focus on data fishing as a negative thing, I think it’s important to not overshoot the mark. P-values can certainly be misused, but that misuse seems to come down to not providing enough information to allow the reader to properly interpret them, not to whether they were included when they shouldn’t have been.


1. I prefer the term data fishing, which can be more easily taken in both a negative and a positive way, whereas P-hacking sounds like it’s always negative to me. The Wikipedia article on data fishing gives a pretty clear explanation of what it is, for those who are unaware.
2. The webinar is really good, actually. I would suggest that anyone who’s new to open science watch the whole thing.
3. In this case, the speaker seems to be using the term “confirmatory research” as something different from “causal research”, otherwise their description doesn’t make any sense.
4. In fact, my thesis advisor didn’t see the point in me including these variables at all.

