Seven Anti-Patterns Found in Systems Papers

Philipp Leitner
7 min read · Oct 12, 2017


We all know and love the 9 circles of scientific hell (p-value hacking, overselling of data, etc.) that were prominently making the rounds on social media a while ago. While interesting, it’s a pity that they have a fairly quantitative and stats-oriented slant to them, which makes them hard to relate to the more computer-systems-oriented research that I often work on with my students.

Of course this does not mean that there are no bad practices in those papers; quite the contrary, I am afraid to say. So, without further ado, I hereby present my personal best-of of questionable academic practices that I commonly observe in systems papers, and/or that I have committed myself in the past. Given my own background, the examples are mostly drawn from services computing, cloud computing, and software engineering, but from what I have seen and read, the same patterns apply broadly across most systems research.

Words mean whatever I say they mean.

So let’s say you have produced a great research paper, but there is one problem: the results are kinda niche, and the domain they are applicable to isn’t the sexiest or most commercially relevant in the world, if it even exists. But those pesky reviewers at top conferences tend to get so much more excited when you can claim in your introduction that your paper is a really, really important contribution to the current buzzword domain.

Not to worry: words, and even more so definitions, are patient, and don’t mind if you stretch them a little. Sure, your algorithm might essentially be an OS scheduling contribution, but why not call it container scheduling if that’s the buzz of the month? Alas, three problems tend to arise from this situation:

  • The “Loose Goose” is when, as much as you want it to, your work really does not fit the domain that you are desperately trying to sell it into. Consequently, you end up using the definitions and concepts of your target domain in the loosest way possible, and redefine the domain so that it suits your approach. In the best case, you are merely making a fool of yourself in the eyes of people who really know and care about your target domain; in the worst case, you end up seeding confusion in the academic community for years to come. Arguably, the entire field of academic cloud computing research suffered tremendously from this fate in its initial years, when you could never quite tell what words meant until you had read a paper at least once to the end.
  • A variation of that is the “Lefty Loosy Righty Tighty”, where you use an overly broad, all-encompassing god definition of your domain concepts in the introduction to make sure that the reader understands how important your work is (“if we interpret a service to be everything from ordering a taxi to an OSGi component, then surely an approach to compose services correctly has to be the best thing since sliced bread?”). However, unlike with the Loose Goose, you ditch your marketing definition right after the intro is done and use a conveniently narrower definition of the term for the rest of the paper. That is, right when you have convinced the reader that they can use your work to compose their real-world trip to the airport, you start talking only about plugging OSGi services together. If you feel particularly ballsy, you can even claim that your work can be “extended trivially” to everything else that anybody has ever called a service, ideally somewhere in the conclusions.
  • “The MacGuffin Research” (based on the well-known MacGuffin TV trope) is when your work actually fits your target domain reasonably well, but it also turns out to be completely free of novelty once removed from said domain. Entire early cloud computing papers have been written and accepted where a string replacement s/cloud/grid/g would not only have resulted in a similarly coherent paper, but that paper would actually have been published at the same conference a few years prior (usually, but not necessarily, by the same authors). I fondly remember an early keynote where the speaker spent the first ten minutes explaining how cloud computing is something completely different from grid computing, only to use the two terms entirely interchangeably for the rest of his talk.

Evaluated to be troublesome.

Even if you have your definitions in order, additional common anti-patterns loom in the evaluation sections of systems manuscripts. This is because evaluating systems research in a useful way can be freakishly difficult. Previous papers by other researchers are notoriously hard to reproduce, artefact sharing is at best an emerging trend in the community, and the kind of claims typically found in systems research (“a new approach to XY”, “a framework to do Z”, …) don’t lend themselves naturally to the usual scientific method.

  • When faced with these challenges, researchers often fall back on the “Opportunistic Evaluation”, where they end up measuring not what needs to be evaluated, but whatever can be measured easily. The very first paper I ever wrote, about my master’s thesis work, falls cleanly into this category: I had proposed a dynamic Web service invocation framework whose main claim was that it is easier to use than standard systems. Ease of use is an annoying claim to evaluate, so I ended up just comparing the runtime performance of my tool to some industrial systems. One of the reviewers even commented on this discrepancy, which at the time irritated me to no end.
  • Another problem that my work above had, at least to some extent, is what I now like to call “The Baseline is an Idiot”. In such papers, an evaluation is conducted against the most stupid and trivially simple baseline that the researcher can think of. For example, let’s say you are writing a cloud scheduling algorithm to pack tasks onto VMs. You could of course re-implement some presumed-to-be-good standard algorithms from the literature and compare your work against those; but you can also define your baseline to be packing everything onto a single VM, or using a new VM every time, or another similarly braindead “standard” approach, which is not only less work, but also virtually guarantees that your approach will come out on top (a small sketch contrasting a strawman baseline with a textbook heuristic follows right after this list).
  • Finally, and this is somewhat of a late entry that I have seen pop up recently, a systems evaluation can be of the “Evaluation by Magic 8-Ball” variety. This is when your evaluation, for instance of a fancy performance optimization, conclusively shows improvements in whatever metric you want to measure in 90% of your many test cases. It’s only a little troublesome that when you rerun your experiments you get an improvement in 64% of the cases, and the next time in 79%. It’s slightly more troublesome that the one time you accidentally disabled your optimization, you still got an improvement in 69% of your test cases. This happens when researchers are so thrilled about how well their experiments are working out that they forget about A/A testing, and do not consider that a large part of the difference they are seeing may be due to confounding factors, such as their test environment shifting below their feet. Nowadays I tend to be very troubled if I don’t see an explicit statement in papers that the authors have verified that their test environment reliably reports “no difference” when nothing has actually changed (the second sketch below shows what such an A/A check can look like).
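
To make the baseline point a bit more concrete, here is a minimal Python sketch; all names, task sizes, and the VM capacity are made up for illustration. It contrasts a strawman baseline with first-fit decreasing (FFD), a textbook bin-packing heuristic that would be a far more honest point of comparison:

```python
# Hypothetical illustration of "The Baseline is an Idiot": comparing a
# scheduler only against a strawman instead of a reasonable textbook
# heuristic such as first-fit decreasing (FFD).

def one_vm_per_task(tasks, vm_capacity):
    """Strawman baseline: spin up a fresh VM for every single task."""
    return [[t] for t in tasks]

def first_fit_decreasing(tasks, vm_capacity):
    """Textbook bin-packing heuristic: a much tougher bar to clear."""
    vms = []  # each VM is represented as a list of task sizes
    for task in sorted(tasks, reverse=True):
        for vm in vms:
            if sum(vm) + task <= vm_capacity:
                vm.append(task)
                break
        else:  # no existing VM has room, start a new one
            vms.append([task])
    return vms

if __name__ == "__main__":
    tasks = [4, 8, 1, 4, 2, 1, 7, 3]  # made-up task sizes
    capacity = 10                     # made-up VM capacity
    print("strawman baseline uses", len(one_vm_per_task(tasks, capacity)), "VMs")
    print("first-fit decreasing uses", len(first_fit_decreasing(tasks, capacity)), "VMs")
```

Against the strawman (8 VMs in this toy example), almost any scheduler looks great; against FFD (3 VMs), your contribution actually has to earn its keep.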
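
And here is an equally minimal sketch of the A/A check I mean, again with everything invented purely for illustration (run_benchmark, the fake latency numbers, the “lower mean wins” criterion): measure the unchanged system twice and see how often noise alone already looks like a win.

```python
# A/A sanity check sketch: the "treatment" is the identical, unchanged
# system, only measured a second time. Any "improvements" observed here
# are pure measurement noise.

import random
import statistics

def run_benchmark(test_case, run_id):
    """Stand-in for a real measurement: returns 30 noisy fake latencies."""
    rng = random.Random(test_case * 1000 + run_id)
    return [100 + rng.gauss(0, 5) for _ in range(30)]

def looks_improved(baseline, treatment):
    """Naive criterion many papers implicitly use: a lower mean latency."""
    return statistics.mean(treatment) < statistics.mean(baseline)

if __name__ == "__main__":
    false_wins = sum(
        looks_improved(run_benchmark(tc, run_id=1), run_benchmark(tc, run_id=2))
        for tc in range(100)
    )
    print(f"'improved' in {false_wins}/100 test cases without changing anything")
```

If this prints something far from zero (with the naive lower-mean criterion it will land around 50 out of 100 simply by symmetry), then the “90% of test cases improved” headline from the real experiment deserves a lot more scrutiny.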

I can assume whatever I want.

The last anti-pattern, the “Proof by Weak Reference”, is somewhat specific, but I know of at least one ICSE paper from a few years back which applied it quite fabulously. Say you have a really nice research paper, but there is one gaping hole in your story: one of your assumptions appears to be wildly unrealistic. You suspect that it is actually a quite fundamental flaw, one that cannot be fixed easily. At a weaker conference, you might get away with every researcher’s favorite cop-out and claim that this aspect is “out of scope”. However, you really want to submit to a top conference, and you suspect the reviewers may call you out if you don’t address this somehow. So you write a second, earlier paper at a non-selective venue, say a small workshop, where you propose a token approach to (attempt to) solve this issue. It’s important that this paper has a somewhat grandiose title that sounds like you actually solved the issue therein. In your actual top-level paper you then point to the earlier work and claim that the suspicious assumption is “addressed thoroughly in [XXX]”. You get bonus points if, from your phrasing and reference style, it is not immediately evident that [XXX] is actually your own work. Even more bonus points if it is not very clear that [XXX] appeared at a non-selective venue. This approach banks on the reviewers not checking your references carefully, which tends to be a reasonable gamble.

Just like the original 9 circles of scientific hell, these 7 anti-patterns of academic systems research are neither particularly rare nor necessarily used in bad faith. Even in the case I described above for the “Proof by Weak Reference”, I am actually fairly certain that the authors acted opportunistically rather than purposefully. I also want to reiterate that I have definitely done a lot of this myself over the years. But of course it is never too late to start doing better, and listing these patterns explicitly is a good way to keep them in mind for future reference.

What are your pet peeves when it comes to systems papers? Have I forgotten something annoying that you keep seeing? Let me know on Twitter or here in the comments.
