The people who build software in the land of dataspeak

If data are random heaps of givens that have no relevance in themselves, how do we make sense of the chaos?

From the keyboard of CEO JORIS VAN EIJNATTEN, Bits & Bytes is a thought leadership series which explores relevant or intriguing topics in the world of digital research. From software and digital humanities to current trends in academia and more, join us as Joris explores — and explains. Feedback or something you’d like to see addressed? Start a conversation by emailing us.

Dataspeak. In current academia it’s overwhelming. For research software engineers, it can often be the bane of their work, leading to many misunderstandings. When we listen through our own filters – or not at all – distorted views can become mainstream. But there are other ways of framing the digital in research.

Don’t get me wrong. Data are important. Data are building blocks of information, and we live in an information society. Without data we’d be helpless. Without data we wouldn’t be able to pay taxes, order from web shops, plan houses or build highways. We wouldn’t be able to address climate crises or resolve pandemics.

Yet in the end, data are random heaps of givens that have no relevance in themselves. Data are a series of essentially pointless bits and bytes located in the space-time continuum of some digital repository. They are a means to potentially reduce uncertainty. We can’t change the world on a promise, however, so why all the fuss about data? Why are data all we hear about? What are we missing here?

Introducing RUMP
We live in a land of dataspeak. Governments cumulatively invest billions in ‘data repositories’ and ‘data infrastructures’ for research. They have supported the development of a code of conduct for data. The code suggests that data be stored in such a way that users can retrieve them with relative ease. The underlying principle is that science should be open. That is one reason why politicians and policy makers set great store by FAIR. FAIR is about making data Findable, Accessible, Interoperable and Reusable. It’s a moral code all researchers are expected to follow.

Some researchers follow the code, though not always with much enthusiasm. Their reluctance stems in part from their dislike of having yet another requirement to fulfil. But it also follows from the naïve assumption implicit in dataspeak that FAIR data are the be-all and end-all of research. Naturally, this reluctance does not make FAIR data less plausible and important. It goes without saying that the bits and bytes representing data should be findable. Data must also be accessible, of course. If possible, data should be interconnected. As for recycling data, by all means! Research data are a public commodity and anyone should be able to use or reuse them anywhere, anytime.

Yet, after everything has been said and done, once the data are out there and have been made suitably FAIR, they will still have no significance in themselves. They will still be a motley collection of pointless bits and bytes stockpiled somewhere on an anonymous server farm. They will be RUMP: Random, Unintelligible, Meaningless and Purposeless. They’ll be FAIR, and that’s great. But they’ll also be RUMP.

Dataspeak is all about acronyms, which is why I am introducing RUMP. RUMP is as adequate a summary as any of the core meaning of FAIR data. But the question remains: if data as such have no meaning, why all the fuss? What are we missing?

It’s the clever part. What we’re missing are intelligent means to de-RUMP research data, to make them Consequential, Useful, Significant and Purposeful (that would be CUSP, but let’s not overdo the acronyms). This is where research software enters into the equation. Research software and, more particularly, the people who build and use it, are crucial to de-RUMPing data, regardless of whether those data are open or FAIR or anything else.

On research software
Without data we’d be helpless. We all agree on that. But without software all the exa-, zetta- and yottabytes of all the research data in all the world would be utterly useless. Without software we might just as well return to the pre-digital world, when technology boiled down to integrating the analogue with the mechanical. Data were written on paper, and they were just as RUMP back then as they are now. The only software required was embedded in the human brain. People took the brain for granted.

Now, however, we humans have partly outsourced our brains to computers. Software is the stuff that instructs computers to make sense of data. It’s all very well to have fancy machines that operate at a hundred quadrillion flops, and huge loads of research data that fit ethical standards. Without the software to make them run, investing time and energy in hardware and data would be a very, very expensive waste of space and time. All hardware and all data would be completely superfluous.

Unfortunately, politicians and policy makers have long been fixated on either hardware or data or both. It isn’t difficult to see why. Hardware appeals to group sentiments in male-dominated cultures. Like shiny cars and hi-fi equipment, high performance computing often suffers from the “mine is bigger, better or faster than yours” syndrome. More recently, the powers that be have been single-mindedly preoccupied with research data. That fits in well with present-day politics, where privacy, security, inclusivity, diversity and transparency are at the forefront of things.

No wonder that data is an easy score, and that dataspeak has become rampant. Meanwhile, regrettably, software, like the brain, is taken for granted. And so are the people who build it. 

Nomenclature in dataspeak
Nobody wants to be taken for granted. It almost seems as if the people who actually ensure that research data are useful in the first place (that is, the people who de-RUMP data by building research software) don’t exist. The lack of clarity of their position is reflected in the names used to identify them, if officials and bureaucrats take the trouble to identify them at all.

Admittedly, not recognising the people who build research software at all is still better than branding them with classifiers gleaned from dataspeak. There is one thing you should know about dataspeak: ‘data’ is central to the lingo.

Dataspeak often refers to an entity known as the ‘data steward’. This is best defined as a person who ensures that RUMP data are FAIR. The data steward comes close to another entity, the ‘data manager’. This is someone who, well, manages data. Both data stewards and data managers are support professionals for research. They sometimes come under other names than steward and manager. They are undeniably important additions to the research landscape.

My point is that people who build research software, the data de-RUMPers, do not belong to this group. They are not (and I repeat: not) support professionals. Instead, they are very closely allied to a better-known entity called the ‘researcher’, who existed long before dataspeak emerged in dataland. The people who build research software are academic researchers who devise digital methods and write computer code. Because of the specificity of their activities they come under different names, and this is where, in the era of dataspeak, things can become confusing.

In dataspeak’s administrative officialese, the people who build research software are sometimes called data stewards. This is patently absurd. It’s like mistaking the architect for the bricklayer. More often, though, relatively old-fashioned names are employed, reminiscent of IBM mainframes and punch cards, or – if you will – personal computers and SPSS. The result is that ‘data analyst’ and ‘data scientist’ are still used to denote advanced research personnel, who now do (or should be able to do) much more than merely hit a button on a machine they don’t understand. Since everything needs to be forced into the mould of dataspeak, all nomenclature that includes the word ‘data’ is anxiously preserved.

This is where the term ‘RSE’ comes in. Heard of it? Probably not: after all, it isn’t dataspeak. And yet RSEs are the people on whom de-RUMPing data hinges. They are the people who not only develop digital methods and write code but design and engineer them to boot. Without RSEs, twenty-first-century digital academia just wouldn’t work.

The pros and cons of RSE
Few people would recognise the term ‘RSE’, even if it were spelt out in full as ‘Research Software Engineer’. Perhaps that isn’t surprising. For one, dataspeak prevails. Also, the term is a foreign invention, not just to the Dutch but to the larger part of the globe. Like most neologisms in the world today, this one too has Anglophone origins, which means that it’s basically untranslatable. ‘Research Software Engineer’ doesn’t lend itself to countries attempting to centralise their language policy. Think France, for instance.

RSE isn’t the most felicitous combination of words. The third element in the triad is most problematic. For historical reasons, the word ‘engineer’ suggests a certain status in the academic hierarchy. An engineer is often identified as someone who routinely applies the scientific innovations thought up by ‘real’ academics. From this perspective, software boils down to writing lines of code, and ‘software engineers’ to code monkeys who serve scientists. ‘Research software’ becomes a means to an end, and it is the end, and only the end, that counts.

You may remember Sheldon Lee Cooper, Ph.D., the nerdy theoretical physicist from the TV series The Big Bang Theory. Nobody made this particular point with less subtlety than he. “So!”, he explained in one of the many episodes, “This is engineering, huh? Engineering! Where the noble, semi-skilled laborers execute the vision of those who think and dream. Hello, Oompa Loompas of science!”

Those who oppose the effects of Anglophone globalism may well come up with their own version of RSE. For the Dutch, however, objecting to the term is a rear-guard action. ‘RSE’ is probably not going away. There really is no alternative.

Methodologists or engineers?
What would an alternative look like, anyway? Not so long ago, someone suggested ‘Research Methods Innovation Specialist’ to me. It’s spot on, but a tongue twister, too long to administer and much too complex to understand. Yet each individual word in this mouthful is closer to the essence of the people who build research software than ‘Research Software Engineer’. Shorten the mouthful to ‘Digital Methodologist’? That’s better, but doesn’t really help. Nobody wants to be known as a methodologist.

Fortunately, there is some merit in ‘RSE’. The Dutch, who tend not to be averse to exotic if invasive loanwords, already employ the term ingenieur. As an academic title, ir. or ing. is bestowed on engineers as the equivalent of MSc. This species of academic degree is used in high-level engineering faculties. Ingenieurs are those who currently devise the next green revolution in agriculture, or work on ways to implement nuclear fusion. They are full-blooded academics who do applied research.

The truth is that we need a generic term for the people who build research software, and taken separately, each of the constituent words in RSE does seem to work.[1] RSEs are:

  1. academic researchers who do applied rather than fundamental research, and many have PhDs to prove it;
  2. methodological specialists who build or adapt software that makes sense of research data;
  3. engineering experts who design and develop research software that is robust, reliable, reusable and sustainable.

The problem is not so much that the term RSE isn’t common usage yet. It is the general lack of recognition we bestow on researchers who identify with what the term RSE stands for. Like ingenieurs, RSEs are university-level specialists who do research into digital technologies and methodologies. They are the immediate equivalents of postdocs, assistant and associate professors, and sometimes top-level technicians who populate most universities.

Beyond dataspeak
Let me come back to the question I posed at the beginning of this blog. What will we be missing if people continue to converse in dataspeak? We will miss out on software, and on the people who build it. If not missing out on software requires debunking dataspeak, please, let’s do just that. And if it entails investing lots of euros in the means to de-RUMP data, please, let’s do that too.

FAIR data are meaningless unless software de-RUMPs them. Software is the clever part of data. Software is knowledge, and we live in a knowledge society. Without software we’d be helpless. Without software we wouldn’t be able to pay taxes, order from web shops, plan houses or build highways. We wouldn’t be able to address climate crises or resolve pandemics.

The people who build that software are called research software engineers. They are the Willy Wonkas of knowledge, working at the juncture of computers and science; they are in-depth researchers and methodological experts rolled into one. And it is high time to recognise them as such.

This blog originated in a dialogue with our Programme Director Frank Seinstra. I would like to thank him and colleagues Lieke de Boer, Maaike de Jong, Rob van Nieuwpoort and Tom Bakker for their input, remarks and patience.

[1] If you don’t believe me, try something Very Official like the OECD: “A growing number of people in academia”, they claim, “combine expertise in programming with an intricate understanding of research.” The OECD refers to these people as research software engineers or RSEs, and the OECD should know.