The Orange3 data mining platform for the Social Sciences and Humanities
Orange is an open-source, Python-based data mining application with a graphical user interface (GUI). It allows people with no knowledge of programming to apply machine learning as well as advanced data processing, analysis, and visualization using a point-and-click, drag-and-drop interface. Those who have seen Scratch will be familiar with this mode of interaction, although Orange is more advanced and is used for data mining rather than educational programming. The following screenshot shows what a typical Orange workflow looks like.
The three small circles labeled ‘File’, ‘Data Table’, and ‘Scatter Plot’ in the image above are called Orange widgets. They are the basic building blocks of a data analysis workflow (or pipeline) in Orange. Each widget is a software unit that performs some sort of data processing, analysis, or visualization and potentially has a set of inputs and outputs. You can connect the outputs of some widgets to the inputs of others to create a chain of processing and analysis operations. A great feature of Orange is its extensibility. There are loads of custom plugins for Orange (called add-ons) developed by researchers and the Orange user community. Add-ons usually comprise multiple widgets. Two mature and extensively used Orange add-ons are Orange-Text and Network, which are used for text mining and network analysis, respectively.
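To make the idea of chaining widgets concrete, here is a rough conceptual sketch in plain Python. It does not use Orange's actual API: the `Widget` class, its `connect` and `receive` methods, and the three toy processing steps are all hypothetical names invented for illustration, and it assumes the File, Data Table, and Scatter Plot widgets are linked in a simple linear chain.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class Widget:
    """Hypothetical stand-in for an Orange widget: one processing step."""

    name: str
    process: Callable[[Any], Any]
    downstream: List["Widget"] = field(default_factory=list)
    last_output: Any = None

    def connect(self, other: "Widget") -> "Widget":
        """Link this widget's output to another widget's input."""
        self.downstream.append(other)
        return other  # returning the target allows chained connect() calls

    def receive(self, data: Any) -> None:
        """Process incoming data, then push the result to connected widgets."""
        self.last_output = self.process(data)
        for widget in self.downstream:
            widget.receive(self.last_output)


# Three toy widgets loosely mirroring a File -> Data Table -> Scatter Plot chain.
file_widget = Widget("File", lambda rows: [r for r in rows if r is not None])
table_widget = Widget("Data Table", lambda rows: sorted(rows))
plot_widget = Widget("Scatter Plot", lambda rows: list(enumerate(rows)))

file_widget.connect(table_widget).connect(plot_widget)
file_widget.receive([3, None, 1, 2])

print(plot_widget.last_output)  # [(0, 1), (1, 2), (2, 3)]
```

In Orange itself you draw these connections on the canvas with the mouse; the point of the sketch is only that each widget does one job and hands its result to whatever is connected downstream.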
Story Navigator: a custom Orange add-on I am developing
I am currently developing an add-on as part of a project with researchers at the University of Twente in the Netherlands. The add-on is called the Story Navigator. The aim of the Story Navigator is to enable computer-aided analysis of written stories for students and researchers who study narrative psychology. The add-on has several widgets to analyse different aspects of a story. It uses a combination of natural language processing, text mining, and general data analysis techniques to implement theories in narrative psychology, such as Burke’s pentad. Here’s a screenshot:
The add-on works by first importing textual stories into the workflow using the existing “Import Documents” widget from the Orange-Text add-on. Thereafter the researcher can choose a widget from the add-on for analysing some aspects of the imported stories. The actor analysis widget, for example, highlights potential characters in the stories; shows which kinds of actions (verbs) the characters are associated with; and calculates different measures for how central specific characters are to the stories.
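To give a feel for what a "how central is this character" measure could look like, here is a deliberately naive sketch. It is not the Story Navigator's method (which relies on natural language processing): the function `mention_share`, the example story, and the list of candidate names are all made up for illustration, and the measure is simply the fraction of sentences that mention each name.

```python
import re
from collections import Counter


def mention_share(story: str, candidates: list[str]) -> dict[str, float]:
    """Return, per candidate name, the fraction of sentences that mention it."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", story.strip()) if s]
    if not sentences:
        return {name: 0.0 for name in candidates}

    counts = Counter()
    for sentence in sentences:
        words = set(re.findall(r"[A-Za-z']+", sentence))
        for name in candidates:
            if name in words:
                counts[name] += 1
    return {name: counts[name] / len(sentences) for name in candidates}


story = (
    "Anna found a letter. She showed it to Ben. "
    "Anna and Ben read it together. Anna wept."
)
scores = mention_share(story, ["Anna", "Ben"])
print(scores)  # {'Anna': 0.75, 'Ben': 0.5}
```

Note that the toy version misses the pronoun "She" in the second sentence, which is exactly the kind of gap that proper NLP techniques are needed to close.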
While it did take some time to develop the widgets for this add-on, what surprised me is that it took me only about two hours, from scratch and with no prior knowledge of Orange, to develop my first widget that showed up in the Orange3 interface when I started it up. Granted, the widget did nothing except appear in the interface. But the return on my two-hour investment was that I could now free up my creativity to develop whatever functionality I wanted in the widget, without having to create my own GUI. In doing so, I got to take advantage of a myriad of benefits, including being able to:
- Make my add-on available for anyone to use regardless of their programming knowledge
- Receive development and usage feedback from an existing Orange community about my add-on (without having to further promote the add-on or build another community myself from scratch)
- Demo my plugin to anyone, thanks to the GUI
- Share data analyses made with my add-on with other Orange users, which supports reproducibility (the subject of an ongoing crisis in science)
- Forget about writing installation instructions — they’re already here!
Despite what some readers might think at this point, I am not an Orange evangelist. In fact, I am not swooning over the capabilities and potential of Orange specifically. Orange does have competing platforms that try to do similar things, such as KNIME and WEKA, which are both Java-based and therefore inherit some concerns due to Java’s declining popularity. Orange itself also has some clear disadvantages, not least of which are its scalability issues and its add-on development documentation (the latter is not very accessible for less experienced developers).
But the point I am trying to make is that I am mostly just inspired by the philosophy behind platforms like these: people coming together to develop and share an extensible “workbench-style” software platform with a user-friendly interface. When I refer to software I don’t mean software that has a highly niche purpose, but software that aims to be an ever-evolving toolbox with the potential to address a multitude of research problems. Where your own creativity is the only limit in terms of contributing to it. Where the knowledge barriers to be able to contribute to it are super low, and where there are no qualifying criteria for those who can experience and benefit from the contributions.
Orange was developed at the Laboratory for Bioinformatics at the University of Ljubljana. It makes sense then that most of its user base uses it for data mining in bioinformatics (see for instance this study using Orange for early diagnosis of diabetes). But Orange itself is not at all designed to be domain-specific. Its add-ons and widgets are designed for domain-agnostic data mining (here’s a study using Orange to predict student performance in higher education). This raises the question: why aren’t more researchers in the Social Sciences and Humanities using it, extending it, or building similar platforms?
What the popular video game Elder Scrolls has to do with Orange
Bethesda Softworks is a US-based video game development company that is famous for its many franchises, one of which is “Elder Scrolls”. The games in this franchise are “RPGs” that put the player in control of a character from a fictional fantasy world with the goal to… wait for it… save the world from destruction (if you’re new to the term RPGs imagine a video game based on Game of Thrones and you’ll get the gist). In 2002, Bethesda released its third installment of Elder Scrolls, called “Morrowind”. The super interesting part is that they also released software called the Elder Scrolls Construction Set alongside the game. This software allowed players to make customizations to their own copy of the game. They could do things such as create new clothing, buildings, and other items; create their own character voices; make new quests, and customize the color of the sky. These modifications (mods as they are now commonly referred to) vary in terms of how much software literacy is required in order to build them, but many of them require no more knowledge than being able to install a program and click a few buttons.
Twenty years later, Bethesda has become known and celebrated for giving its customers the freedom to customize their gaming experience. Nexus Mods is one of the prominent hubs for hosting large collections of player-created mods for games by Bethesda and other major video game companies. I quote some jaw-dropping statistics from this site here:
“We host 477,145 mods for 2,342 games from 118,005 authors serving 41,902,274 members with 8,184,840,124 downloads to date.”
In academia, terms like “outputs”, “impact”, “community building” and “community engagement” are often bandied about as performance goals and metrics for work. However, despite their purported importance, these terms are hard to define and measure. But when I look at those numbers, I feel like we don’t need unambiguous definitions for those terms in order to verify that the mod concept ticks all those boxes and then some.
“Okay Kody, what does this all have to do with Orange?”, you may ask. Well, in some ways there are parallels in this story to the concept of Orange as a software platform.
First, there is a product that many people enjoy (or find useful). In Bethesda’s case, it is the “vanilla” Elder Scrolls games. In Orange’s case, it is a data mining platform which is a generic tool that is useful regardless of the research domain and data. The GUI aspect of Orange also opens it up to a much larger user base (not just a small group of researchers with development experience working in a niche field).
Second, there is the enabling of customization. In Bethesda’s case, it is the provision of customization software such as the Skyrim Creation Kit; in the case of Orange, it is the release of the code under an open-source license with reusable example code and plugin-writing tutorials.
Third, and what I find to be most critical, there is making it easy to customize. In Bethesda’s case, they have made user-friendly software that does not require extremely specialized skills to use. Almost anyone can create a mod. In Orange’s case, you still need to know how to program to develop add-ons. But you don’t need advanced knowledge of the GUI frameworks that Orange uses (Qt and PyQt). In other words, if you know how to create any kind of computational script or software in Python, you can easily create an Orange widget to “house” or “wrap” its functionality. Furthermore, others, regardless of their technical nous, will be able to use it and “chain” your widget with other software in an Orange workflow.
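As a rough sketch of what "wrapping" existing Python functionality in a widget can look like, consider the example below. The function `keep_first_rows` and the widget class `OWKeepFirstRows` are hypothetical names made up for this illustration; the `OWWidget`, `Input`, and `Output` classes come from Orange's widget library. The Orange import is guarded on purpose, an assumption of this sketch rather than a requirement of Orange, so that the plain function remains usable even where Orange is not installed.

```python
def keep_first_rows(rows, n=5):
    """Plain Python logic with no GUI dependency: keep the first n rows."""
    return rows[:n]


try:
    # Orange is optional here, so the function above also works without it.
    from Orange.data import Table
    from Orange.widgets.widget import OWWidget, Input, Output
except ImportError:
    OWWidget = None

if OWWidget is not None:

    class OWKeepFirstRows(OWWidget):
        """Hypothetical widget that wraps keep_first_rows()."""

        name = "Keep First Rows"
        description = "Passes on only the first five rows of the input data."
        want_main_area = False

        class Inputs:
            data = Input("Data", Table)

        class Outputs:
            data = Output("Data", Table)

        @Inputs.data
        def set_data(self, data):
            # Called by Orange whenever new data arrives on the input.
            result = keep_first_rows(data) if data is not None else None
            self.Outputs.data.send(result)


print(keep_first_rows(list(range(10)), n=3))  # [0, 1, 2]
```

For the widget to actually show up in Orange, the package also has to be registered as an add-on; Orange's widget development tutorial covers that step. Keeping the logic in a plain function and the widget as a thin shell around it is one way to keep the GUI and the analysis code separate.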
What I have experienced, and want to emphasize, is that plugging your software into existing platforms like Orange (especially ones that have a nice GUI) can increase its visibility and usage compared with publishing it solely in a code repository. Decently designed GUIs help users to more quickly understand what your software does, which goes a long way to maximizing its value. Platforms like Orange provide a mechanism by which developers can take advantage of these GUI benefits “for free” (well, not really for free, but with much less time investment than building and designing GUIs from scratch). It frees them up to use their creativity to solve research problems through widget development; it exposes their software to a much wider audience; and it allows users to interact with their software in a more intuitive way.
In short, platforms like Orange can be extremely useful both for researchers who want to quickly apply computational analyses in the SSH fields (without writing code) and for developers who want to increase the usability, visibility, and sustainability of their software. While it does, to some extent, tie the success of your software to the success of Orange, you are not “putting all your eggs in one basket”. Due to the modularity and already-in-place Python package structure of Orange add-ons and widgets, you can very easily strip the GUI out of your code and release your work as a stand-alone package, or migrate it to another platform.
Take home messages
- Orange is an open-source extensible data mining platform that is usable by people with little to no technical experience. If you are a researcher in SSH consider using it for your research. If it doesn’t have a feature you need, ask the developers if they can add it as a default feature to Orange. If they can’t add it, and if you or a colleague has Python experience, consider creating a custom add-on. It is surprisingly quick and easy to create one if you know a bit of Python.
- Research software sustainability is a challenge. Building and maintaining communities around the development of research software is hard. Don’t let your software become JAG (Just-Another-GitHub repo). Take inspiration from platforms like Orange for how to avoid this. This does not mean that you need to create a graphical user interface for your software. It means: find a home for your software or find an existing toolbox in which to integrate your software, before creating your own toolbox. Integrating into existing platforms or packages will probably increase its usage and visibility compared with publishing it as a stand-alone item.
- Inspiration point #1: From the start, consider building software that solves generic problems in your field, i.e., problems that are shared by others in the field as well, rather than highly specialized use cases.
- Inspiration point #2: Make your software easy to use by as many types of people as possible, not just experienced developers. Also, ensure that someone can rapidly understand what your software does.
- Inspiration point #3: Make it easy for as many people as possible to customize and extend your software.
Dr. Kody Moodley
Senior Research Software Engineer