CLASSIEfier: Using machine learning to paint a picture of social sector trends
By Paola Oliva-Altamirano, Data Scientist, Our Community
SUMMARY: Tracking the flow of funding and other support to social sector organisations in Australia has historically been difficult because of inconsistencies in categorisation, or the absence of categorisation entirely. Our Community developed CLASSIE to serve as a universal classification system for Australian social sector initiatives and entities. We are now developing an algorithm to reduce or remove the need for manual (human) classification. Once released, CLASSIEfier will allow us to classify historical records on behalf of grantmakers and other social sector supporters, and reduce the need for human intervention in classification of current and future records.
In 2016, Our Community launched the Classification system for Social Sector Initiatives and Entities (CLASSIE). This taxonomy – based on the US-based Foundation Center’s Philanthropy Classification System – provides a tool for classifying information related to Australia’s social sector in a standardised way.
Roughly $80 billion is given away in grants every year in Australia (Grants in Australia 2018 research study), but we are yet to establish a clear overall picture of the flow of money by sector, location and beneficiary. Using CLASSIE in the Australian grantmaking environment is enabling us to start filling in the blanks.
Soon after CLASSIE was developed, its "subject" and "population" sections were incorporated into Our Community’s grants administration platform, SmartyGrants. This allows grantseekers to select the subjects and populations of their grant applications, or grantmakers to do so on their behalf.
Currently, 15% of SmartyGrants grantmakers use CLASSIE, and they have generated around 6000 classified grant applications to date. While we are confident grantmakers’ use of CLASSIE will continue to expand, our current system relies on users to classify their own grants, and we know from experience that manual classification – using humans to read and classify each application – is time-consuming.
Furthermore, current and future grant applications represent only a fraction of the data we would like to classify, given that SmartyGrants holds almost 400,000 historical grant application records.
Against this background, CLASSIEfier was born. CLASSIEfier is an algorithm designed to automatically apply the CLASSIE taxonomy to grant applications, or indeed to any relevant social sector data.
CLASSIEfier: How does it work?
We are building a supervised machine learning algorithm that will read a grant application – or any text related to the social sector – and predict the main subjects and populations involved.
First, the machine learning algorithm must be trained with a large set of already labelled applications to enable it to learn the representative writing patterns and vocabulary of each CLASSIE label.
The figure below shows an example using grant applications in the health sector.
CLASSIEfier stages 1 and 2: Keyword matching
The initial challenge we encountered was finding enough “labelled” applications to train the CLASSIEfier algorithm. The existing number of classified records in SmartyGrants (6000 in total, featuring 900 CLASSIE labels) was not large enough to train the algorithm effectively.
Our testing has shown that at least 2000 applications per CLASSIE label are needed to generate good results. It is impractical to reach those kinds of targets by manually classifying the data. Thus we developed a keyword-matching algorithm to automatically extract applications containing keywords related to particular CLASSIE labels.
For example, if the subject of an application is “health”, we also expect to find keywords such as “health”, “hospital”, “disease”, “nurse”, “doctor” and “clinic”. Thus, the first stage of the CLASSIEfier project was to select keywords, and the second stage involved extracting applications that exhibited a strong keyword match.
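The two keyword stages can be sketched in a few lines of Python. This is only an illustration, not the CLASSIEfier code: the keywords come from the example above, while the function names and the rule that two distinct keywords make a "strong" match are assumptions.

```python
import re

# Keywords for "Health" are taken from the article's example.
# The threshold below is an assumed definition of a "strong" match.
KEYWORDS = {
    "Health": {"health", "hospital", "disease", "nurse", "doctor", "clinic"},
}
MIN_DISTINCT_MATCHES = 2


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def keyword_labels(text, keywords=KEYWORDS, min_matches=MIN_DISTINCT_MATCHES):
    """Return labels whose keyword sets share enough distinct words with the text."""
    tokens = set(tokenize(text))
    return sorted(
        label
        for label, words in keywords.items()
        if len(tokens & words) >= min_matches
    )


print(keyword_labels("Funding for a nurse-led clinic at the regional hospital"))
print(keyword_labels("New uniforms for the netball club"))
```

Because this sketch matches whole words exactly, a plural such as "hospitals" would not count; a real keyword list would need to handle word variants.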
With the keyword-matching algorithm we classified 128,000 applications with 80% accuracy. We then used these classified applications as a training dataset to feed the machine learning algorithm.
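To make the idea of training on keyword-labelled applications concrete, here is a minimal sketch. The article does not name the learning algorithm, so a small multinomial Naive Bayes classifier is used purely as a stand-in. The class name and the four toy applications are invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesText:
    """Tiny multinomial Naive Bayes with add-one smoothing (stand-in model)."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter(labels)        # label -> number of documents
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        # Words never seen in training carry no information, so drop them.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            score += sum(math.log((counts[t] + 1) / denom) for t in tokens)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training set standing in for keyword-labelled applications (invented).
train_texts = [
    "new beds for the hospital children's ward",
    "nurse training at the community clinic",
    "goal posts and uniforms for the football club",
    "resurfacing the netball courts",
]
train_labels = ["Health", "Health", "Sport and recreation", "Sport and recreation"]

model = NaiveBayesText().fit(train_texts, train_labels)
print(model.predict("a mobile clinic staffed by a nurse"))
print(model.predict("uniforms for the netball team"))
```

Since the training labels themselves come from keyword matching, any errors in that step are passed on to whatever model is trained on them.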
While developing CLASSIEfier, we concluded that it is not feasible to classify natural-language text with 100% accuracy. We found many cases where keyword matches led to a wrong classification. For example, an application containing the words “church”, “religious” and “Christian” would be classified under “Religion” even if the application concerned a fete at a Catholic school.
We will explore this issue further in a future article.
Stage 3: Testing machine learning
CLASSIE comprises a hierarchical taxonomy, where many categories themselves have “child” categories.
[CAPTION: This is a simplified view of how CLASSIE subjects are structured, with the actual taxonomy including many more categories.]
Consider a grant application aimed at helping teenagers with autism. This application will have the following classifications:
- “Health” at level 1
- “Diseases and conditions” at level 2
- “Brain and nervous system disorders” at level 3
- “Autism” at level 4
In classifying this application, the grantmaker or grantseeker may select the level 4 category “Autism”; doing so will automatically nest the application in the corresponding classification at higher levels (“Brain and nervous system disorders”; “Diseases and conditions”; “Health”).
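The automatic nesting described above amounts to walking up the taxonomy from the selected label to its level 1 ancestor. A minimal sketch, assuming the taxonomy is held as a child-to-parent lookup (the dictionary layout and function name are illustrative, not the actual CLASSIE data format):

```python
# Child -> parent links for the autism example in the article.
# The dict layout is an assumption, not the real CLASSIE storage format.
PARENT = {
    "Autism": "Brain and nervous system disorders",
    "Brain and nervous system disorders": "Diseases and conditions",
    "Diseases and conditions": "Health",
    "Health": None,
}


def label_path(label, parent=PARENT):
    """Return the labels from level 1 down to the given label."""
    path = []
    while label is not None:
        path.append(label)
        label = parent[label]
    return list(reversed(path))


for level, name in enumerate(label_path("Autism"), start=1):
    print(f"level {level}: {name}")
```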
This application will have the following beneficiary classifications:
- “Children and youth (age 0-17)” at level 1
- “Adolescents (people aged 13-17)” at level 2
And also, perhaps:
- “People with disabilities” at level 1
- “People with intellectual disabilities” at level 2
As this example shows, most grant applications can be categorised by more than one label, which of course increases the complexity of training our machine learning algorithm.
To overcome this challenge, we are training CLASSIEfier at separate levels. At level 1, CLASSIE contains 18 subjects (e.g. health; art and culture; sport and recreation) and 56 beneficiaries (e.g. children; students; indigenous people; people with disabilities).
At level 1, applications are quite distinctive, making it easier for the algorithm to identify writing patterns and vocabularies. In addition, level 1 is the most populated level among our already classified applications, so we have more training data available. For most level 1 categories we already have the required minimum of 2000 grant applications.
Dealing with 'black holes' and niche classifications
As we progress to other levels, the “granularity” of the classification increases. This makes it harder to achieve the minimum 2000 records needed to train the machine learning algorithm. For example, our keyword matching algorithm found just 200 applications with the label “People of North American descent”.
There are a number of niche subjects and populations that are not supported by the model (e.g. widows and widowers, Confucian groups, interfaith groups, domestic workers). We call these labels – those with few or no applications, meaning the algorithm cannot be trained to recognise them – “black holes”.
We are aware that the existence of black holes in our data model could result in further marginalisation of non-mainstream subjects and populations. In an effort to address this, we are using the previously described keyword matching algorithm to find grant applications that would fall under these categories.
Still, we expect not to be able to eliminate black holes entirely, at least at first. Therefore, each release of CLASSIEfier will include notes that identify the black holes in the data, in the interests of transparency about where the model falls short.
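A black-hole report of this kind could be produced by counting training examples per label and listing those that fall short. The sketch below is hypothetical: it reuses the 2000-application minimum mentioned earlier as the cut-off, which is an assumption about where the line would be drawn, and the counts in the example are illustrative apart from the 200 applications cited above.

```python
from collections import Counter

MIN_EXAMPLES = 2000  # minimum per label reported in the article; cut-off is assumed


def find_black_holes(all_labels, training_labels, min_examples=MIN_EXAMPLES):
    """Return {label: count} for taxonomy labels with too few training examples."""
    counts = Counter(training_labels)  # missing labels count as 0
    return {
        label: counts[label]
        for label in all_labels
        if counts[label] < min_examples
    }


# Illustrative inputs: one well-populated label, one sparse, one empty.
taxonomy = ["Health", "People of North American descent", "Widows and widowers"]
labelled = ["Health"] * 2000 + ["People of North American descent"] * 200

for label, count in find_black_holes(taxonomy, labelled).items():
    print(f"{label}: {count} applications")
```

Checking against the full list of taxonomy labels, rather than only the labels seen in the data, is what lets the report include categories with no applications at all.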
At this stage, we are focused on classifying applications up to level 2. Our ability to classify to levels 3 and 4 will depend on how much data we can collect from CLASSIE users in future years, and how well keyword matching performs in practice.
One CLASSIEfier, multiple uses
Once CLASSIEfier is tested and implemented, it will be able to classify almost any text relating to the social sector. We will offer this tool to grantmakers and other social sector supporters who wish to understand more about their own funding and support patterns, and to those who wish to learn about, and help map, universal trends.
The tool can be used to classify data not only within the SmartyGrants system but also across other enterprises, including GiveNow (Our Community’s donations platform), Funding Centre (our grantseeking database), and Good Jobs (our jobs search platform). External uses may be found for the tool too.
This will let us further standardise how information is managed, revealing trends and enabling comparisons within a specific account or domain, as well as within and across sectors.
CLASSIEfier is the first of many artificial intelligence initiatives that Our Community is pursuing. We hope to further develop our capacity in this area with the February 2019 opening of OC House, where we will be working with not-for-profits to build their own capacity to work with data.