CLASSIEfier: Using auto-classification to paint a picture of social sector trends
By Paola Oliva-Altamirano, Data Scientist, Our Community
SUMMARY: Tracking the flow of funding and other support to social sector organisations in Australia has historically been difficult because of inconsistencies in categorisation, or the absence of categorisation entirely.
Our Community developed CLASSIE to serve as a universal classification system for Australian social sector initiatives and entities.
We are now developing an algorithm to reduce or remove the need for manual (human) classification. Once released, CLASSIEfier will allow us to classify historical records on behalf of grantmakers and other social sector supporters, and reduce the need for human intervention in classification of current and future records.
In 2016, Our Community launched the Classification system for Social Sector Initiatives and Entities (CLASSIE). This taxonomy – modelled on the US-based Foundation Center’s Philanthropy Classification System – provides a tool for classifying information related to Australia’s social sector in a standardised way.
Roughly $80 billion is given away in grants every year in Australia (Grants in Australia 2018 research study), but we are yet to establish a clear overall picture of the flow of money by sector, location and beneficiary. Using CLASSIE in the Australian grantmaking environment is enabling us to start filling in the blanks.
Soon after CLASSIE was developed, its "subject" and "population" sections were incorporated into Our Community’s grants administration platform, SmartyGrants, enabling grantseekers to select the subjects and populations of their grant applications, or the grantmakers to do that on their behalf.
Currently, 15% of SmartyGrants grantmakers use CLASSIE, and they have generated around 6000 classified grant applications to date.
While we are confident grantmakers’ use of CLASSIE will continue to expand, our current system relies on users to classify their own grants, and we know from experience that manual classification – using humans to read and classify each application – is time-consuming.
Furthermore, current and future grant applications represent only a fraction of the data we would like to classify, given that SmartyGrants holds almost 400,000 historical grant application records.
Against this background, CLASSIEfier was born. CLASSIEfier is an algorithm designed to automatically apply the CLASSIE taxonomy to grant applications, or indeed to any relevant social sector data.
CLASSIEfier: How does it work?
CLASSIEfier is an algorithm that reads a grant application - or any text related to the social sector - and predicts the main subjects and populations involved.
Initially, we considered machine learning as a solution to our problem. However, machine learning algorithms must be trained with a large set of already labelled applications to learn the representative writing patterns and vocabulary of each CLASSIE category - something that we didn't have at the time.
Our testing has shown that at least 200 applications per CLASSIE category are needed to generate good results. CLASSIE subject has 900+ categories, which means more than 180,000 labelled applications would be needed. It is impractical to reach those kinds of targets by manually classifying the data.
After extensive trials and research, we discovered that we could successfully extract keywords from the SmartyGrants database and create a controlled vocabulary to describe each CLASSIE category. We settled on an algorithm that follows a keyword-matching model to perform auto-classification.
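The article does not describe how the keywords were extracted, so the following is only a minimal sketch of one simple approach, not the CLASSIEfier method. It assumes a small set of texts already known to belong to a category and ranks the words that appear in them but never in the other texts. The function names, the stopword list and the sample texts are all invented for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a real vocabulary).
STOPWORDS = {"the", "a", "for", "and", "of", "to", "in", "with", "our"}


def words(text):
    """Lower-case the text and return its words, minus stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def candidate_keywords(in_category, other, top_n=3):
    """Rank words found in the category's texts but never in the other texts."""
    # Count each word once per text, so long texts do not dominate.
    inside = Counter(w for text in in_category for w in set(words(text)))
    outside = {w for text in other for w in words(text)}
    kept = [(w, n) for w, n in inside.items() if w not in outside]
    # Most frequent first; ties broken alphabetically for a stable order.
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return [w for w, _ in kept[:top_n]]


cancer_texts = [
    "Cancer screening for rural women",
    "Support for cancer patients during chemotherapy",
    "Chemotherapy transport for patients",
]
other_texts = [
    "New nets for the tennis club",
    "Support for rural schools",
]

print(candidate_keywords(cancer_texts, other_texts))
# ['cancer', 'chemotherapy', 'patients']
```

A list produced this way would still need human review before it could serve as a controlled vocabulary.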
A keyword-matching model
Keyword-matching is a common technique used to find keywords in text. Our use of it rests on the hypothesis that certain combinations of keywords can be used to describe a CLASSIE category.
The model uses three different groups of keywords and applies certain rules for each category:
- Unique keywords: These give a clear and distinct representation of a CLASSIE category.
- Context keywords: These can be general keywords, but they complement the unique keywords and give meaning to the text.
- Exclusion keywords: When these keywords are found in the text, a category can be excluded even if unique and context keywords were matched.
See the example below. With the keyword-matching algorithm we can classify social sector text with 80% accuracy.
This is how the keyword-matching model works for the CLASSIE category "Cancers".
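To make the three keyword groups concrete, here is a minimal sketch of how such a rule might be coded. It is an illustration under stated assumptions, not the CLASSIEfier implementation: the `CategoryRule` class, the `matches` function, every keyword listed, and the rule itself (at least one unique keyword plus at least one context keyword, and no exclusion keyword) are all hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class CategoryRule:
    """Keyword groups for one category (all keywords here are invented)."""
    name: str
    unique: set
    context: set
    exclusion: set = field(default_factory=set)


def tokenize(text):
    """Lower-case the text and return its distinct words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def matches(rule, text):
    """Assumed rule: one unique AND one context keyword, no exclusion keyword."""
    found = tokenize(text)
    if found & rule.exclusion:
        return False  # an exclusion keyword vetoes the category
    return bool(found & rule.unique) and bool(found & rule.context)


cancers = CategoryRule(
    name="Cancers",
    unique={"cancer", "leukaemia", "melanoma", "oncology"},
    context={"patients", "treatment", "research", "screening"},
    exclusion={"zodiac"},
)

print(matches(cancers, "Funding for melanoma screening in regional clinics"))  # True
print(matches(cancers, "A zodiac-themed fair: Cancer, Leo and Virgo stalls"))  # False
print(matches(cancers, "Treatment of river water for a community garden"))     # False
```

The second text contains a unique keyword but is vetoed by the exclusion keyword; the third contains only a context keyword, which is not enough on its own.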
While developing CLASSIEfier, we concluded that it is not feasible to classify natural language with 100% accuracy. We found many cases where keyword matches led to a wrong classification. For example, an application containing the words "church", "religious" and "Christian" would be classified under "Religion" even if the application concerned a fete at a Catholic school.
We are exploring this issue by constantly searching for biases and involving third parties in the CLASSIEfier testing. We will discuss the results in a future article.
The hierarchy of classifications
CLASSIE comprises a hierarchical taxonomy, where many categories themselves have “child” categories.
This is a simplified view of how CLASSIE subjects are structured, with the actual taxonomy including many more categories.
Consider a grant application aimed at helping teenagers with autism. This application will have the following classifications:
- “Health” at level 1
- “Diseases and conditions” at level 2
- “Brain and nervous system disorders” at level 3
- “Autism” at level 4
In classifying this application, the grantmaker or grantseeker may select the level 4 category “Autism”; doing so will automatically nest the application in the corresponding classification at higher levels (“Brain and nervous system disorders”; “Diseases and conditions”; “Health”).
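The nesting described above can be sketched with a simple child-to-parent lookup. The dictionary layout and the function name are assumptions made for illustration; they are not the format in which CLASSIE is stored.

```python
# Child -> parent links for the branch used in the example above.
# The layout is illustrative only, not the CLASSIE data format.
PARENT = {
    "Autism": "Brain and nervous system disorders",
    "Brain and nervous system disorders": "Diseases and conditions",
    "Diseases and conditions": "Health",
    "Health": None,  # level 1 categories have no parent
}


def with_ancestors(category):
    """Return the category followed by each of its ancestors up to level 1."""
    chain = []
    current = category
    while current is not None:
        chain.append(current)
        current = PARENT[current]
    return chain


print(with_ancestors("Autism"))
# ['Autism', 'Brain and nervous system disorders', 'Diseases and conditions', 'Health']
```

Selecting only the most detailed category is enough, because every higher level can be derived from it.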
This application will have two beneficiaries:
- “Children and youth (age 0-17)” at level 1
- “Adolescents (people aged 13-17)” at level 2
And also, perhaps:
- “People with disabilities” at level 1
- “People with intellectual disabilities” at level 2
As this example shows, most grant applications can be categorised by more than one label, which of course increases the complexity of CLASSIEfier.
To overcome this challenge, the algorithm works from the most detailed levels towards the most general. It first matches the keywords in the most detailed categories (level 4 in Subjects and level 3 in Populations) and rolls the classification back to a less detailed level if needed.
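The roll-back behaviour described in the paragraph above might look something like the following sketch. It is a simplification under stated assumptions: each category has a single invented keyword set rather than the three keyword groups, and the `TAXONOMY` structure and `classify` function are hypothetical names.

```python
import re

# name -> (level, keywords). Keywords are invented for illustration.
TAXONOMY = {
    "Health": (1, {"health", "medical", "wellbeing"}),
    "Diseases and conditions": (2, {"disease", "condition", "illness"}),
    "Brain and nervous system disorders": (3, {"neurological", "brain"}),
    "Autism": (4, {"autism", "autistic"}),
}


def classify(text):
    """Return matches at the most detailed level that has any, else []."""
    found = set(re.findall(r"[a-z]+", text.lower()))
    deepest = max(level for level, _ in TAXONOMY.values())
    for level in range(deepest, 0, -1):  # most detailed level first
        hits = [
            name
            for name, (cat_level, keywords) in TAXONOMY.items()
            if cat_level == level and found & keywords
        ]
        if hits:
            return hits  # stop at the first (most detailed) level that matches
    return []


print(classify("Weekend respite program for autistic teenagers"))
# ['Autism']
print(classify("Support group for people living with chronic illness"))
# ['Diseases and conditions']
print(classify("Upgrading the clubhouse roof"))
# []
```

The second text has no level 4 or level 3 match, so the classification rolls back to level 2.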
One CLASSIEfier, multiple uses
Once CLASSIEfier is tested and implemented, it will be able to classify almost any text relating to the social sector. We will offer this tool for use by grantmakers and other social sector supporters who wish to understand more about their own funding and support patterns, and by those who wish to know about and participate in mapping of universal trends.
The tool can be used to classify data not only within the SmartyGrants system but also across other enterprises, including GiveNow (Our Community’s donations platform), Funding Centre (our grantseeking database), and Good Jobs (our jobs search platform). External uses may be found for the tool too.
In this way we can further standardise how information is managed, making it possible to reveal trends and draw comparisons within a specific account or domain as well as within and across sectors.
CLASSIEfier is the first of many artificial intelligence initiatives that Our Community is pursuing.