
CLASSIEfier: From scikit-learn to SageMaker in multilabel text classification

By Paola Oliva-Altamirano, Data Scientist, Our Community

Tracking the flow of funding to and within the Australian social sector has historically been difficult because of inconsistencies in categorisation (or the absence of categorisation entirely!).

Our Community developed CLASSIE, a universal classification system for Australian social sector initiatives, to address this problem. Our Community is now developing a machine learning algorithm called CLASSIEfier to reduce the need for manual classification. In the long term, CLASSIEfier will help answer fundamental questions: Where is the money going? What impact is it having? And is the money going to those most in need?

Here we present the results of our experiments with different model training and deployment options: classical machine learning packages such as scikit-learn running on a local computer, and pre-built models in AWS SageMaker.
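To make the local scikit-learn option concrete, here is a minimal sketch of a multilabel text classifier. It is an illustration only, not the CLASSIEfier model: the example texts, the label names ("children", "sport", "food") and the choice of TF-IDF features with one logistic regression per label are all assumptions made for this sketch.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy grant descriptions invented for illustration; this is not CLASSIE data.
texts = [
    "after-school football program for primary school children",
    "training camp for professional athletes",
    "community garden and healthy cooking classes",
    "junior swimming lessons and holiday sport for kids",
    "elite sport coaching scholarships",
    "cooking workshops for families and children",
]
# Each text can carry several labels at once (hypothetical label names).
labels = [
    {"children", "sport"},
    {"sport"},
    {"food"},
    {"children", "sport"},
    {"sport"},
    {"children", "food"},
]

# Turn the label sets into a 0/1 indicator matrix, one column per label.
binarizer = MultiLabelBinarizer()
y = binarizer.fit_transform(labels)

# One binary classifier per label, trained on TF-IDF features.
model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(texts, y)

# Predict an indicator row for a new text and map it back to label names.
predicted = model.predict(["holiday football clinic for children"])
predicted_labels = binarizer.inverse_transform(predicted)
print(predicted_labels)
```

With so few training examples the predicted labels are not meaningful. The point is the shape of the workflow: label sets become an indicator matrix, and each label gets its own classifier.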


Conclusion

SageMaker is a great tool for senior data scientists who understand how machine learning works and know what to expect from model training and deployment. If you are thinking of using SageMaker, you need to be hyperaware of model and data biases, and you need to know the training data in detail to avoid bad AI design. Even marginally wrong classifications can have serious consequences. For example, if CLASSIEfier nested 'kids recreational activities' under 'sport professionals', grantmakers might conclude that they are supporting too many professional sport causes and cut funding to those areas, when in reality they were simply funding kids' activities. Overall, though, SageMaker can save you a lot of time and give you a pain-free machine learning journey when it is used carefully.
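One simple way to look for this kind of systematic error is to count which true labels are being replaced by which wrong ones on a hand-checked sample. The sketch below is an assumption about how such a check might look, not part of CLASSIEfier. The function name `count_label_confusions` and the sample data are hypothetical; only the two category names come from the example above.

```python
from collections import Counter


def count_label_confusions(true_labels, predicted_labels):
    """Count (missed true label, wrongly predicted label) pairs per item."""
    confusions = Counter()
    for truth, predicted in zip(true_labels, predicted_labels):
        missed = truth - predicted      # labels the model failed to assign
        spurious = predicted - truth    # labels the model assigned wrongly
        for missed_label in missed:
            for spurious_label in spurious:
                confusions[(missed_label, spurious_label)] += 1
    return confusions


# Hypothetical hand-checked sample: one set of labels per grant.
true_labels = [
    {"kids recreational activities"},
    {"kids recreational activities"},
    {"sport professionals"},
]
predicted_labels = [
    {"sport professionals"},
    {"kids recreational activities"},
    {"sport professionals"},
]

confusions = count_label_confusions(true_labels, predicted_labels)
for (missed_label, spurious_label), count in confusions.most_common():
    print(f"{missed_label!r} predicted as {spurious_label!r}: {count}")
```

A pair that appears often in this count is a sign that the training data for those two categories deserves a closer look.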


For a detailed discussion of the results, see our whitepaper below.




Download From scikit-learn to SageMaker in multilabel text classification (1200kb)