The output of the model is the similarity between pairs of qualifications. To compare these visually, it is useful to see how large groups of qualifications relate to each other. This can be achieved using the dimension reduction algorithm t-SNE, which compresses the similarity of all qualification pairs into a 2-dimensional projection (van der Maaten & Hinton, 2008). This is shown below, for the example training packages, in Figure 3.
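As a rough illustration of this step, the sketch below projects a pairwise similarity matrix into two dimensions with scikit-learn's `TSNE`. It is not the pipeline used for this analysis: the toy similarity matrix, the `project_similarities` helper, the `1 - similarity` conversion to a distance, and the perplexity value are all assumptions made for the example.

```python
import numpy as np
from sklearn.manifold import TSNE


def project_similarities(similarity, perplexity=5.0, seed=0):
    """Project a symmetric similarity matrix (values in [0, 1]) to 2-D."""
    similarity = np.asarray(similarity, dtype=float)
    # Assumption: treat 1 - similarity as a distance, so identical items sit at 0.
    distance = np.clip(1.0 - similarity, 0.0, None)
    np.fill_diagonal(distance, 0.0)
    tsne = TSNE(
        n_components=2,
        metric="precomputed",  # we pass distances, not feature vectors
        init="random",         # required when the metric is precomputed
        perplexity=perplexity, # must be smaller than the number of items
        random_state=seed,
    )
    return tsne.fit_transform(distance)


# Toy data: 12 hypothetical qualifications in two groups of 6.
rng = np.random.default_rng(0)
n = 12
group = np.array([0] * 6 + [1] * 6)
similarity = np.where(group[:, None] == group[None, :], 0.8, 0.2)
similarity = similarity + rng.uniform(-0.05, 0.05, size=(n, n))
similarity = (similarity + similarity.T) / 2  # keep the matrix symmetric
np.fill_diagonal(similarity, 1.0)

coords = project_similarities(similarity)  # one (x, y) row per qualification
```

Each row of `coords` can then be plotted as a point and coloured by training package. Note that t-SNE preserves local neighbourhoods better than global distances, so the spacing between well-separated clusters should be read with caution.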

In this projection, some distinct clusters appeared. Not surprisingly, some of these clusters mirror the existing training packages. This is due to the way VET training packages are bundled. Each training package services an industry, or a number of related industries, such as construction or community services, which require related skills. Training packages also contain qualifications across multiple qualification levels (Certificate I up to Graduate Diploma) that focus on the same area of study.

Figure 3: The t-SNE algorithm produces a scatterplot that shows the relationship between Community Services, Health, and Defence

Some interesting findings emerged. Some training packages were much more closely linked than others. For example, courses from the Health training package and the Community Services training package overlap significantly. Closer examination shows that this intersection includes qualifications in areas such as disability, population health, allied health assistance and mental health. At the other end of the spectrum, some courses focus strictly on Health or on Community Services, such as pathology collection and youth justice respectively.

Some training packages also split into distinct clusters, which indicates that a diverse range of skills is taught within the same training package. For example, the Defence training package splits into two clusters: the first contains qualifications related to health, legal services, and psychological support, while the second focuses on defence technology such as explosive ordnance. Further details can be seen in Appendix B.

As with all NLP models, the output is not perfect. NLP models consistently find words with multiple meanings challenging to process. One example in this analysis was the set of qualifications in strata community management within the Property Services training package. Because the word ‘community’ appears frequently in the course descriptions, along with words like ‘facilitate’ and ‘support’, the model considered these qualifications highly similar to those in the Community Services training package. To address this issue, the weighting on keywords was lowered.
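The effect of lowering keyword weights can be sketched with a simple weighted bag-of-words cosine similarity. This is an illustration only, not the model used in this analysis: the two course descriptions, the weight of 0.2, and the helper names `weighted_vector` and `cosine` are all hypothetical.

```python
import math
from collections import Counter


def weighted_vector(text, weights, default=1.0):
    """Count words, then scale each count by its weight (default 1.0)."""
    counts = Counter(text.lower().split())
    return {word: n * weights.get(word, default) for word, n in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse word-weight dictionaries."""
    dot = sum(value * b.get(word, 0.0) for word, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical course descriptions, invented for this example.
strata = "facilitate strata community management and support community schemes"
services = "support community services and facilitate community development"

# Similarity with every word weighted equally.
plain = cosine(weighted_vector(strata, {}), weighted_vector(services, {}))

# Similarity after down-weighting the ambiguous keywords.
low = {"community": 0.2, "facilitate": 0.2, "support": 0.2}
damped = cosine(weighted_vector(strata, low), weighted_vector(services, low))
```

With the ambiguous words down-weighted, `damped` is lower than `plain`, because the remaining overlap between the two descriptions carries less of the total weight.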

The preliminary output also produced some unusual similarity pairings due to the elective units of competency within the qualifications. Some qualifications have a very high number of elective units and relatively few core units. For example, the Certificate III in Agriculture has 133 current elective units and only 2 current core units. Therefore, it would be impossible for anyone completing this qualification to undertake all the elective units on offer. To overcome this, the weighting on individual elective units was reduced.
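One way to picture the reduced elective weighting is a weighted overlap score in which core units count fully and each elective counts for less. The sketch below is an assumption about how such a weighting could work, not the method used in this analysis; the unit codes, the weight of 0.1, and the helper names are hypothetical.

```python
def unit_weights(core, electives, elective_weight):
    """Map each unit code to a weight: 1.0 for core, less for electives."""
    weights = {unit: elective_weight for unit in electives}
    weights.update({unit: 1.0 for unit in core})
    return weights


def weighted_jaccard(a, b):
    """Weighted Jaccard overlap between two unit-weight dictionaries."""
    units = set(a) | set(b)
    shared = sum(min(a.get(u, 0.0), b.get(u, 0.0)) for u in units)
    total = sum(max(a.get(u, 0.0), b.get(u, 0.0)) for u in units)
    return shared / total if total else 0.0


# Hypothetical qualifications: different core units, many shared electives.
core_a, electives_a = {"A1", "A2"}, {f"E{i}" for i in range(1, 11)}
core_b, electives_b = {"B1", "B2", "B3"}, {f"E{i}" for i in range(1, 9)}

# Electives weighted the same as core units.
equal = weighted_jaccard(
    unit_weights(core_a, electives_a, 1.0),
    unit_weights(core_b, electives_b, 1.0),
)

# Electives down-weighted so that core units dominate the comparison.
reduced = weighted_jaccard(
    unit_weights(core_a, electives_a, 0.1),
    unit_weights(core_b, electives_b, 0.1),
)
```

In this toy case the two qualifications share no core units, so reducing the elective weight lowers their similarity score (`reduced` is smaller than `equal`).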

Another limitation stems from the information available about VET courses. Course descriptions often contain words such as assessment, training, certification and regulatory requirement. As a result, qualifications in the Training and Assessment training package produce many false positive matches with other VET qualifications. This issue is unique to the Training and Assessment training package qualifications.


The Australian VET system is complex, with over 15,000 units of competency across the 56 training packages. Traditional methods of manually analysing and codifying this text are time consuming, difficult, and prone to sampling errors and to biases arising from the reliance on human judgement.

The machine learning and NLP techniques described in this paper can be used to better understand the qualifications and skills being taught through the Australian VET system. The NSC will continue to build on this exploratory training similarity analysis. It has the potential to help identify similar training products, which can assist in the simplification of the VET system and qualification design. This work will also deepen the current understanding of the links between jobs, skills and training. The sharing of intelligence aims to enrich Australia’s capacity to understand and adapt to the changing labour market.

The NSC welcomes all feedback regarding potential use cases and model improvements via: