BERT can encode multi-word text, unlike its predecessor word2vec, which produces one vector per word. Each of the text components mentioned above can therefore be encoded by BERT directly, resulting in five sources of embeddings:
- Qualification Title
- Qualification Description
- Qualification Keywords (extracted from the description)
- Core Unit Titles
- Elective Unit Titles
In cases where a component produces multiple vectors (such as several core units within a qualification), the vectors were averaged into a single vector.
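As a minimal sketch of this averaging step, the snippet below collapses several unit-title embeddings into one component vector with an element-wise mean. The vectors are toy 4-dimensional stand-ins; real BERT embeddings would be much higher-dimensional (e.g. 768 dimensions for bert-base), and the unit names are invented.

```python
import numpy as np

# Toy stand-ins for BERT embeddings of three core unit titles
# within one qualification (hypothetical values).
core_unit_vectors = np.array([
    [0.2, 0.4, 0.1, 0.3],   # embedding of "Unit A"
    [0.6, 0.0, 0.3, 0.1],   # embedding of "Unit B"
    [0.1, 0.2, 0.5, 0.2],   # embedding of "Unit C"
])

# Element-wise mean collapses the set of unit vectors into a single
# component vector for the qualification.
core_units_vector = core_unit_vectors.mean(axis=0)
print(core_units_vector)
```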
Each qualification’s five component vectors are then combined into a weighted average, with the weights tuned to improve performance. This produces a single BERT vector for each qualification that represents a combination of the original text.
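The weighted combination can be sketched as follows. The weights and 3-dimensional component vectors below are illustrative only; the actual tuned weights are not stated here, and real component vectors would be full BERT embeddings.

```python
import numpy as np

# Hypothetical weights for the five text components (illustrative,
# not the tuned values used in the model).
weights = {
    "title": 0.30,
    "description": 0.25,
    "keywords": 0.10,
    "core_units": 0.25,
    "elective_units": 0.10,
}

# Toy 3-d component vectors standing in for real BERT embeddings.
components = {
    "title": np.array([1.0, 0.0, 0.0]),
    "description": np.array([0.0, 1.0, 0.0]),
    "keywords": np.array([0.0, 0.0, 1.0]),
    "core_units": np.array([0.5, 0.5, 0.0]),
    "elective_units": np.array([0.0, 0.5, 0.5]),
}

# Weighted average: scale each component vector by its weight,
# sum, and normalise by the total weight.
total_weight = sum(weights.values())
qualification_vector = sum(weights[k] * components[k] for k in weights) / total_weight
print(qualification_vector)
```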
The final step is to compare qualifications using vector distance to quantify how similar their vectors are. While this could be achieved with ordinary (Euclidean) vector distance, cosine similarity is usually the preferred method in NLP. Rather than measuring the straight-line distance between vectors, cosine similarity measures the angle between them, which better captures nuances in the strength and meaning of language, not just the raw difference between texts.
In the example below, the vector distance between ‘good’ and ‘bad’ is the same as that between ‘good’ and ‘great’. Distance alone would therefore suggest a similar difference in meaning for both pairs, even though ‘good’ and ‘bad’ are opposites while ‘good’ and ‘great’ both express varying degrees of positivity. The angular difference between ‘good’ and ‘bad’, however, is much larger than that between ‘good’ and ‘great’.
Figure 2: An illustration of vector distance and cosine distance
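The contrast in Figure 2 can be reproduced numerically. The 2-d vectors below are toy values chosen so that ‘good’ is equally distant (in Euclidean terms) from ‘bad’ and ‘great’, while pointing in the same direction as ‘great’; real BERT vectors would of course differ.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: dot product over norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 2-d word vectors (hypothetical, chosen for illustration).
good = np.array([1.0, 1.0])
great = np.array([2.0, 2.0])             # same direction, larger magnitude
bad = np.array([1.0 + np.sqrt(2), 1.0])  # different direction, same distance

# Euclidean distances are identical for both pairs...
dist_good_great = np.linalg.norm(good - great)
dist_good_bad = np.linalg.norm(good - bad)
print(dist_good_great, dist_good_bad)

# ...but cosine similarity separates them: 'good' and 'great' point the
# same way (similarity 1.0), while 'good' and 'bad' form an angle.
print(cosine_similarity(good, great))
print(cosine_similarity(good, bad))
```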
Hence the overall methodology is to encode the components of each qualification as BERT vectors, combine them with a weighted average, and then compute the cosine similarity between the resulting vectors. This produces a final measure of similarity between qualifications.
Model output and validation
The final output consists of an overall similarity score for each qualification compared to every other qualification. The output also includes the similarity scores for each text component so that users can identify how the overall similarity score was calculated. These can be ranked, searched through, and filtered by both input and output as demonstrated in the searchable dashboard. Further details of the searchable dashboard can be found in Appendix A.
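The ranking step above can be sketched as a pairwise similarity matrix over the combined qualification vectors, with the results sorted for a given query. The qualification names and 3-d vectors are invented for illustration.

```python
import numpy as np

# Hypothetical combined qualification vectors (toy 3-d values);
# the names are invented examples, not real output.
names = ["Cert III Carpentry", "Cert III Joinery", "Cert IV Accounting"]
vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.9],
])

# Normalise rows so a dot product equals cosine similarity,
# then build the full pairwise similarity matrix.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
sim = unit @ unit.T

# Rank every other qualification against a query qualification.
query = 0
ranked = sorted(
    ((names[j], sim[query, j]) for j in range(len(names)) if j != query),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)
```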
There is currently no existing model that compares the similarity of course design within the Australian tertiary sector, so the results could not be benchmarked against an established standard. The results were therefore manually checked by NSC analysts to assess model performance: the top 20 most similar qualifications for the most popular training package courses were manually reviewed. Based on this review, the weighting of each text component was adjusted to place less emphasis on keywords and elective units, which is discussed in further detail in the next section.