Training similarity analysis

Language Embeddings and BERT

One of the core breakthroughs in natural language processing (NLP) is the invention of language embeddings. Unlike naturally numeric data, text cannot be directly used by machine learning models.

It must first be transformed into a numerical representation that preserves the meaning of the text. Language models achieve this by representing text as vectors: series of numbers that reflect different aspects of the original text’s semantic meaning. In practice, these vectors are produced by machine learning models trained on language comprehension tasks.

The canonical example of language embeddings, word-2-vec, is a shallow neural network model trained to perform simple language comprehension tasks (Mikolov et al., 2013). One such task is predicting the missing word in a sentence. Because this task can be applied to any text, a large set of text (a corpus) is used to train the model, which exposes it to a broad range of contexts. The vectors representing the text are randomly initialised, and the model adjusts them during training to best capture the meaning of the text. The final output is a dictionary of word vectors that can be used in further NLP models and applications. As Figure 1 shows, these vectors encode semantic meaning in vector-space so that, when plotted, neighbouring words have related meanings and the distances between words are intuitive.

Figure 1: Word-2-Vec encodes words in vector-space with intuitive distance properties
Source: Google, 2020
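
The distance properties illustrated in Figure 1 can be sketched with a few toy vectors. The example below is illustrative only: the three-dimensional vectors are hand-picked rather than learned, the function names are hypothetical, and cosine similarity is assumed as the measure of closeness (a common choice for word vectors, though the figure does not specify one).

```python
import math

# Hypothetical 3-dimensional word vectors, hand-picked for illustration.
# Real word vectors are learned from a corpus and have many more dimensions.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.9],
    "apple": [0.1, 0.0, 0.5],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(word, vectors):
    """Return the other word whose vector is most similar to `word`."""
    others = [w for w in vectors if w != word]
    return max(others, key=lambda w: cosine_similarity(vectors[word], vectors[w]))


closest_to_king = nearest("king", toy_vectors)
```

With these made-up values, "queen" sits closer to "king" than "apple" does, which is the kind of neighbourhood structure the figure depicts.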

While ground-breaking at the time of release, the original word-2-vec model had several limitations. These stemmed from its algorithmic simplicity, the size of its training data, and its failure to account for the context of words in a sentence, including words with multiple meanings.
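
The context limitation can be seen in a minimal sketch. It assumes a static lookup table of the kind a word-2-vec-style model produces, with one fixed vector per word. The vectors, sentences and the `embed_tokens` helper are all hypothetical.

```python
# Hypothetical static embedding table: one fixed vector per word.
# The values are made up for illustration.
static_vectors = {
    "bank": [0.2, 0.7],
    "river": [0.9, 0.1],
    "loan": [0.1, 0.9],
}


def embed_tokens(sentence, vectors):
    """Look up each known token; the surrounding words play no part."""
    tokens = sentence.lower().split()
    return {tok: vectors[tok] for tok in tokens if tok in vectors}


financial = embed_tokens("the bank approved the loan", static_vectors)
geographic = embed_tokens("the river bank flooded", static_vectors)

# Both senses of "bank" receive exactly the same vector.
same_vector = financial["bank"] == geographic["bank"]
```

Because the lookup ignores the sentence, the financial and geographic senses of "bank" cannot be told apart, which is the gap that context-aware models aim to close.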

The current state of the art in language embeddings is attention-based transformer models. Compared to word-2-vec, these transformer models are much larger. For example, OpenAI’s GPT-3, released in June 2020, has 175 billion parameters, requires computing infrastructure worth around $50 million to train, and used a large scrape of the public internet as training data (45,000 GB of text) (Brown et al., 2020).

GPT-3 is a research model and is not practical for most small-scale NLP applications. However, previous-generation transformer models are available for public use. One example is Google’s Bidirectional Encoder Representations from Transformers (BERT) model, which set new records on several NLP benchmarks upon its release in 2018 and is maintained by Google (Devlin et al., 2019). Transformer models provide high performance to ordinary users by taking advantage of the asymmetry between the resources required for model training and those required for model inference (use in applications). Model training requires dedicated large-scale infrastructure and enormous datasets. Once trained, however, these models can be downloaded and inference can be run on a typical high-end laptop.


The model uses VET administrative data to determine the degree of similarity between each qualification and every other qualification. The analysis included all training package qualifications that were current as at December 2020 and listed on the National Register of VET. Accredited courses and nationally accredited skill sets are not currently included due to the lack of detailed course information published on the national register. These could be included in a future update if the data is made available to the NSC. The current version of the model has 1,312 qualifications in scope.
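
Comparing each qualification against all others amounts to building a pairwise similarity table. The sketch below is one way this could look, under stated assumptions: each qualification has already been reduced to a single embedding vector, cosine similarity is the measure used (this passage does not specify one), and the codes `QUAL_A`, `QUAL_B` and `QUAL_C` and their vectors are placeholders rather than real register entries.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pairwise_similarity(embeddings):
    """Map each code to its similarity score against every other code."""
    codes = sorted(embeddings)
    return {
        a: {b: cosine_similarity(embeddings[a], embeddings[b])
            for b in codes if b != a}
        for a in codes
    }


# Placeholder qualification codes with made-up 2-dimensional embeddings.
qualification_embeddings = {
    "QUAL_A": [1.0, 0.0],
    "QUAL_B": [0.9, 0.1],
    "QUAL_C": [0.0, 1.0],
}

scores = pairwise_similarity(qualification_embeddings)

# Rank the other qualifications by similarity to QUAL_A, most similar first.
ranked_for_a = sorted(scores["QUAL_A"], key=scores["QUAL_A"].get, reverse=True)
```

For n qualifications this produces n × (n − 1) scores, so 1,312 qualifications give roughly 1.7 million ordered pairs, which is small enough to hold in memory.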

The information available about VET training package qualifications can be broken down into several components for use in this analysis. As displayed on the National Register of VET, a qualification has a title, a description of a few paragraphs, and a list of core and elective units, each with its own title and description.
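
The components described above could be held in a simple record structure. The sketch below is illustrative only: the class names, field names and the `text_components` helper are assumptions rather than the schema of the national register, and the example values are placeholders.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Unit:
    """A unit of competency, with its own title and description."""
    title: str
    description: str


@dataclass
class Qualification:
    """A qualification with its text fields and lists of units."""
    title: str
    description: str
    core_units: List[Unit] = field(default_factory=list)
    elective_units: List[Unit] = field(default_factory=list)

    def text_components(self) -> List[str]:
        """Return every text field: qualification first, then core units, then electives."""
        parts = [self.title, self.description]
        for unit in self.core_units + self.elective_units:
            parts.extend([unit.title, unit.description])
        return parts


# Placeholder record; not a real register entry.
example = Qualification(
    title="Example qualification",
    description="Placeholder qualification description.",
    core_units=[Unit("Example core unit", "Placeholder core unit description.")],
    elective_units=[Unit("Example elective unit", "Placeholder elective unit description.")],
)

components = example.text_components()
```

Keeping core and elective units in separate lists leaves room to treat them differently in later steps, should the analysis call for it.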