Machine Learning is Fun! (part 2)
The world’s easiest introduction to Machine Learning
1.2 Terminology
To conduct machine learning, we must have data first. Suppose we have collected a set of watermelon records, for example, (color = dark; root = curly; sound = muffled), (color = green; root = curly; sound = dull), (color = light; root = straight; sound = crisp), …, where each pair of parentheses encloses one record and "=" means "takes value".
Collectively, the records form a data set, where each record contains the description of an event or object, e.g., a watermelon. A record, also called an instance or a sample, describes some attributes of the event or object, e.g., the color, root, and sound of a watermelon. These descriptions are often called attributes or features, and their values, such as green and dark, are called attribute values. The space spanned by the attributes is called an attribute space, sample space, or input space. For example, if we consider color, root, and sound as three axes, then they span a three-dimensional space describing watermelons, and we can position every watermelon in this space. Since every point in the space corresponds to a position vector, an instance is also called a feature vector.
More generally, let D = {x1, x2, …, xm} be a data set containing m instances, where each instance is described by d attributes. For example, we use three attributes to describe watermelons. Each instance xi = (xi1; xi2; …; xid) ∈ X is a vector in the d-dimensional sample space X, where d is called the dimensionality of the instance xi, and xij is the value of the jth attribute of the instance xi. For example, at the beginning of this section, the second attribute of the third watermelon takes the value straight.
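To make this notation concrete, here is a minimal Python sketch of the three-attribute watermelon data set described above. The variable names (`attributes`, `records`) are illustrative choices, not from any particular library:

```python
# A minimal sketch of the watermelon data set D = {x1, x2, ..., xm}.
# Attribute names and values follow the examples in the text.

attributes = ["color", "root", "sound"]   # d = 3 attributes

# Each record is one instance: a d-dimensional feature vector.
records = [
    ("dark",  "curly",    "muffled"),   # x1
    ("green", "curly",    "dull"),      # x2
    ("light", "straight", "crisp"),     # x3
]

m, d = len(records), len(attributes)     # m instances, dimensionality d

# xij: the value of the jth attribute of the ith instance.
# E.g., the second attribute of the third watermelon (1-indexed in the text):
print(records[2][1])   # -> "straight"
```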
The process of using machine learning algorithms to build models from data is called learning or training. The data used in the training phase is called training data, in which each sample is a training example, and the set of all training examples is called a training set. Since a learned model corresponds to the underlying rules about the data, it is also called a hypothesis, and the actual underlying rules are called the facts or ground-truth. The objective of machine learning, then, is to find or approximate the ground-truth. In this article, models are sometimes called learners, which are machine learning algorithms instantiated with data and parameters.
Nevertheless, the samples in our watermelon example are not sufficient for learning a model that can determine the ripeness of uncut watermelons. In order to train an effective prediction model, the outcome information must also be available, e.g., ripe in ((color = green; root = curly; sound = muffled), ripe). The outcome of a sample, such as ripe or unripe, is often called a label, and a sample with a label is called an example. More generally, we can write the ith example as (xi, yi), where yi ∈ Y is the label of the sample xi, and Y is the set of all labels, also called the label space or output space.
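In code, an example is simply a feature vector paired with its label. A hedged sketch, with `training_set` and `label_space` as illustrative names:

```python
# A sketch of a labeled training set: each example is a pair (xi, yi),
# where xi is the feature vector and yi is its label.

training_set = [
    (("green", "curly",    "muffled"), "ripe"),    # (x1, y1)
    (("dark",  "curly",    "dull"),    "ripe"),    # (x2, y2)
    (("light", "straight", "crisp"),   "unripe"),  # (x3, y3)
]

label_space = {"ripe", "unripe"}   # Y: the set of all possible labels
```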
When the prediction output is discrete, such as ripe and unripe, it is called a classification problem; when the prediction output is continuous, such as the degree of ripeness, it is called a regression problem. If the prediction output has only two possible classes, then it is called a binary classification problem, where one class is marked as positive and the other is marked as negative. When more than two classes are present, it becomes a multiclass classification problem. More generally, the prediction problem is to establish a mapping f : X → Y from the input space X to the output space Y by learning from a training set {(x1, y1), (x2, y2), …, (xm, ym)}. Conventionally, we let Y = {−1, +1} or {0, 1} for binary classification problems, |Y| > 2 for multiclass classification problems, and Y = R for regression problems, where R is the set of real numbers.
The process of making predictions with a learned model is called testing, and the samples to be predicted are called testing samples. For example, the label y of a testing sample x can be obtained via the learned model as y = f(x).
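Putting the last two ideas together, the sketch below trains a simple binary classifier and uses it to label a previously unseen watermelon, assuming scikit-learn is installed. The decision tree and the integer encodings of the attribute values are arbitrary illustrative choices, not the only way to do this:

```python
# A sketch of learning a mapping f: X -> Y and then testing it,
# assuming scikit-learn is available.
from sklearn.tree import DecisionTreeClassifier

# Encode each categorical attribute value as an integer (arbitrary choice).
color = {"green": 0, "dark": 1, "light": 2}
root  = {"curly": 0, "straight": 1}
sound = {"muffled": 0, "dull": 1, "crisp": 2}

X_train = [
    [color["green"], root["curly"],    sound["muffled"]],
    [color["dark"],  root["curly"],    sound["dull"]],
    [color["light"], root["straight"], sound["crisp"]],
]
y_train = ["ripe", "ripe", "unripe"]   # binary classification: |Y| = 2

model = DecisionTreeClassifier().fit(X_train, y_train)   # training

# Testing: predict the label of a testing sample via y = f(x).
x_test = [[color["dark"], root["curly"], sound["muffled"]]]
print(model.predict(x_test))
```

Any classifier could stand in for the decision tree here; it is used only because it handles a tiny tabular data set with no tuning.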
Besides prediction, another type of learning is clustering. For example, we can group watermelons into several clusters, where each cluster contains the watermelons that share some underlying concept, such as light color versus dark color, or even locally grown versus imported. Clustering often provides data insights that form the basis of further analysis. However, we should note that the concepts, such as light color or locally grown, are unknown before clustering, and the samples are usually unlabeled.
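As a sketch of this idea, the snippet below groups unlabeled, integer-encoded watermelons with k-means, again assuming scikit-learn; the choice of two clusters is illustrative:

```python
# A sketch of clustering unlabeled samples, assuming scikit-learn.
from sklearn.cluster import KMeans

# Unlabeled samples: the same integer-encoded attributes as above.
X = [
    [0, 0, 0],
    [1, 0, 1],
    [2, 1, 2],
    [2, 1, 1],
]

# The algorithm groups samples by similarity; what each cluster *means*
# (e.g., "light color" vs. "dark color") is unknown beforehand.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(clusters)   # e.g., [1 1 0 0] -- cluster indices, not labels
```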
Depending on whether the training data is labeled or not, we can roughly divide learning problems into two classes: supervised learning (e.g., classification and regression) and unsupervised learning (e.g., clustering).
It is worth mentioning that the objective of machine learning is to learn models that work well on new samples, rather than just the training examples. The same objective also applies to unsupervised learning (e.g., clustering), since we wish the learned clusters to work well on samples outside of the training set. The ability to work on new samples is called generalization ability, and a well-generalized model should work well on the whole sample space. Although the training set is usually a tiny proportion of the sample space, we still hope that the training set can, to some extent, reflect the characteristics of the whole sample space; otherwise, it would be hard for the learned model to work well on new samples. We generally assume that all samples in a sample space follow an unknown distribution D, and that every sample is drawn independently from this distribution, that is, the samples are independent and identically distributed (i.i.d.). Generally speaking, the more samples we have, the more information we know about the distribution D, and consequently, the better-generalized model we can learn.
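A minimal sketch of estimating generalization ability, assuming scikit-learn: hold out a test set of i.i.d. samples that the model never sees during training, and treat accuracy on it as an estimate of performance on new samples. The synthetic data here merely stands in for a real data set:

```python
# Under the i.i.d. assumption, accuracy on a held-out test set estimates
# how well the model works on new samples from the same distribution D.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic i.i.d. samples standing in for a real data set.
X, y = make_classification(n_samples=200, n_features=3,
                           n_informative=3, n_redundant=0, random_state=0)

# Split into a training set and an unseen test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```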
Don’t forget to follow me:
Twitter: https://twitter.com/Eviloon1
Instagram: https://www.instagram.com/abubakkar_sattar/
LinkedIn: https://www.linkedin.com/in/abubakar-sattar-807a27178/