Clustering Additions
This is text I'd like to add to the machine learning page (pending heavy editing). This text would go underneath the "Issues in Training and Evaluation" section.
Overfitting
Overfitting occurs when a classifier fits its example data so closely that it becomes useless for classifying anything outside of the model.
This can sometimes occur because a classifier looks at irrelevant data points in the training data. Classifiers often give more weight to features seen less commonly in the training data, and if new data shares similarly uncommon values, the classifier groups it accordingly. As a result, the classifier is extremely accurate when classifying the training data and appears to be useful, but it becomes inaccurate when used to organize data outside of the model.
Machine learning techniques can prevent overfitting by imposing a penalty on complicated models. A simpler model is often more consistent with the underlying trends of the data in question.
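As an illustration, here is a minimal sketch of one such penalty, L2 regularization on a linear model. This is only one of many ways to penalize complexity, and the function, data, and alpha value are hypothetical:

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Fit a linear model with an L2 penalty on the weights.

    The alpha term penalizes large (complicated) weight vectors,
    pushing the fit toward a simpler model that generalizes better.
    """
    n_features = X.shape[1]
    # Closed-form solution of: minimize ||Xw - y||^2 + alpha * ||w||^2
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Toy usage: recover weights from noisy linear data
X = np.random.rand(50, 5)
y = X @ np.array([1.0, 0.0, 0.0, 2.0, -1.0]) + 0.01 * np.random.randn(50)
w = ridge_fit(X, y, alpha=0.1)
```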
Imbalance of Data
Data within training sets can be imbalanced, where the training set has more examples of one category of data than another. Many classifying algorithms assume that the ratio of the different categories of training data they receive is roughly equivalent to the ratio of real data that will belong in each category.
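A simple way to see whether a training set is imbalanced is to count the labels. The sketch below is hypothetical; the label names and the 95:5 split are made up for illustration:

```python
from collections import Counter

def class_ratios(labels):
    """Return the fraction of examples belonging to each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

# A badly imbalanced training set: 95 healthy examples for every 5 cancer examples
train_labels = ["healthy"] * 95 + ["cancer"] * 5
print(class_ratios(train_labels))  # {'healthy': 0.95, 'cancer': 0.05}
```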
There may also be characteristics that differentiate members of the minority class that are missed when data is imbalanced. For example, a classifier trained on examples of cancer patients may fail to differentiate between different types of cancer in the minority class[1].
One way to address this imbalance of data is to use active example selection[1]. Active example selection builds the classifier model slowly, adding a few pieces of sample data of each class at a time. The model is tested at every stage; documents that improve the accuracy of the model are kept in the classifier, while ones that do not change the model or that make its performance worse are removed. This ensures that only meaningful pieces of data are used to train the classifier, and since the ratio of documents is closer to 1:1, an imbalance of data is less of an issue. A rough sketch of the idea appears below.
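The following is a loose sketch of that selection loop, not the authors' exact procedure. Here `train` and `score` are hypothetical callables supplied by the user, and `pool` is assumed to be pre-grouped into small batches, each containing a few examples of every class:

```python
def active_example_selection(pool, eval_set, train, score):
    """Grow a training set batch by batch, keeping only batches
    that improve accuracy on a held-out evaluation set.

    pool  -- list of batches; each batch holds a few examples per class
    train -- function(examples) -> fitted model
    score -- function(model, eval_set) -> accuracy in [0, 1]
    """
    selected = []
    best_accuracy = 0.0
    for batch in pool:
        model = train(selected + batch)
        accuracy = score(model, eval_set)
        if accuracy > best_accuracy:
            # The batch improved the model, so keep its examples.
            selected += batch
            best_accuracy = accuracy
        # Otherwise the batch is discarded, keeping the class ratio balanced.
    return train(selected), selected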
Evaluating Results
There are a number of techniques for evaluating the results of a machine learning algorithm. Some of these techniques are also used in natural language processing. Machine learning techniques are generally evaluated by their results, and some methods (such as neural networks) are considered "black box" forms of classification, since it is not easy to understand how or why the underlying implementation sorts data a particular way.
In some cases, particularly with classifiers, the outcome of the machine learning algorithm is compared to a set of data classified by experts. The machine learning algorithm creates a model based on a set of training data, and then classifies a second sample set of data. A group of experts also annotates the second set of data, marking it according to how they believe it should be classified (not according to how they believe a machine would classify it). The training and evaluation data is generally a tiny subset of the data available. The results of the algorithm are compared to the results of the experts and summarized as two scores: precision and recall. Precision accounts for situations where the classifier put something in a category where it did not belong; recall accounts for situations where the classifier did not put something in a category it should have. If a classifier is able to produce good results for a subset of data, it should also be successful at classifying a larger set of similar data.
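For a single category, both scores can be computed directly from the predicted and expert labels. A minimal sketch, with made-up labels for illustration:

```python
def precision_recall(predicted, expert, positive="spam"):
    """Score classifier output against expert labels for one category."""
    tp = sum(p == positive == e for p, e in zip(predicted, expert))
    fp = sum(p == positive != e for p, e in zip(predicted, expert))
    fn = sum(e == positive != p for p, e in zip(predicted, expert))
    precision = tp / (tp + fp) if tp + fp else 0.0  # wrong inclusions lower this
    recall = tp / (tp + fn) if tp + fn else 0.0     # missed members lower this
    return precision, recall

print(precision_recall(["spam", "spam", "ham"], ["spam", "ham", "ham"]))
# (0.5, 1.0): one of two "spam" calls was wrong, but every real spam was found
```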
Classification and clustering algorithms can also be measured against other algorithms. This is useful when an algorithm is attempting to improve performance (speed, for example) while providing similar levels of precision and recall relative to another algorithm. It can also be used to show that an algorithm is an improvement over a previous generation, or to show which algorithm is most useful for organizing data for a particular problem.
Scalability
Modern computers have the ability to multitask exceptionally well through the use of multiple cores or CPUs. Machine learning algorithms are often used to process large quantities of data, and optimizing them for multiple CPUs can greatly improve their performance.
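For example, an embarrassingly parallel step such as classifying many independent documents can be spread across cores. This sketch uses Python's standard multiprocessing module; the `classify` function is a toy stand-in, not a real model:

```python
from multiprocessing import Pool

def classify(document):
    """Stand-in for an expensive per-document classification step."""
    return len(document) % 2  # toy placeholder, not a real model

if __name__ == "__main__":
    documents = ["example text"] * 100_000
    with Pool() as workers:  # defaults to one worker process per core
        labels = workers.map(classify, documents)
    print(len(labels))
```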
Biclustering
The following material would go on a new page of the above title (probably linked to the ML page)
Biclustering is an unsupervised machine learning method which searches for similarities in specific subsections of the input data. Biclustering is unique among machine learning methods because it searches for similarities in small parts of the data instead of assigning each piece of data to a single group.
Biclustering was first introduced around 1970. Today, it is a commonly used technique in bioinformatics, particularly in the area of gene expression, or identifying groups of genes that are similar between different people.
Processing
Clustering algorithms take vectors as input. They sort the vectors according to how similar they are by comparing all of the features (data) in each vector, and the result of the clustering is several groups, each containing a set of (hopefully) similar vectors. Biclustering instead looks at all of the vectors as a single input matrix and attempts to find regions of the input that look similar.
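The contrast shows up in how the two kinds of algorithms are called. A minimal sketch, assuming scikit-learn is available; SpectralBiclustering is just one of several biclustering algorithms, and the random matrix is purely illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans, SpectralBiclustering

data = np.random.rand(100, 40)  # 100 vectors with 40 features each

# Clustering: each row (vector) is assigned to exactly one group
row_groups = KMeans(n_clusters=3, n_init=10).fit_predict(data)

# Biclustering: the whole table is treated as one matrix, and the
# algorithm looks for sub-blocks of rows *and* columns that behave alike
model = SpectralBiclustering(n_clusters=3, random_state=0).fit(data)
print(model.row_labels_[:10])     # bicluster assignment per row
print(model.column_labels_[:10])  # bicluster assignment per column
```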
Biclustering is a useful technique for finding trends in data when each vector of data is large, because it can spot trends in specific parts of the data that clustering cannot. A normal clustering algorithm sorts pieces of data according to features that the majority of them share. Biclustering organizes pieces of data according to the parts of them that seem similar.
Applications
Biclustering is particularly useful in the medical field, where it can be used, for example, to find genes related to a specific disease in a group of patients[2]. If each vector represents how a person expresses traits, biclustering can be used to identify a set of genes associated with cancer. It can even be used to find similarities and differences between different varieties of cancer.
References
1. Oh, Sangyoon et al. (2011). "Ensemble Learning with Active Example Selection for Imbalanced Biomedical Data Classification". IEEE/ACM Transactions on Computational Biology and Bioinformatics. http://dx.doi.org/10.1109/TCBB.2010.96
2. Sill, Martin et al. (2011). "Robust biclustering by sparse singular value decomposition incorporating stability selection". Bioinformatics. http://dx.doi.org/10.1093/bioinformatics/btr322