
7. Let's test the classifier on a couple of made-up reviews. The classify() method takes a single argument, which should be a feature set. We can use the same bag_of_words() feature detector on a made-up list of words to get our feature set.

>>> from featx import bag_of_words
>>> negfeat = bag_of_words(['the', 'plot', 'was', 'ludicrous'])
>>> nb_classifier.classify(negfeat)
'neg'
>>> posfeat = bag_of_words(['kate', 'winslet', 'is', 'accessible'])
>>> nb_classifier.classify(posfeat)
'pos'

How it works...

The label_feats_from_corpus() function assumes that the corpus is categorized, and that a single file represents a single instance for feature extraction. It iterates over each category label, and extracts features from each file in that category using the feature_detector() function, which defaults to bag_of_words(). It returns a dict whose keys are the category labels, and whose values are lists of instances for that category.
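As a rough sketch (not the book's actual featx source), a function with this behavior might look like the following; the corp argument is assumed to be an NLTK categorized corpus reader, such as movie_reviews:

```python
from collections import defaultdict

def bag_of_words(words):
    # The default feature detector: every word maps to True.
    return dict((word, True) for word in words)

def label_feats_from_corpus(corp, feature_detector=bag_of_words):
    # Hypothetical re-implementation for illustration; corp is assumed to
    # provide categories(), fileids(), and words(), like an NLTK
    # CategorizedPlaintextCorpusReader.
    label_feats = defaultdict(list)
    for label in corp.categories():
        for fileid in corp.fileids(categories=[label]):
            # One file = one instance: extract one feature set per file.
            feats = feature_detector(corp.words(fileids=[fileid]))
            label_feats[label].append(feats)
    return label_feats
```

Called on the movie_reviews corpus, this would yield a dict with 'pos' and 'neg' keys, each mapping to a list of bag-of-words feature sets.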

If the label_feats_from_corpus() function returned a list of labeled feature sets instead of a dict, it would be much harder to get balanced training data. The list would be ordered by label, and if you took a slice of it, you would almost certainly get far more of one label than another. By returning a dict, you can take slices from the feature sets of each label.

Now we need to split the labeled feature sets into training and testing instances using split_label_feats(). This function allows us to take a fair sample of labeled feature sets from each label, using the split keyword argument to determine the size of the sample. split defaults to 0.75, which means the first three-fourths of the labeled feature sets for each label will be used for training, and the remaining one-fourth will be used for testing. Once we have split up our training and testing feats, we train a classifier using the NaiveBayesClassifier.train() method.
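A minimal sketch of what split_label_feats() might look like under these assumptions (the real featx implementation may differ in details):

```python
def split_label_feats(lfeats, split=0.75):
    # Hypothetical sketch: lfeats is a dict mapping each label to a list
    # of feature sets, as returned by label_feats_from_corpus().
    train_feats = []
    test_feats = []
    for label, feats in lfeats.items():
        cutoff = int(len(feats) * split)
        # The first `split` fraction of each label's instances goes to
        # training, so every label stays equally represented.
        train_feats.extend((feat, label) for feat in feats[:cutoff])
        test_feats.extend((feat, label) for feat in feats[cutoff:])
    return train_feats, test_feats
```

Because the slicing happens per label, a 0.75 split of a corpus with 1,000 pos and 1,000 neg instances yields 750 training instances of each label, not a lopsided sample.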

This class method builds two probability distributions for calculating prior probabilities. These are passed in to the NaiveBayesClassifier constructor. The label_probdist contains P(label), the prior probability for each label.

The feature_probdist contains P(feature name = feature value | label). In our case, it will store P(word=True | label). Both are calculated based on the frequency of occurrence of each label, and of each feature name and value, in the training data.
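To make the role of the two distributions concrete, here is a toy illustration in plain Python (not NLTK's internals) of how P(label) and P(word=True | label) combine under the naive Bayes assumption; all the probabilities below are made up:

```python
# Made-up prior and conditional probabilities for illustration only.
label_probdist = {'pos': 0.5, 'neg': 0.5}           # P(label)
feature_probdist = {                                 # P(word=True | label)
    ('pos', 'great'): 0.6, ('neg', 'great'): 0.1,
    ('pos', 'awful'): 0.1, ('neg', 'awful'): 0.6,
}

def score(label, words):
    # P(label) * product of P(word=True | label) for each observed word.
    p = label_probdist[label]
    for word in words:
        # 0.5 is an arbitrary fallback for unseen words in this toy.
        p *= feature_probdist.get((label, word), 0.5)
    return p

def classify(words):
    scores = dict((label, score(label, words)) for label in label_probdist)
    total = sum(scores.values())
    # Normalize so the scores sum to 1, giving P(label | words).
    return dict((label, s / total) for label, s in scores.items())
```

For the single word 'great', this toy model assigns pos a posterior of 0.3 / (0.3 + 0.05), i.e. about 86 percent.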

Text Classification

The NaiveBayesClassifier inherits from ClassifierI, which requires subclasses to provide a labels() method, and at least one of the classify() and prob_classify() methods. The following diagram shows these and other methods, which will be covered shortly:

There's more...

We can test the accuracy of the classifier using nltk.classify.util.accuracy() and the test_feats created previously.

>>> from nltk.classify.util import accuracy
>>> accuracy(nb_classifier, test_feats)
0.72799999999999998

This tells us that the classifier correctly guessed the label of nearly 73 percent of the testing feature sets.

Classification probability

While the classify() method returns only a single label, you can use the prob_classify() method to get the classification probability of each label. This can be useful if you want to use probability thresholds greater than 50 percent for classification.
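For example, a small hypothetical helper (not part of NLTK) that only accepts a classification when its probability clears a threshold might look like this; label_probs stands in for the values you would read off with probs.prob(label):

```python
def classify_with_threshold(label_probs, threshold=0.8):
    # label_probs: dict mapping each label to its probability, e.g. built
    # from probs.prob(label) for each label in probs.samples().
    # Returns the most likely label only when its probability reaches the
    # threshold; otherwise returns None to signal "not confident enough".
    best = max(label_probs, key=label_probs.get)
    if label_probs[best] >= threshold:
        return best
    return None
```

With a threshold of 0.8, a 60/40 split between pos and neg would be rejected rather than guessed.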

>>> probs = nb_classifier.prob_classify(test_feats[0][0])
>>> probs.samples()
['neg', 'pos']
>>> probs.max()
'pos'
>>> probs.prob('pos')
0.99999996464309127
>>> probs.prob('neg')
3.5356889692409258e-08

In this case, the classifier says that the first testing instance is nearly 100 percent likely to be pos.

Most informative features

The NaiveBayesClassifier has two methods that are quite useful for learning about your data. Both methods take a keyword argument n to control how many results to show. The most_informative_features() method returns a list of the form [(feature name, feature value)] ordered from most informative to least informative.

In our case, the feature value will always be True.

>>> nb_classifier.most_informative_features(n=5)
[('magnificent', True), ('outstanding', True), ('insulting', True), ('vulnerable', True), ('ludicrous', True)]

The show_most_informative_features() method will print out the results from most_informative_features(), and will also include the probability of a feature pair belonging to each label.

>>> nb_classifier.show_most_informative_features(n=5)
Most Informative Features
         magnificent = True              pos : neg    =     15.0 : 1.0
         outstanding = True              pos : neg    =     13.6 : 1.0
           insulting = True              neg : pos    =     13.0 : 1.0
          vulnerable = True              pos : neg    =     12.3 : 1.0
           ludicrous = True              neg : pos    =     11.8 : 1.0

The informativeness, or information gain, of each feature pair is based on the prior probability of the feature pair occurring for each label. More informative features are those that occur primarily with one label and not the other. Less informative features are those that occur frequently with both labels.
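As a rough, made-up illustration of this idea (not NLTK's actual computation), you can compare a word's smoothed relative frequency under each label; the further the ratio is from 1:1, the more informative the word:

```python
def informativeness(pos_count, pos_total, neg_count, neg_total):
    # Smoothed relative frequency of a word under each label; the 0.5
    # smoothing constant is an arbitrary choice to avoid division by zero.
    p_pos = (pos_count + 0.5) / (pos_total + 1.0)
    p_neg = (neg_count + 0.5) / (neg_total + 1.0)
    # Ratio of the larger conditional probability to the smaller one.
    return max(p_pos, p_neg) / min(p_pos, p_neg)
```

A word like 'magnificent' that appears in many pos reviews and almost no neg reviews scores high, while a word that appears equally often in both labels scores close to 1.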
