A quick tour of traditional NLP

Corpora, Tokens, and Types

corpus : a text dataset (plural: corpora), usually raw text together with any metadata associated with it.

tokens : tokens correspond to words and numeric sequences separated by white-space characters or punctuation.

types : types are the unique tokens in a corpus; the set of all types in a corpus is its vocabulary (or lexicon).

Tokenizing text

import spacy

nlp = spacy.load('en_core_web_sm')  # older spaCy releases used spacy.load('en')
text = "Mary, don't slap the green witch"
print([str(token) for token in nlp(text.lower())])

Lemmas and Stems

Lemmas : Lemmas are the root forms of words. Consider the verb fly. It can be inflected into many different surface forms (flies, flew, flown, flying, and so on), and fly is the lemma for all of these seemingly different words.
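
As a concrete sketch (assuming spaCy and the en_core_web_sm model are installed), spaCy exposes each token's lemma through the lemma_ attribute:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("he was running late")
for token in doc:
    # lemma_ holds the root form assigned to each token
    print('{} --> {}'.format(token.text, token.lemma_))
# prints e.g. was --> be, running --> run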

Stems : Stemming is the poor man's lemmatization. It uses handcrafted rules to strip the endings of words, reducing them to a common form called a stem.
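
spaCy sticks to lemmatization, so a minimal stemming sketch here uses NLTK's PorterStemmer instead (assumes the nltk package is installed):

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for word in ['flies', 'flew', 'flying']:
    # handcrafted suffix-stripping rules; stems need not be dictionary words
    print('{} --> {}'.format(word, stemmer.stem(word)))
# e.g. flies --> fli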

spaCy

spaCy is a natural language processing library whose main features include word tokenization, lemmatization, part-of-speech (POS) tagging, named entity recognition (NER), noun phrase extraction, and syntactic parsing.
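
A minimal sketch of those features, again assuming the en_core_web_sm model; token.pos_, doc.ents, and doc.noun_chunks are the relevant spaCy attributes:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("Mary slapped the green witch in London.")

for token in doc:
    print(token.text, token.pos_)    # part-of-speech tag for each token

for ent in doc.ents:
    print(ent.text, ent.label_)      # named entities, e.g. London --> GPE

for chunk in doc.noun_chunks:
    print(chunk.text)                # noun phrases such as "the green witch"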
