How to get rid of punctuation using NLTK tokenizer?

Posted on

Question :

How to get rid of punctuation using NLTK tokenizer?

I’m just starting to use NLTK and I don’t quite understand how to get a list of words from text. If I use nltk.word_tokenize(), I get a list of words and punctuation. I need only the words instead. How can I get rid of punctuation? Also word_tokenize doesn’t work with multiple sentences: dots are added to the last word.

Asked By: lizarisk


Answer #1:

Take a look at the other tokenizing options that nltk provides here. For example, you can define a tokenizer that picks out sequences of alphanumeric characters as tokens and drops everything else:

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'w+')
tokenizer.tokenize('Eighty-seven miles to go, yet.  Onward!')


['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'Onward']
Answered By: lizarisk

Answer #2:

You do not really need NLTK to remove punctuation. You can remove it with simple python. For strings:

import string
s = '... some string with punctuation ...'
s = s.translate(None, string.punctuation)

Or for unicode:

import string
translate_table = dict((ord(char), None) for char in string.punctuation)   

and then use this string in your tokenizer.

P.S. string module have some other sets of elements that can be removed (like digits).

Answered By: rmalouf

Answer #3:

Below code will remove all punctuation marks as well as non alphabetic characters. Copied from their book.

import nltk

s = "I can't do this now, because I'm so tired.  Please give me some time. @ sd  4 232"

words = nltk.word_tokenize(s)

words=[word.lower() for word in words if word.isalpha()]



['i', 'ca', 'do', 'this', 'now', 'because', 'i', 'so', 'tired', 'please', 'give', 'me', 'some', 'time', 'sd']
Answered By: Salvador Dali

Answer #4:

As noticed in comments start with sent_tokenize(), because word_tokenize() works only on a single sentence. You can filter out punctuation with filter(). And if you have an unicode strings make sure that is a unicode object (not a ‘str’ encoded with some encoding like ‘utf-8’).

from nltk.tokenize import word_tokenize, sent_tokenize

text = '''It is a blue, small, and extraordinary ball. Like no other'''
tokens = [word for sent in sent_tokenize(text) for word in word_tokenize(sent)]
print filter(lambda word: word not in ',-', tokens)
Answered By: Madura Pradeep

Answer #5:

I just used the following code, which removed all the punctuation:

tokens = nltk.wordpunct_tokenize(raw)


text = nltk.Text(tokens)


words = [w.lower() for w in text if w.isalpha()]
Answered By: palooh

Answer #6:

I think you need some sort of regular expression matching (the following code is in Python 3):

import string
import re
import nltk

s = "I can't do this now, because I'm so tired.  Please give me some time."
l = nltk.word_tokenize(s)
ll = [x for x in l if not re.fullmatch('[' + string.punctuation + ']+', x)]


['I', 'ca', "n't", 'do', 'this', 'now', ',', 'because', 'I', "'m", 'so', 'tired', '.', 'Please', 'give', 'me', 'some', 'time', '.']
['I', 'ca', "n't", 'do', 'this', 'now', 'because', 'I', "'m", 'so', 'tired', 'Please', 'give', 'me', 'some', 'time']

Should work well in most cases since it removes punctuation while preserving tokens like “n’t”, which can’t be obtained from regex tokenizers such as wordpunct_tokenize.

Answered By: vish

Answer #7:

Sincerely asking, what is a word? If your assumption is that a word consists of alphabetic characters only, you are wrong since words such as can't will be destroyed into pieces (such as can and t) if you remove punctuation before tokenisation, which is very likely to affect your program negatively.

Hence the solution is to tokenise and then remove punctuation tokens.

import string

from nltk.tokenize import word_tokenize

tokens = word_tokenize("I'm a southern salesman.")
# ['I', "'m", 'a', 'southern', 'salesman', '.']

tokens = list(filter(lambda token: token not in string.punctuation, tokens))
# ['I', "'m", 'a', 'southern', 'salesman']

…and then if you wish, you can replace certain tokens such as 'm with am.

Answered By: Quan Gan

Answer #8:

I use this code to remove punctuation:

import nltk
def getTerms(sentences):
    tokens = nltk.word_tokenize(sentences)
    words = [w.lower() for w in tokens if w.isalnum()]
    print tokens
    print words

getTerms("hh, hh3h. wo shi 2 4 A . fdffdf. A&&B ")

And If you want to check whether a token is a valid English word or not, you may need PyEnchant


 import enchant
 d = enchant.Dict("en_US")
Answered By: Bora M. Alper

Leave a Reply

Your email address will not be published.