Let's say that I have an article. What I want to do is read in this
file and have the program skip over every instance of the words "the",
"and", "or", and "but". What would be the general strategy for
attacking a problem like this?

On 11/11/10 09:07, chad wrote:
> Let's say that I have an article. What I want to do is read in
> this file and have the program skip over every instance of the
> words "the", "and", "or", and "but". What would be the
> general strategy for attacking a problem like this?

I'd keep a file of "stop words" and read them into a set (normalizing case in the process). Then, as I skim over each word in my target file, I'd check whether the case-normalized version of the word is in the stop-word set and skip it if it is. It might look something like this:

  def normalize_word(s):
      # strip surrounding whitespace (including the newline) and fold case
      return s.strip().upper()

  # stop_words.txt holds one stop word per line
  stop_words = set(
      normalize_word(word)
      for word in open('stop_words.txt')
      )

  for line in open('data.txt'):
      for word in line.split():
          if normalize_word(word) in stop_words:
              continue
          process(word)  # process() stands in for whatever you do with each kept word

-tkc

On 11/11/10 15:07, chad wrote:
> Let's say that I have an article. What I want to do is read in this
> file and have the program skip over every instance of the words "the",
> "and", "or", and "but". What would be the general strategy for
> attacking a problem like this?

If your files are not too big, I'd simply read them into a string and do a string replace for each word you want to skip. If you want case insensitivity, use re.sub() with the re.IGNORECASE flag instead of the default str.replace() method.

Neither is elegant or all that efficient, but both are very easy. If your use case requires something high performance, then best keep looking.

Roger.
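For reference, here is a minimal sketch of the replace-based approach described above, written for Python 3. The helper name `strip_stop_words` and the sample sentence are made up for illustration. The `\b` word-boundary anchors are an addition that the post does not mention: they keep the "or" inside "for" from being removed, which a plain string replace would not guarantee.

```python
import re

# Words to drop, taken from the original question.
STOP_WORDS = ("the", "and", "or", "but")

# One alternation wrapped in \b anchors so that only whole words match.
# re.escape is not strictly needed for these four words, but it keeps
# the pattern safe if the list grows.
_STOP_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in STOP_WORDS) + r")\b",
    re.IGNORECASE,
)


def strip_stop_words(text):
    """Return text with stop words removed (hypothetical helper)."""
    without = _STOP_RE.sub("", text)
    # Removing a word leaves extra spaces behind; collapse whitespace runs.
    return " ".join(without.split())


if __name__ == "__main__":
    print(strip_stop_words("The cat and the dog, but not for long"))
```

Note that the final `split()`/`join()` step also flattens newlines into single spaces, so this sketch suits the "read the whole file into one string" case rather than line-oriented processing.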

On 2010-11-11 08:07, chad wrote:
> Let's say that I have an article. What I want to do is read in this
> file and have the program skip over every instance of the words "the",
> "and", "or", and "but". What would be the general strategy for
> attacking a problem like this?

I realize that you may need or want to do this in Python. This would be trivial in an awk script.

chad <cdalten@gmail.com> writes:
> Let's say that I have an article. What I want to do is read in this
> file and have the program skip over every instance of the words "the",
> "and", "or", and "but". What would be the general strategy for
> attacking a problem like this?

Something like (untested):

  stopwords = set(('the', 'and', 'or', 'but'))

  def goodwords(lines):
      # lines is any iterable of text lines, e.g. an open file
      for line in lines:
          for w in line.split():
              if w.lower() not in stopwords:
                  yield w

Removing punctuation is left as an exercise.
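One possible take on the punctuation exercise, as a Python 3 sketch rather than a definitive answer. It assumes that trimming punctuation from the two ends of each token is enough, so that `"the",` is recognised as `the`; apostrophes inside a word such as `Let's` are left alone. The names `STOP_WORDS` and `good_words` and the sample line are illustrative only.

```python
import string

STOP_WORDS = {"the", "and", "or", "but"}


def good_words(lines):
    """Yield the tokens from an iterable of lines, skipping stop words.

    Punctuation is stripped from both ends of a token only for the
    comparison; kept tokens are yielded unchanged.
    """
    for line in lines:
        for token in line.split():
            core = token.strip(string.punctuation)
            if core.lower() not in STOP_WORDS:
                yield token


if __name__ == "__main__":
    sample = ['Skip the words "the", "and", "or", and "but".']
    print(list(good_words(sample)))
```

Because the comparison uses the stripped form but the original token is yielded, punctuation attached to ordinary words (for example `went,`) survives in the output. A token made up entirely of punctuation is also kept, since its stripped form is the empty string.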

On 11.11.2010 21:33, Paul Watson wrote:
> On 2010-11-11 08:07, chad wrote:
>> Let's say that I have an article. What I want to do is read in this
>> file and have the program skip over every instance of the words "the",
>> "and", "or", and "but". What would be the general strategy for
>> attacking a problem like this?
>
> I realize that you may need or want to do this in Python. This would
> be trivial in an awk script.

There are several ways to do this.

  skip = ('and', 'or', 'but')
  # split every line on whitespace and keep the words not in skip
  words = [w for l in open('some.txt') for w in l.split() if w not in skip]
  print(words)

If some.txt contains your original question, it returns this:

  ["Let's", 'say', 'that', 'I', 'have', 'an', 'article.', 'What', 'I',
   'want', 'to', 'do', 'is', 'read', 'in', 'this', 'file', 'have', 'the',
   'program', 'skip', 'over', 'every', 'instance', 'of', 'the', 'words',
   '"the",', '"and",', '"or",', '"but".', 'What', 'would', 'be', 'the',
   'general', 'strategy', 'for', 'attacking', 'a', 'problem', 'like',
   'this?']

But this is just _one_ way to get there. Faster solutions could be based on a regex:

  import re

  skip = ('and', 'or', 'but')
  # \w+ matches runs of word characters, so punctuation is dropped
  word_re = re.compile(r'(\w+)')
  print([w for w in word_re.findall(open('some.txt').read()) if w not in skip])

This gives the following result (you lose some punctuation etc.):

  ['Let', 's', 'say', 'that', 'I', 'have', 'an', 'article', 'What', 'I',
   'want', 'to', 'do', 'is', 'read', 'in', 'this', 'file', 'have', 'the',
   'program', 'skip', 'over', 'every', 'instance', 'of', 'the', 'words',
   'the', 'What', 'would', 'be', 'the', 'general', 'strategy', 'for',
   'attacking', 'a', 'problem', 'like', 'this']

But there are so many ways to do it ...