#1 (permalink)
11-11-2010, 02:07 PM
chad
How do I skip over multiple words in a file?

Let's say that I have an article. What I want to do is read in this
file and have the program skip over ever instance of the words "the",
"and", "or", and "but". What would be the general strategy for
attacking a problem like this?
#2 (permalink)
11-11-2010, 02:48 PM
Tim Chase
Re: How do I skip over multiple words in a file?

On 11/11/10 09:07, chad wrote:
> Let's say that I have an article. What I want to do is read in
> this file and have the program skip over ever instance of the
> words "the", "and", "or", and "but". What would be the
> general strategy for attacking a problem like this?


I'd keep a file of "stop words" and read them into a set
(normalizing case in the process). Then, as I skim over each
word in the target file, I'd check whether the case-normalized
version of the word is in the stop-word set and skip it if it is.
It might look something like this:

    def normalize_word(s):
        return s.strip().upper()

    # stop_words.txt holds one stop word per line
    stop_words = set(
        normalize_word(word)
        for word in open('stop_words.txt')
    )
    for line in open('data.txt'):
        for word in line.split():
            if normalize_word(word) in stop_words:
                continue
            process(word)  # whatever you want to do with the kept words

-tkc



#3 (permalink)
11-11-2010, 07:12 PM
r0g
Re: How do I skip over multiple words in a file?

On 11/11/10 15:07, chad wrote:
> Let's say that I have an article. What I want to do is read in this
> file and have the program skip over ever instance of the words "the",
> "and", "or", and "but". What would be the general strategy for
> attacking a problem like this?



If your files are not too big, I'd simply read them into a string and do
a string replace for each word you want to skip. If you want case
insensitivity, use re.sub() with the re.IGNORECASE flag instead of the
default str.replace() method. Neither is elegant or all that efficient,
but both are very easy. If your use case requires something
high-performance, then best keep looking.
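
A minimal sketch of the regex variant, assuming the whole text fits in memory and that losing the original line breaks is acceptable. The function name remove_words and the sample sentence are made up for illustration; re.sub, re.escape and re.IGNORECASE are the standard-library pieces doing the work.

```python
import re

def remove_words(text, words):
    """Delete every whole-word occurrence of the given words, ignoring case."""
    # The \b anchors keep "or" from matching inside "for" or "word".
    pattern = r'\b(?:' + '|'.join(re.escape(w) for w in words) + r')\b'
    stripped = re.sub(pattern, '', text, flags=re.IGNORECASE)
    # Removing a word leaves extra spaces behind; collapse all whitespace runs.
    return ' '.join(stripped.split())

sample = "The cat and the dog, but not the bird or the fish."
print(remove_words(sample, ["the", "and", "or", "but"]))
```

Without the word-boundary anchors, a plain replace of "or" would also chew up words like "for", which is the main trap with the string-replace approach.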

Roger.
#4 (permalink)
11-11-2010, 07:33 PM
Paul Watson
Re: How do I skip over multiple words in a file?

On 2010-11-11 08:07, chad wrote:
> Let's say that I have an article. What I want to do is read in this
> file and have the program skip over ever instance of the words "the",
> "and", "or", and "but". What would be the general strategy for
> attacking a problem like this?


I realize that you may need or want to do this in Python. This would be
trivial in an awk script.
#5 (permalink)
11-11-2010, 07:41 PM
Paul Rubin
Re: How do I skip over multiple words in a file?

chad <cdalten@gmail.com> writes:

> Let's say that I have an article. What I want to do is read in this
> file and have the program skip over ever instance of the words "the",
> "and", "or", and "but". What would be the general strategy for
> attacking a problem like this?


Something like (untested):

    stopwords = set(('the', 'and', 'or', 'but'))

    def goodwords(f):
        # f is an open file, or any other iterable of lines
        for line in f:
            for w in line.split():
                if w.lower() not in stopwords:
                    yield w

Removing punctuation is left as an exercise.
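
For what it's worth, one way the punctuation exercise might go, as a sketch only: strip punctuation from both ends of each word before the lookup. The default stop-word set and the sample line are assumptions for the demo, and note that this version yields the stripped word rather than the original token.

```python
import string

STOPWORDS = {"the", "and", "or", "but"}

def goodwords(lines, stopwords=STOPWORDS):
    """Yield words from an iterable of lines, skipping stop words."""
    for line in lines:
        for w in line.split():
            # Strip leading/trailing punctuation only, so "Let's" stays intact.
            core = w.strip(string.punctuation)
            if core and core.lower() not in stopwords:
                yield core

text = ['skip over the words "the", "and", "or", and "but".']
print(list(goodwords(text)))
```

Since an open file is itself an iterable of lines, goodwords(open('some.txt')) should work the same way as the list used here.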
#6 (permalink)
11-11-2010, 08:18 PM
Stefan Sonnenberg-Carstens
Re: How do I skip over multiple words in a file?

On 11.11.2010 21:33, Paul Watson wrote:
> On 2010-11-11 08:07, chad wrote:
>> Let's say that I have an article. What I want to do is read in this
>> file and have the program skip over ever instance of the words "the",
>> "and", "or", and "but". What would be the general strategy for
>> attacking a problem like this?

>
> I realize that you may need or want to do this in Python. This would
> be trivial in an awk script.

There are several ways to do this.

    skip = ('and', 'or', 'but')
    words = []
    for line in open('some.txt'):
        # keep every whitespace-separated word that is not in skip
        words.extend(w for w in line.split() if w not in skip)
    print(words)

If some.txt contains your original question, it returns this:
["Let's", 'say', 'that', 'I', 'have', 'an', 'article.', 'What', 'I',
'want', 'to', 'do', 'is', 'read', 'in', 'this', 'file', 'have', 'the',
'program', 'skip', 'over', 'ever', 'instance', 'of', 'the', 'words',
'"the",', '"and",', '"or",', '"but".', 'What', 'would', 'be', 'the',
'general', 'strategy', 'for', 'attacking', 'a', 'problem', 'like',
'this?']

But this is just _one_ way to get there.
Faster solutions could be based on a regex:
    import re
    skip = ('and', 'or', 'but')
    word_re = re.compile(r'\w+')  # raw string, so \w is not treated as a string escape
    print([w for w in word_re.findall(open('some.txt').read()) if w not in skip])

This gives the following result (you lose the punctuation):
['Let', 's', 'say', 'that', 'I', 'have', 'an', 'article', 'What', 'I',
'want', 'to', 'do', 'is', 'read', 'in', 'this', 'file', 'have', 'the',
'program', 'skip', 'over', 'ever', 'instance', 'of', 'the', 'words',
'the', 'What', 'would', 'be', 'the', 'general', 'strategy', 'for',
'attacking', 'a', 'problem', 'like', 'this']

But there are so many ways to do it ...

