Old 03-31-2012, 08:21 PM
Paul
How to convert MS Word special characters to HTML codes?

Hi there, I have been poring over a character conversion problem for
a day now and need some help. I created a Ruby script that scans an
Excel spreadsheet and puts the content into a custom XML file (this
part works fine). I am using Ruby 1.9.2.

When I tried importing the XML file into the destination program, it
failed. It turns out the Excel spreadsheet data had some text copied
from MS Word, so every now and then there is an embedded 'long
dash' or ellipsis character that is outside the regular ASCII set, and
the import function fails on these unexpected non-ASCII bytes.

I can find these lines and the specific characters when the script
reads the data. What I'd *like* to do is convert these special (Unicode)
characters to their HTML equivalents. After hours of searching blog
posts and skimming through old posts here, I am still stuck.

If I can't find a way to convert these characters, I'll just remove
them, but I'd really like to keep them somehow.

Can someone please point me to some references or offer some
advice on how I can convert them to HTML or ASCII equivalents?

Here's an example. In Ruby 1.9.2, I see the following line in my
output file:

"* \x85 ellipsis\n"

-> according to an HTML lookup table, I could replace \x85 with


Is there an easy way to convert these characters? I've tried the CGI
and Iconv libraries and they both skip over these characters. I would
prefer to have a routine that can find and replace each of the special
characters rather than write a regex for each character myself. I have
encountered 5 special characters so far. There might be more as I go
through the data.

Suggestions?

TIA.

Paul.
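
For anyone landing on this thread with the same question, here is a minimal sketch of the "one routine for all special characters" idea. It is not from the original post. It assumes the stray bytes are Windows-1252 (the usual case for text pasted from MS Word on Windows), and the helper name word_chars_to_html is made up for illustration. It emits decimal numeric character references rather than named entities, so no lookup table is needed.

```ruby
# Hypothetical helper, not part of the original script.
# Assumption: the input bytes are Windows-1252, as is typical for
# text copied out of MS Word on Windows.
def word_chars_to_html(raw)
  # Reinterpret the raw bytes as Windows-1252, then transcode to UTF-8.
  # Bytes that Windows-1252 leaves undefined are replaced with "?".
  utf8 = raw.dup.force_encoding("Windows-1252")
            .encode("UTF-8", undef: :replace, replace: "?")

  # Replace every non-ASCII character with a decimal numeric
  # character reference built from its Unicode code point.
  utf8.gsub(/[^\x00-\x7F]/) { |ch| format("&#%d;", ch.ord) }
end

puts word_chars_to_html("* \x85 ellipsis\n")
```

Plain ASCII text passes through unchanged, so the routine can be applied to every cell value without first checking whether it contains special characters. If the data turns out to be in a different encoding, the "Windows-1252" label is the only thing that should need changing.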
Old 04-02-2012, 03:25 PM
Paul
Re: How to convert MS Word special characters to HTML codes?

Never mind. I think I figured out the problem.

While creating my script, I ran it from SciTE (version 3.0.4 on
Windows 7). Something about running the script from that environment
works differently than when I run it from the command line.

When I run the script from the command line, everything works
correctly and the CGI library converts the characters well enough.

No worries.
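
A possible way to compare the two environments, offered as a guess rather than a confirmed diagnosis: print what Ruby reports about its default encodings and about a sample string, once from the editor and once from the command line, and look for differences. The helper name encoding_report is hypothetical.

```ruby
# Hypothetical diagnostic helper, not part of the original script.
# It only reports what Ruby sees; it does not change any settings.
def encoding_report(sample)
  internal = Encoding.default_internal
  {
    default_external: Encoding.default_external.name,
    default_internal: internal && internal.name,
    sample_encoding:  sample.encoding.name,
    sample_valid:     sample.valid_encoding?,
    # Hex dump of the raw bytes, e.g. "2A 20 85".
    sample_bytes:     sample.bytes.map { |b| format("%02X", b) }.join(" ")
  }
end

encoding_report("* \x85 ellipsis\n").each do |key, value|
  puts "#{key}: #{value}"
end
```

If the two runs show different default encodings, that would be one plausible explanation for the differing behavior, though the thread itself does not confirm the cause.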

All times are GMT.

