25 May 2006

Against Web Standards

Following on from my article on web standards and my attempt to get a Blogger page to conform, I thought I would address some of the common critiques of web standards. I have formatted this page as a FAQ. If you have any further questions, please post a comment. I'd like to see this become a lively debate.

Who or what is the W3C?


The World Wide Web Consortium (W3C) was founded in 1994 by Tim Berners-Lee with the involvement of MIT, CERN, DARPA, Japan's Keio University and the European Commission.

Who cares about the W3C?


Apparently, some of the largest technology firms in the world, including Adobe, AOL, Apple, AT&T, BBC, Cisco, Ericsson, Fujitsu, Google, Hitachi, IBM, Intel, Japan Broadcasting Corporation (NHK), Lexmark, Matsushita, NEC, Nokia, Nortel, Oracle, Siemens AG, Sony, Sun, Toshiba, Xerox, Yahoo!... even Microsoft. The full members list is impressive.

But the W3C site is ugly


The W3C site is designed for developers and bears the look of a specification or technical document. Taking its audience into account I find it about 6 on the ugly scale. I also think Yahoo! is quite ugly but use it every day just the same.

Ugliness does not translate into lack of usability, though I prefer sites that are easy on the eyes.

Won't using web standards make my site ugly?


Following the standards does not have anything to do with the aesthetic of a website. Conforming sites can be ugly or pretty, boring or exciting, just like nonconforming sites.

However, following standards makes it significantly easier to modify the look of your site and try out different designs. A good example of this is css Zen Garden which invites outside designers to redo the home page. Switchable stylesheets allow you to view the page in radically different designs. You may be amazed that none of these alter the HTML in any way!

The W3C has no standards so why follow them?


The W3C publishes what it calls "recommendations", and has done so more than ninety times. Whether you want to call these "standards" or not is a matter of semantics. For the record, the W3C themselves use these terms interchangeably.

Why should I support standards if [insert popular site here] doesn't?


There are several parts to this complicated issue.

First, it depends on which validation standard and tool you are using. Google produces 33 errors in the online W3C Validator but zero in the W3C tool Tidy.

Second, and related, standards are not an absolute target. Some large and complicated sites must support older browsers (eg: Internet Explorer 5 and its ilk). These user agents did not support standards, so a site that hopes to render properly in them cannot fully conform either. (There are partial ways around this.)

Third, there is the problem of inertia. A large firm may in fact be updating pages to be standards-compatible, but hasn't got there yet. If a mix of technologies for generating web pages is not compliant, it can take a lot of work and investment to reach that target. All the more reason to start off on the right foot.

Fourth, some sites may not see all the benefits, so the organisation is not pushed to change. Yahoo! and Google are the search engines so they obviously care less about good SEO!

Fifth, while there are bad examples, there are also surprising cases of shining goodness, for example Microsoft.

Sixth, despite all the above, some sites do in fact need work. For example, Google does not state a doctype on their page. This is poor behaviour I cannot explain! They should hire me (or maybe you) as a consultant. Who said big companies never make mistakes?

23 May 2006

Obscuring Your Email From Spammers

Fighting spam is a full-time job for some, an annoyance at least for others. Those of us who have a website and want to plainly display an email address for the convenience of our readers have a problem: we are also plainly displaying this for spam-friendly web scrapers. Over the years a good number of techniques have arisen to deal with this, which I will outline in this article.

For each technique I will discuss any drawbacks. There is no perfect method but even a small effort is better than none. The theory is this: spammers operate on the basis of volume. It is not worth their while to slow down to do any sort of complicated parsing since the payoff is only a few more addresses out of millions (how many people bother with these techniques?).

That said, I will embark upon this wonderful journey of discovery as a testament to the inventiveness of those who have pioneered these methods, whether they are justified or not!

Plain Text
Code as: <a href="mailto:first.last@domain.com">
first.last@domain.com</a>


Looks like: first.last@domain.com

This amounts to not doing anything. Your email address is displayed openly. Any spam can be dealt with on the receiving end. Your readers need no special browsers or capabilities and can click on the link with expected results.

Character Entities
Code as: first&#46;last&#64;domain&#46;com

Looks like: first.last@domain.com

HTML character entities are decoded by the browser back into displayable type, but look like some sort of gibberish at the markup level. However, they are easy to automatically decode, and so cannot be recommended as a way of avoiding spammers.
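To show how little effort the decoding takes, here is a minimal sketch of what a scraper might do. The function name decodeNumericEntities is my own invention for illustration, and it only handles numeric references, not named ones such as &amp;.

```javascript
// Decode hexadecimal (&#x40;) and decimal (&#64;) numeric character references.
function decodeNumericEntities(markup) {
  return markup
    .replace(/&#x([0-9a-f]+);/gi, (_, hex) => String.fromCodePoint(parseInt(hex, 16)))
    .replace(/&#(\d+);/g, (_, dec) => String.fromCodePoint(parseInt(dec, 10)));
}

const encoded = 'first&#46;last&#64;domain&#46;com';
console.log(decodeNumericEntities(encoded)); // first.last@domain.com
```

Two regular expressions are enough, which is why this technique offers so little protection.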

Typographic Obfuscation
Code as: first dot last at domain dot com
f i r s t . l a s t @ d o m a i n . c o m
first.REMOVE.last@domain.com


By spelling out parts of your address, adding spaces, using synonyms, or including obviously extraneous words, you are relying on a reader to visually decode and rewrite your email. This technique, like most of those that follow, precludes the use of a convenient mailto link, because if it's convenient to the reader then it is to the spider as well.

Unfortunately this means extra work, which will reduce the number of messages you get. Presuming you want to communicate, this is a bad thing. Also, web scrapers may be smart enough to piece together a valid email, since there are only so many substitutions that must be tried. (Removing whitespace is almost too easy.)
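As a rough sketch of how a scraper could undo the three examples above, consider the following. The function name deobfuscate and the short list of substitutions are assumptions made for illustration; a real scraper would presumably try many more variants.

```javascript
// Try to turn a typographically obscured address back into a plain one.
function deobfuscate(text) {
  let s = text.trim();
  // "f i r s t . l a s t @ ..." : every token is a single character,
  // so removing all whitespace restores the address.
  if (/^\S(\s\S)+$/.test(s)) {
    s = s.replace(/\s+/g, '');
  }
  return s
    .replace(/\s+at\s+/gi, '@')   // "first at domain"  -> "first@domain"
    .replace(/\s+dot\s+/gi, '.')  // "domain dot com"   -> "domain.com"
    .replace(/\.REMOVE\./g, '.'); // drop the obviously extraneous word
}

console.log(deobfuscate('first dot last at domain dot com'));
console.log(deobfuscate('f i r s t . l a s t @ d o m a i n . c o m'));
console.log(deobfuscate('first.REMOVE.last@domain.com'));
// each line prints: first.last@domain.com
```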

Substitute With A Graphic
Code as:
<img src="myaddress.png" border="0" alt="my email address">

Looks like:

Converting the text to an image file definitively foils spiders, but is a barrier to your users and may break usability guidelines. The reason is that you cannot safely put your address in the clear in the ALT attribute, so those without a visual display get no useful info. (A similar technique uses Flash files, but has no additional advantages.)

JavaScript Generation

There are many possible variants on the theme of programmatically creating the email link. This is my own, which contains some enhancements.

Note that significant strings that a spider might be set to recognise (eg: "mailto") are broken up. Also, character entities are used for the symbols, plus the components of the email address are listed backwards.
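<script type="text/javascript"> <!--
// Writes a mailto link into the page; t is the visible link text.
function InsertEmail(t) {
// Character entities for "." and "@", so the symbols never appear literally.
var chardot = '&#46;';
var charat = '&#64;';
// Components of the address, listed backwards.
var commune = new Array('com', 'domain', 'last', 'first');

// "mailto:" is split across two writes so the string never appears whole.
document.write('<a href="ma');
document.write('ilto:');
document.write(commune[3]);
document.write(chardot);
document.write(commune[2]);
document.write(charat);
document.write(commune[1]);
document.write(chardot);
document.write(commune[0]);
document.write('">');
document.write(t);
document.write('</a>');
}
// --> </script>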


In practice one would remove the function to an external JS file, making it even less likely to be found and parsed. The problem with this technique is that it restricts your readers to JavaScript-enabled browsers. In practice this may not be a significant limitation.
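One possible variation, sketched here under my own assumptions, is to separate building the markup from writing it, so the same logic can be checked outside a browser. The name buildEmailLink is hypothetical and not part of any library.

```javascript
// Hypothetical helper: returns the link markup instead of writing it to the page.
function buildEmailLink(label) {
  const dot = '&#46;';                               // entity for "."
  const at = '&#64;';                                // entity for "@"
  const parts = ['com', 'domain', 'last', 'first']; // stored backwards
  const address = parts[3] + dot + parts[2] + at + parts[1] + dot + parts[0];
  // "mailto:" is still assembled from two pieces.
  return '<a href="ma' + 'ilto:' + address + '">' + label + '</a>';
}

// In a browser one would call: document.write(buildEmailLink('email me'));
console.log(buildEmailLink('email me'));
// <a href="mailto:first&#46;last&#64;domain&#46;com">email me</a>
```

The browser decodes the entities when it parses the written markup, so the reader still gets a working link.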

JavaScript Generation With Obfuscation

To take the previous technique even further you can obfuscate the JavaScript. The online tool Enkoder creates something like this:

<script type="text/javascript">
/* <![CDATA[ */
function hivelogic_enkoder(){var kode=
"kode=\"oked\\\"=')('injo).e(rsvere).''t(lispe.od=kdeko\\\\;k\\\"do=e\\\"\\"+
"\\\\\\\\\\kode\\\\\\\\\\\\\\\\\\\"\\\\\\\\\\\\==dxke)o(}dcCeaoCrohfmgri.tn"+
"=rxS8+1;+2)=<c(0ic3f);(-AidtCeaocreho.=d{k+ci)h+g;et.ndlkeio0<i;r=f('o=;;'"+
"\\\\x\\\\\\\\\\\\\\\\\\\\\\\"\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\@\\\\g{nh0r\\"+
"\\0\\\\\\\\\\\\\\\\\\\\\\\\0\\\\\\\\,\\\\\\\\\\\\\\\\\\\\\\\\+\\\\gfFhdrFu"+
"rkipjul1wq@u{V;.4>.5,@?f+3lf6i,>+0DlgwFhdrfuhkr1@g~n.fl,k.j>hw1qgonhlr3?l>"+
"u@i+*r@>>*A{(%g/BDs5(ksDibtug4uoFsyjrzzgx4lyuor@gz(oCskbnlgx(&kBo.}zzxk4{t"+
"us%ihjr@\\\\g\\\\\\\\\\\\\\\\\\\\\\\\n\\\\\\\\=\\\\\\\\\\\\\\\\\\\\\\\"d\\"+
"\\ke;o\\\\\\\\\\\\\\\\\\\"\\\\\\\\\\\\kode=kode.split('').reverse().join('"+
"')\\\"\\\\\\\\\\\\x;'=;'of(r=i;0<ik(do.eelgnht1-;)+i2={)+xk=do.ehcratAi(1+"+
"+)okedc.ahAr(t)ik}do=e+xi(k<do.eelgnhtk?do.ehcratAk(do.eelgnht1-:)'';)\\\""+
"\\\\e=od\\\"kk;do=eokeds.lpti'()'r.verees)(j.io(n'')\";x='';for(i=0;i<(kod"+
"e.length-1);i+=2){x+=kode.charAt(i+1)+kode.charAt(i)}kode=x+(i<kode.length"+
"?kode.charAt(kode.length-1):'');"
;var i,c,x;while(eval(kode));}hivelogic_enkoder();
/* ]]> */
</script>

This has no real advantage over the more comprehensible JavaScript technique unless you believe spammers possess high intelligence and cracking abilities. I don't think so.

Encryption

This technique stores only an encrypted address on the page, decrypted by JavaScript. It certainly stops spam, but is overkill for most purposes. If you wish to use it, try Email Protector, which uses 10-bit RSA encryption.
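To give a feel for the idea, here is a toy sketch of RSA with a 10-bit modulus. This is not Email Protector's code: the primes 23 and 41, the exponents, and all function names are assumptions I picked so the arithmetic is easy to follow.

```javascript
// Toy parameters for illustration only: n = 23 * 41 = 943, a 10-bit modulus.
const n = 943n;
const e = 7n;   // public exponent, used once to encrypt
const d = 503n; // private exponent: (7 * 503) % 880 === 1, where 880 = 22 * 40

// Square-and-multiply modular exponentiation.
function modPow(base, exp, mod) {
  let result = 1n;
  base %= mod;
  while (exp > 0n) {
    if (exp & 1n) result = (result * base) % mod;
    base = (base * base) % mod;
    exp >>= 1n;
  }
  return result;
}

// Done once, offline: each character code becomes a number below n.
function encryptAddress(address) {
  return Array.from(address, ch => modPow(BigInt(ch.charCodeAt(0)), e, n));
}

// Done in the page: only the numbers and the decryption exponent are published.
function decryptAddress(numbers) {
  return numbers.map(c => String.fromCharCode(Number(modPow(c, d, n)))).join('');
}

const stored = encryptAddress('first.last@domain.com');
console.log(decryptAddress(stored)); // first.last@domain.com
```

Because the decrypting script has to ship with the page, the protection rests on the work a scraper would need to do, namely running or reimplementing the script.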

Form With CGI

Some sites refuse entirely to publish their addresses and accept email only through a web form. Since the email address is only used on the server side, this fully protects the site from spiders. Unfortunately readers receive an interface inferior to their email software and are restricted from keeping a record of the sent email. Though popular, forms are a barrier to communication and I do not recommend them.

CSS Display None
<style>
span.hide {display:none;}
</style>
first.last@domain<span class="hide">null</span>.com


This technique interrupts the email address with some HTML which is set to not display by way of CSS. This could be useful in combination with some of the plain obfuscation techniques but likely adds little to them.
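The difference between what a reader sees and what a simple scraper extracts can be sketched as follows. Both functions are illustrative assumptions: stripTags stands in for a scraper that ignores CSS, and visibleText imitates the browser by dropping the hidden span.

```javascript
const markup = 'first.last@domain<span class="hide">null</span>.com';

// A naive scraper strips tags but ignores CSS, so it keeps the decoy text.
function stripTags(html) {
  return html.replace(/<[^>]*>/g, '');
}

// What the reader sees: the hidden span is dropped along with its content.
function visibleText(html) {
  return stripTags(html.replace(/<span class="hide">.*?<\/span>/g, ''));
}

console.log(stripTags(markup));   // first.last@domainnull.com
console.log(visibleText(markup)); // first.last@domain.com
```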

CSS Pseudo-Element
<style>
address:after {
content: " <first.last\40 domain.com>";
}
</style>
<address>me</address>


I found this tricky method at Newt Edge. It relies on the CSS2 pseudo-element :after, so older browsers plus Opera and Safari are out of luck. (The space after \40 ends the escape; without it the following "d" would be read as another hex digit.) Again, if the style is in a separate file it reduces the chance the address will be found. But it's still almost in the clear.

CSS Backwards Text
<style>
.backwards {unicode-bidi:bidi-override; direction: rtl;}
</style>
<span class="backwards">moc.niamod@tsal.tsrif</span>


This technique is taken from the CSS Play site. It works only in current browsers that support the CSS2 properties unicode-bidi and direction, which means only Internet Explorer 7 and Firefox 1.5. It's cute though.
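Typing an address backwards by hand invites mistakes, so a small helper can produce the reversed text for the markup. The function name reverseForMarkup is my own, used here purely for illustration.

```javascript
// Reverse the characters so the CSS rule can flip them back for display.
function reverseForMarkup(address) {
  return Array.from(address).reverse().join('');
}

console.log(reverseForMarkup('first.last@domain.com')); // moc.niamod@tsal.tsrif
```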

Conclusions

It's easy enough to set up an experiment and see what techniques resist spam. Back in 2004 basic obfuscation and JavaScript worked just fine. I do not think more complicated techniques are justified, though it's fun to see what people come up with.

Notes:
1. When I had almost finished this article, I found a similar one, though the author does not credit any of the techniques.
2. It's a shame I cannot properly demo some of these techniques, but Blogger gets in the way.

19 May 2006

Validating a Blogger Page

Since writing my article on web standards, I have been trying to eliminate validation errors on this blog. Of course I am hampered by the fact that Blogger inserts HTML codes of its own accord. Here are the steps I've taken and the successes I've had. Hopefully this case study will aid others.

I thought that the most appropriate page to focus on was my article on web standards. Validating this page requires changing the markup on the page itself, as well as the overall blog page template. A trial run through the W3C validator produced well over one hundred errors -- yikes!

The first issue is that Blogger uses an XHTML doctype, which I am not overly familiar with. There is no point me trying to change the doctype, since tags inserted automatically are going to conform (we can hope) to XHTML. So I needed to get used to this and alter my coding to suit. A trip to the spec was called for.

I started with the easy stuff. The validator discovered many "empty" tags, such as <br>, <meta> and <img>, which needed to be written with a closing slash, like so: <br />. (And to write that out explicitly in HTML requires even further stratagems. View the source for this page if you are interested.)
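For a page with many such tags, the change could be made mechanically. The following is only a sketch: the function name closeEmptyTags and the list of tag names are my assumptions, and a regular expression is not a real HTML parser, so the result should still be checked with the validator.

```javascript
// Rewrite common empty elements in XHTML style: <br> becomes <br />.
function closeEmptyTags(html) {
  return html.replace(
    /<(br|hr|img|meta|link|input)\b([^>]*?)\s*\/?>/gi,
    // Tag names are lowercased as XHTML requires; attributes are kept as written.
    (_, name, attrs) => '<' + name.toLowerCase() + attrs + ' />'
  );
}

console.log(closeEmptyTags('one<br>two'));               // one<br />two
console.log(closeEmptyTags('<img src="a.png" alt="x">')); // <img src="a.png" alt="x" />
console.log(closeEmptyTags('<br />'));                    // <br />
```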

I have some coding habits left over from HTML 3 days. For example, I add border attributes to <img> tags and language attributes to script tags. These are deprecated, so I removed them. I noticed that some of the code I've got from other sources makes the same mistakes, the PayPal and StatCounter markup for example. There was also one use of a literal ampersand, which is to say & as part of text. This needs to be represented by the proper character entity, namely &amp;.
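Hunting for stray ampersands can also be automated. This sketch, with the invented name escapeBareAmpersands, replaces only those ampersands that do not already begin a character reference; anything that merely looks like a reference is left alone, so treat it as a first pass.

```javascript
// Replace a bare "&" with "&amp;", leaving existing references untouched.
function escapeBareAmpersands(text) {
  return text.replace(/&(?!#\d+;|#x[0-9a-f]+;|[a-z][a-z0-9]*;)/gi, '&amp;');
}

console.log(escapeBareAmpersands('Fish & chips'));     // Fish &amp; chips
console.log(escapeBareAmpersands('Fish &amp; chips')); // Fish &amp; chips
console.log(escapeBareAmpersands('page?a=1&b=2'));     // page?a=1&amp;b=2
```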

Further issues with code I'd copy'n'pasted included the omission of a block-level element within <noscript> tags, easily remedied by adding in a <div>. Or so I thought. This still caused some odd problems, so I ended up removing the block entirely. The justification is that I'm not concerned with tracking readers without JavaScript. Presumably the RSS feed is good enough for them.

I use an inline stylesheet, since there is no way to upload a separate file to Blogger. The <style> tag was missing the attribute type="text/css". In other places validation found actual typos, like a missing quote in an attribute.

Two specific errors caused me problems. In order to get posts showing up on the home page with a "more..." option, I use a method that wraps part of the post in a <span> that is set to be hidden on the main page but visible on item pages. This is a problem because within a page I might use heading tags to indicate structure. This nesting of a block-level element inside an inline element is invalid. The fix is to use a <div> in place of the <span>, with exactly the same stylings. I fixed this for the page in question, but doing so across the site is going to mean editing every single article, something I'm unlikely to want to do. Some things are better to get right the first time!

The other issue is that Blogger inserts breaks automatically to space out content in articles. This means that even between list items there are <br /> tags -- again invalid syntax. The only fix for this is to not use line breaks between a closing </li> and the next <li>. Ugly, but it works.

I encountered an issue with the PayPal form, and this one took me quite a while to investigate. All of the <input> tags within the form generated a nesting error. The only way I could think of fixing this was to put a <div> immediately within the <form> tags, and since this worked I deduce that it is a requirement. I can find no documentation on this, however.

One problem I did not encounter, but which is worth noting, is that XHTML requires tags in all lower-case letters. I write this way already, but for some of you that could be difficult to get used to.

With all of this done, I took one last pass through the validator and got only two errors. Both of these are for the same reason, use of the deprecated name attribute, and both are in the code for the Blogger Bar at the top of the screen. That will have to stay.

Following standards is not an all or nothing affair. Reducing validation errors gets you closer to conformity, and even if you do not reach 100% your page will reap some of the benefits. I'm happy with the progress I've made and what I've learned from the process.

18 May 2006

Why Web Standards?

I am converting a client's website to follow accepted web standards. Specifically, I am validating the HTML and CSS using the W3C tools. Besides following the standards because it's the "right thing" to do, there are many tangible benefits to businesses. I have compiled a comprehensive list of these, along with a set of references. I hope this helps you make the same decision... it's one big step towards a friendlier world wide web!

Introduction

In this case I am using the phrase "web standards" to refer to the series of recommendations made by the W3C and promoted by WaSP. Besides ensuring proper syntax, these standards mandate that tags are used for the appropriate task.

The biggest change in development from the pre-standards mode of working is that tables cannot be used to dictate page structure. Rather they are reserved for their designated task: containing tabular data. It is not easy making this switch, since a number of browser quirks and lapses in standards enforcement complicate the target environment. But it's still a lot better than abandoning standards entirely, and the time spent learning the new methods is an investment in the future.

Standards Benefits

Proper separation of style (CSS) from structure (HTML) allows:
  1. easier customisation (for example by swapping stylesheets) for a more customer-centric experience
  2. reduced maintenance time and costs


Page size is reduced, resulting in:
  1. reduced bandwidth and lower hosting costs
  2. faster page response improving user experience


Conformity increases device compatibility, which means:
  1. better compatibility with older browsers
  2. increased usability by people with vision and other disabilities*
  3. increased usability by mobiles, PDAs and other browsing technologies
  4. greater accessibility to search engine robots
  5. more predictable browser rendering (to a point!)
  6. fewer problems with future browsers (eg: IE 7)


The development process is standardised and takes less time. Specifically, it's easier to:
  1. find errors using validators
  2. gauge conformity across multiple developers
  3. convert compliant documents to other formats
  4. process web server error logs when errors are reduced
  5. hand off development to a new team


Besides these benefits, there may be contractual or legal ramifications if your website is found to be deficient in meeting codes for disabled access. For example, if you are building sites for the US Federal Government, you need to conform to Section 508 of the Rehabilitation Act. In the United Kingdom you should be aware of The Disability Discrimination Act.

The single biggest difficulty in standards compliance is finding developers who understand and appreciate the standards. As usual, the world is swimming in developers, most of whom are far from capable in this regard.

References

Why Websites Look Different in Different Browsers

The Business Case For Web Accessibility

The Business Value of Web Standards

The Way Forward with Web Standards

webXACT accessibility tool (once Bobby)

List of Checkpoints for Web Content Accessibility Guidelines 1.0

Should my Business Website be Compliant?


* There were an estimated 31 million visually impaired people in the Americas and Europe in 2002. Depending on your business segment, that's a large potential market.

16 May 2006

MentalWealth Updated Yet Again

Here I am another week older with yet another update to my MoinMoin theme MentalWealth. Version 0.95 adds some custom icons and allows additional actions to be displayed in the sidebar. Will the features never end?

A helpful user alerted me to the fact that additional actions were not accessible from my theme, since I had done away with the nasty-looking pull-down menu. Not wanting to throw that particular baby out with the bath-water, I have implemented an additional panel, "More Actions", but also added an easy switch in the code so you can turn it off.

In some ways MoinMoin is not very easy to customise, unless one starts hacking code. Today I came across a situation where I wanted to add some icons that could be easily accessed throughout a wiki. It seems that the smilie protocol* would be the neatest way to go. But, although each theme carries with it a repository of icons, there is no way to customise the smilies that will display, beyond replacing existing files.

A small code hack was required, as discussed in the readme file. Now you have access to a nice set of bug tracking icons:

bug icons

To display these on a page, you simply type the following smilies: |b |a |r |g and |z, remembering that they must be surrounded by whitespace. These correspond to the icons for black, amber, red, grey and zap. What you use them for is up to you.

As usual, MentalWealth is available at the MoinMoin ThemeMarket and for convenience also here.


* The Smilie Protocol sounds like a new spy flick, no?