Recently in Web Category

blogs.asia

| No Comments | No TrackBacks |

No, I don't own it. But with your help, together we can win it.

The blogs.asia domain name received more than one application during the .asia landrush period (one of them was mine) and will be auctioned via the dotasia.pool.com site soon (most likely in a couple of weeks). This will be a closed auction - only those who submitted an application during the landrush period will be able to participate.

Now, I don't know how many people wanted this domain or how serious they are about bidding for it. I was trying to register blogs.asia just for fun and will probably lose the auction. So if anybody has ideas about what blogs.asia could become and is willing to spend some money on it, drop me a line and we can try to get it together.

Most Popular Words 2008 (Google, Live)

| 1 Comment | No TrackBacks |

I was doing some Web popularity research and found a very cool data set collected by Philipp Lenssen back in 2006 and 2003. It is basically the Google page count for 27,000 English vocabulary words.

I decided to repeat the process on a wider word set and via at least two search engines (Google and Live Search). So I combined Philipp's 27,000+ word vocabulary with the Wiktionary (a wiki-based open content dictionary) English index and got a quite comprehensive 74,000+ word vocabulary that reflects contemporary English usage on the net. I then collected the page count reported by Google and Live Search for each word.
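The list-merging step itself is trivial; here is a rough sketch of it in C# (just an illustration - the file names are made up, and the actual page count collection against the two search engines happened separately):

using System;
using System.Collections.Generic;
using System.IO;

class MergeWordLists
{
    static void Main()
    {
        // collect unique, lower-cased words from both source lists
        Dictionary<string, bool> seen = new Dictionary<string, bool>();
        List<string> words = new List<string>();
        foreach (string file in new string[] { "lenssen-words.txt", "wiktionary-index.txt" })
        {
            foreach (string line in File.ReadAllLines(file))
            {
                string word = line.Trim().ToLowerInvariant();
                if (word.Length > 0 && !seen.ContainsKey(word))
                {
                    seen[word] = true;
                    words.Add(word);
                }
            }
        }
        words.Sort(StringComparer.Ordinal);
        File.WriteAllLines("combined-vocabulary.txt", words.ToArray());
        Console.WriteLine("{0} unique words", words.Count);
    }
}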

And here are some visualizations. Unfortunately, while Swivel can do great interactive visualizations, including clouds, they only support static graphs for embedding. So don't hesitate to click on the graphs to see a better visualization (e.g. a cloud of the top 100 words).

Top 30 most popular words by Google, Live (numbers are in billions):
Most Popular Words (Google version)    Most Popular Words (Live version)

As expected, the top is occupied by common English words and common internet-related nouns.

Top 30 most popular words by Google vs Live:

 Most Popular Words (Google vs Live)

Top 30 gainers (Google, 2006 to 2008). It's good to see the x48 page count gain for "twitter"; the rest I cannot explain. Can you?

oracular x 163.6
planchette x 153.7
newsy x 93.5
posse x 81.7
nymphet x 75.2
jewelelry x 65.6
twitter x 48.6
paling x 48.2
waylain x 45.2
outmatch x 45.2
outrode x 41.6
pod x 41.0
phizog x 35.6
sinology x 29.9
overdrew x 26.7
multistorey x 26.5
nonstick x 25.6
nun x 25.4
pedicure x 24.8
pillory x 24.8
panty x 24.3
outridden x 24.0
nip x 23.2
naturism x 23.2
organddy x 23.0
piccolo x 22.0
paladin x 21.6
notability x 21.2
breadthways x 20.9

And finally, the top 10 longest words along with their page counts (Google, 2008):

<w c="1460">tetaumatawhakatangihangakoauaotamateaurehaeaturipukapihimaungahoronukupokaiwhenuaakitanarahu</w>
<w c="5620">taumatawhakatangihangakoauauotamateaturipukakapikimaungahoronukupokaiwhenuakitanatahu</w>
<w c="60">methionylglutaminylarginyltyrosylglutamylserylleucylphenylalanylal...serine</w>
<w c="62300">llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch</w>
<w c="20100">taumatawhakatangihangakoauauotamateapokaiwhenuakitanatahu</w>
<w c="285">aequeosalinocalcalinosetaceoaluminosocupreovitriolic</w>
<w c="69000">pneumonoultramicroscopicsilicovolcanoconiosis</w>
<w c="1010">hepaticocholangiocholecystenterostomies</w>
<w c="18">hepaticocholangiocholecystenterostomy</w>
<w c="74500">hippopotomonstrosesquippedaliophobia</w>

Unsurprisingly, the longest word is still the 92-letter name of a hill in New Zealand; that one is hard to beat.

 

The raw data sets (page counts for 74,000+ words) are available in XML format and also on Swivel (Google version, Live version), where you can play with them, visualizing and comparing in your own way. Can you come up with any more interesting visualizations or comparisons for this data set? Enjoy.
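By the way, loading the XML is a one-screen job if you want to crunch it yourself. A minimal streaming sketch (the file name is made up; the <w c="count">word</w> element format is taken from the samples above):

using System;
using System.Collections.Generic;
using System.Xml;

class LoadWordCounts
{
    static void Main()
    {
        Dictionary<string, long> counts = new Dictionary<string, long>();
        using (XmlReader reader = XmlReader.Create("google-word-counts.xml"))
        {
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "w")
                {
                    // read the count attribute, then consume the element text (the word itself)
                    long count = long.Parse(reader.GetAttribute("c"));
                    counts[reader.ReadElementContentAsString()] = count;
                }
                else
                {
                    reader.Read();
                }
            }
        }
        Console.WriteLine("Loaded {0} words", counts.Count);
    }
}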

Crowdsourcing in action: results

| No Comments | No TrackBacks | ,

I wrote before about a pilot the Library of Congress was doing with Flickr. I also measured the number of tags, notes and comments and repeated the process several times during the last 2 months. Here are some numeric results:

Library of Congress pilot on Flickr

As expected, while tags, notes and comments are still coming in, the lines are in general almost flat after 50 days.

Averages: 4.85 unique tags,  0.39 notes, 1.34 comments per photo.

The Library of Congress blog shared some real results:

And because we government-types love to talk about results, there are some tangible outcomes of the Flickr pilot to report: As of this writing, 68 of our bibliographic records have been modified thanks to this project and all of those awesome Flickr members.

Well, that doesn't impress much, but they must be happy, as they have posted 50 more photos.

The Library of Congress has launched an interesting pilot project with Flickr, which can be characterized as a crowdsourcing experiment.

They have uploaded 3115 copyright-free photos from two of their most popular collections, and in return they hope the Flickr community will enhance the collections by labeling and commenting on the images:

We want people to tag, comment and make notes on the images, just like any other Flickr photo, which will benefit not only the community but also the collections themselves. For instance, many photos are missing key caption information such as where the photo was taken and who is pictured. If such information is collected via Flickr members, it can potentially enhance the quality of the bibliographic records for the images.

Crowdsourcing is a special case of human-based computation, a technique for solving problems that computers are just incapable of (or, if you wish, problems for which humans cannot yet program computers). The simple idea behind human-based computation is to outsource certain steps to humans. And if you outsource those steps to the crowd, you get crowdsourcing:

Crowdsourcing is a neologism for the act of taking a task traditionally performed by an employee or contractor, and outsourcing it to an undefined, generally large group of people, in the form of an open call. For example, the public may be invited to develop a new technology, carry out a design task, refine an algorithm or help capture, systematize or analyze large amounts of data (see also citizen science).

  Think about tagging images (Google Image Labeler), answering arbitrary human questions (Yahoo! Answers), selecting the most interesting stories (Digg, reddit), inventing better algorithms (Netflix prize) or even monitoring the Texas-Mexican border.

Btw, did you know that Google didn't invent Google Image Labeler, but licensed Luis von Ahn's ESP Game? And that while the crowd works for free on Google Image Labeler, improving Google's image search, Google never shares the collected tags? I don't think that's fair. Moreover, I think that's unfair. Results of crowdsourcing should be available to the crowd, right?

Anyway, how is the pilot going? From the Flickr blog we learn the first results:

In the 24 hours after we launched, you added over 4,000 unique tags across the collection (about 19,000 tags were added in total, for example, “Rosie the Riveter” has been added to 10 different photos so far). You left just over 500 comments (most of which were remarkably informative and helpful), and the Library has made a ton of new friends (almost overwhelming the email account at the Library, thanks to all the “Someone has made you a contact” emails)!

That was after 24 hours. Today, 10 days later, the results (according to my little script) are: 2440 comments, 570 notes, 13077 unique tags.

That's almost 500% more comments and 300% more tags - on average 0.8 comments and 4.2 tags per image. Not bad, but not very impressive either. It will be interesting to check it again in a month to see what the trend is.

It's also interesting to see when the bad guys will start to abuse it. Google Image Labeler was abused less than a month after its launch. And Google Image Labeler is protected from abuse by accepting only tags selected by both players independently, while on Flickr there is no such protection whatsoever.

I also figured out that while these 3115 photos were posted to Flickr, there are about 1 million others available online in the Library of Congress's own Prints & Photographs Online Catalog, which is really astounding. Check out this picture of General Allenby's entrance into Jerusalem back in 1917:

Scanned from b&w film copy negative, no known restrictions on publication, freely available as uncompressed tiff (1,725 kilobytes). Now that's real wow.

Do you realize that PDF documents can contain embedded JavaScript code? Yes, they can. Adobe Acrobat Reader supports JavaScript 1.5, extended by Adobe, and it allows such sweet things as dynamic manipulation of PDF content and appearance, database-driven PDF documents, multimedia, layers, 3D, Flash in PDF (!) etc.

We've seen fancy PDF documents with animations and lame PDF calculators, but where is the real beef? Where is the Web 2.0-like stuff, where are the AJAXy PDF eBooks?

The platform appears to be strong enough and Adobe Acrobat Reader's market penetration must be huge, so why are smart eBooks still nowhere to be seen? I can imagine lots of opportunities:

An autocomplete search field prepopulated with index words? That would improve search in huge documents, which is still a nightmare despite all Adobe's efforts.

Dynamic context ads in eBooks? Many would hate it, but authors and publishers would appreciate such a revenue stream. Say, a small text-based context ad on every 5th page wouldn't harm much. After all, unlike Web pages, eBooks usually have lots of real content, so it should be easy to produce really well-targeted context ads.

Social features in eBooks? That might be huge. eBook readers form a natural social community, which is currently completely hidden. A "Recent readers" sidebar, annotations, ratings, comments, chatting, "Digg this book"?

Autoupdating eBooks? "New book edition is published, get it here" or even "Book updates available, download?". Why not?

These are just a few of the most obvious ideas. Surely you can come up with more.

So far, to let Google know about your sitemap you had to submit it through Google Webmaster Tools. Now, according to the Official Google Webmaster Central Blog, you can just reference it in your robots.txt file:

Specifying the Sitemap location in your robots.txt file

You can specify the location of the Sitemap using a robots.txt file. To do this, simply add the following line:

Sitemap: <sitemap_location>

The <sitemap_location> should be the complete URL to the Sitemap, such as: http://www.example.com/sitemap.xml

This directive is independent of the user-agent line, so it doesn't matter where you place it in your file. If you have a Sitemap index file, you can include the location of just that file. You don't need to list each individual Sitemap listed in the index file.

Now that's much easier.
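For example, a complete robots.txt using the directive could look like this (example.com is just a placeholder; since the directive is independent of the user-agent line, its position doesn't matter):

User-agent: *
Disallow: /private/

Sitemap: http://www.example.com/sitemap.xml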

As you know, the W3C woke up recently and restarted its HTML activity. The WHAT Working Group, which has been working independently on HTML5 since 2004, now proposes that the W3C adopt HTML5 as a starting point and put their Google guy in charge of the new W3C HTML 5.

By the way, the need to annotate HTML 5 with "W3C" means there is already potential for confusion. In fact, the WHATWG is working on two specifications, Web Applications 1.0 and Web Forms 2.0, but for some reason they call them both HTML5. Now they are proposing that the W3C adopt Web Applications 1.0 and Web Forms 2.0 ("WHATWG HTML5"), along with their editor, as the starting point for the W3C's next HTML version ("W3C HTML").

Anyway, here is the proposal and the discussion:

Dear HTML Working Group,

HTML5, comprising the Web Apps 1.0 and Web Forms 2.0 specifications,  
is the product of many years of collaborative effort. It specifies  
existing HTML4 markup and APIs with much clearer conformance criteria  
for both implementations and documents. It specifies many useful  
additions, in many cases drawing on features that have existed in  
browser-based implementations for a long time. And it actively draws  
on feedback from implementors and content authors. Therefore, we the  
undersigned propose the following:

- that the W3C HTML Working Group adopts the WHAT Working Group's  
HTML5 as the starting point for further HTML development
- that the W3C's next-generation HTML specification is officially  
named "HTML 5"
- that Ian Hickson is named as editor for the W3C's HTML 5  
specification, to preserve continuity with the existing WHATWG effort

If HTML5 is adopted as a starting point, the contents of the document  
would still be up for review and revision, but we would start with  
the existing text. A suitable next step might be a high-level review  
of functionality added and removed relative to HTML4.01, followed by  
focused discussion and review of individual topic areas, including  
both content already in the spec and proposed new features.  
Discussions should be guided by common principles along the lines of  
<http://esw.w3.org/topic/HTML/ProposedDesignPrinciples>

If the group is agreeable to these proposals, Apple, Mozilla and  
Opera will agree to arrange a non-exclusive copyright assignment to  
the W3 Consortium for HTML5 specifications.


L. David Baron, Mozilla Foundation
Lars Erik Bolstad, Opera Software ASA
Brendan Eich, Mozilla Foundation
Dave Hyatt, Apple Inc.
Håkon Wium Lie, Opera Software ASA
Maciej Stachowiak, Apple Inc.

As you can see, Mozilla, Opera, Apple and Google (Ian Hickson) are all here. Now the W3C HTML WG chairs, Chris Wilson (Microsoft) and Dan Connolly (W3C/MIT), have to decide. Interesting. So far it seems like people on the public-html mailing list like the idea, but personally I don't believe it's gonna happen. I'd like to be wrong, though.

And here is another interesting tidbit:

If the HTMLWG adopts the WHATWG spec as a starting point, and asks me to edit the HTML spec, then there will only be one spec. The WHATWG spec and the HTML WG spec would be one and the same. 

Ian Hickson

And if not then what? Two different HTML 5 specifications? OMG. Interesting times ahead.

Amazon Context Links

| 1 Comment | No TrackBacks | ,

Amazon has launched the Context Links Beta program. The idea is that you insert a little Amazon script into your pages, and when a page is opened in a browser the script identifies words and phrases it thinks are relevant and turns them into links to various Amazon products.

I enabled the script on my blog's front page (the pink double-underlined links) to see how relevant it is, and here are the results:

  1. "WHATWG" - Designing with Web Standards (2nd Edition) by Jeffrey Zeldman
  2. "Google Apps" - "Google Maps Hacks" by Rich Gibson
  3. "Google Reader" - Google for Dummies by Brad Hill
  4. "Fuck Windows" - "Death is a Window by E.C. Blount
  5. "The Pragmatic Programmer's" - Der Pragmatische Programmierer. by Andrew Hunt

Well, sure there are lots of opportunities for improving relevance, but still not bad at all for a beta.

In ASP.NET, when you are building a server control that includes an HTTP handler, you have this problem: the HTTP handler has to be registered in Web.config. That means it's not enough for your customer developer to drop the control on her Web form and set up its properties. One more step is required - manual editing of the config - which is a usability horror.

How do you make your customer aware that she needs to perform this additional action? Documentation? Yes, but who reads documentation for controls? I know I never do; I usually just drop a control on the page and poke around its properties to figure out what I need to set up to make it work ASAP.

So here is a nice trick for avoiding manual Web.config editing (I found it in the ScriptAculoUs autocomplete web control).

  1. Make sure your control has a designer.
  2. In your control's designer class override ControlDesigner.GetDesignTimeHtml() method, which is called each time your control needs to be represented in design mode.
  3. In the GetDesignTimeHtml() method check if your HTTP handler is already registered in Web.config, and if it isn't - just register it.
Here is sample code that is worth a hundred words:
using System;
using System.Web.UI.Design;
using System.Security.Permissions;
using System.Configuration;
using System.Web.Configuration;
using System.Windows.Forms;

namespace XMLLab.WordXMLViewer
{
    [SecurityPermission(SecurityAction.Demand, 
        Flags = SecurityPermissionFlag.UnmanagedCode)]
    public class WordXMLViewerDesigner : ControlDesigner
    {
        private void RegisterImageHttpHandler()
        {
            IWebApplication webApplication = 
                (IWebApplication)this.GetService(typeof(IWebApplication));

            if (webApplication != null)
            {
                Configuration configuration = webApplication.OpenWebConfiguration(false);
                if (configuration != null)
                {
                    HttpHandlersSection section = 
                        (HttpHandlersSection)configuration.GetSection(
                        "system.web/httpHandlers");
                    if (section == null)
                    {
                        section = new HttpHandlersSection();
                        ConfigurationSectionGroup group = 
                            configuration.GetSectionGroup("system.web");
                        if (group == null)
                        {
                            // the "system.web" section group may not exist yet, so create it first
                            group = new ConfigurationSectionGroup();
                            configuration.SectionGroups.Add("system.web", group);
                        }
                        group.Sections.Add("httpHandlers", section);
                    }
                    section.Handlers.Add(Action);
                    configuration.Save(ConfigurationSaveMode.Minimal);
                }
            }
        }


        private bool IsHttpHandlerRegistered()
        {
            IWebApplication webApplication = 
                (IWebApplication)this.GetService(typeof(IWebApplication));

            if (webApplication != null)
            {
                Configuration configuration = 
                    webApplication.OpenWebConfiguration(true);

                if (configuration != null)
                {
                    HttpHandlersSection section = 
                        (HttpHandlersSection)configuration.GetSection(
                        "system.web/httpHandlers");

                    if ((section != null) && (section.Handlers.IndexOf(Action) >= 0))
                        return true;
                }
            }
            return false;
        }


        static HttpHandlerAction Action
        {
            get
            {
                return new HttpHandlerAction(
                    "image.ashx", 
                    "XMLLab.WordXMLViewer.ImageHandler, XMLLab.WordXMLViewer", 
                    "*"
                );
            }
        }

        public override string GetDesignTimeHtml(DesignerRegionCollection regions)
        {
            if (!IsHttpHandlerRegistered() && 
                (MessageBox.Show(
                "Do you want to automatically register the HttpHandler needed by this control in the web.config?", 
                "Confirmation", MessageBoxButtons.YesNo, 
                MessageBoxIcon.Exclamation) == DialogResult.Yes))
                RegisterImageHttpHandler();
            return base.CreatePlaceHolderDesignTimeHtml("Word 2003 XML Viewer");
        }
    }
}
Obviously this only works if your control gets rendered at least once in Design mode, which isn't always the case. Some freaks (including /me) prefer to work with Web forms in Source mode, so you still need to document how to update Web.config manually to make your control work.

I was cleaning up my backyard and found this control I never finished. So I finished it. Here is Word 2003 XML Viewer Control v1.0, just in case somebody needs it. It's an ASP.NET 2.0 Web server control that displays arbitrary Microsoft Word 2003 XML documents (aka WordML, aka WordprocessingML) on the Web, so people who don't have Microsoft Office 2003 installed can browse the documents using only a browser.

The control renders Word 2003 XML documents by transforming the content to HTML, preserving the styling and extracting the images. Both Internet Explorer and Firefox are supported.

Word 2003 XML Viewer Control is a Web version of the Microsoft Word 2003 XML Viewer tool and uses the same WordML-to-HTML transformation stylesheet, thus providing the same rendering quality.

The control is free and open source; download it here, find the documentation here.

I'm doing an interesting trick with images in this control. The problem is that in WordML images are embedded in the document, so they need to be extracted when transforming to HTML, and I wanted to avoid writing images to the file system. So the trick is to extract each image while generating the HTML (via XSLT), assign it a GUID, put it into the session and generate an <img> src attribute that requests the image by GUID. Then, when the browser renders the HTML, it requests the images by GUID and a custom HTTP handler pulls them from the session.
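Just to illustrate the idea, here is a minimal sketch of such a session-backed handler (this is not the control's actual code; the query string parameter name and the hardcoded content type are my simplifications):

using System.Web;
using System.Web.SessionState;

namespace XMLLab.WordXMLViewer
{
    // Serves images previously stashed in session state under a GUID key,
    // referenced from the generated HTML as image.ashx?id=<guid>.
    public class ImageHandler : IHttpHandler, IRequiresSessionState
    {
        public bool IsReusable { get { return true; } }

        public void ProcessRequest(HttpContext context)
        {
            string id = context.Request.QueryString["id"]; // hypothetical parameter name
            byte[] image = (id != null) ? context.Session[id] as byte[] : null;
            if (image == null)
            {
                context.Response.StatusCode = 404;
                return;
            }
            context.Response.ContentType = "image/png"; // real code would track the actual image type
            context.Response.BinaryWrite(image);
        }
    }
}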

Having an HTTP handler in an ASP.NET control posed another problem: how do you register the HTTP handler in Web.config automatically? AFAIK there is no out-of-the-box solution for this, but happily I found one that covers the major use case. Here is a piece of the documentation:

When you add the first Word 2003 XML Viewer Control to your Web project, you should see the following confirmation dialog: "Do you want to automatically register the HttpHandler needed by this control in the web.config?". You must answer Yes to allow the control to register the image handler in the Web.config. If you don't answer Yes, or if you don't add the control in Design mode, you have to add the following definition to the Web.config in the <system.web> section:
<httpHandlers>
   <add path="image.ashx" verb="*" type="XMLLab.WordXMLViewer.ImageHandler, XMLLab.WordXMLViewer" />
</httpHandlers>

Yep, the hint is the Design mode. I'll post about this trick tomorrow.

The usage is simple - just drop the control and assign the "DocumentSource" property (the Word 2003 XML file you want to show).
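For illustration, the page markup ends up looking something like this (the tag prefix, control tag name and document path are just examples, not necessarily the exact ones the control uses):

<%@ Register TagPrefix="wxv" Namespace="XMLLab.WordXMLViewer" Assembly="XMLLab.WordXMLViewer" %>

<wxv:WordXMLViewer ID="viewer" runat="server" DocumentSource="~/docs/report.xml" />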

I deliberately named this control "Word 2003 XML Viewer Control" to avoid confusion. But I'll update it to support Word 2007 as soon as there is a solution to the Word 2007 to HTML transformation problem.

Any comments are welcome. Enjoy.

Many people convert PDF to Word as a way to get a more easily editable document, which only PDF conversion can easily accomplish: if you didn't convert the PDF to Word, you'd have to transcribe the document manually, while PDF-to-Word software does it in a few clicks.

Google Mobile Proxy

| No Comments | No TrackBacks | ,

This is not particularly new, but I didn't know about it before today, and it did save my ass this morning when I really needed to browse a non-mobile-friendly site on my phone.

It's http://www.google.com/gwt/n - the Google mobile proxy service. It lets you browse any web site on your mobile by "adapting" it - reformatting the content and sort of squeezing it. As a matter of interest, it also strips out Google AdSense ads. Cool.

After years wasted on XHTML and XForms development, and before the WHATWG totally takes over HTML, the W3C woke up and restarted its HTML activity. Just about time.

Yes, believe it or not: after HTML 4.01, which was finished back in 1999, the W3C did nothing to improve HTML.

Meanwhile Google, Apple, Mozilla and Opera, disappointed by the W3C's lack of interest in further HTML development, created the WHATWG (Web Hypertext Application Technology Working Group), whose tagline is nothing less than "Maintaining and evolving HTML since 2004".

It's interesting to note that one other major browser vendor never participated in the WHATWG - and guess who is chairing the new W3C HTML working group? Chris Wilson (Microsoft) and Dan Connolly (W3C/MIT) - and if you look back as far as 2006/11, it was Chris Wilson alone.

The new W3C HTML working group is scheduled to deliver a new HTML version (in both classic HTML and XML syntaxes) by 2010. That's only 3 years. I doubt the W3C as it is now can deliver something as important as HTML5 in just 3 years.

The WHATWG must be pretty pissed off now. From the WHATWG blog:

Surprisingly, the W3C never actually contacted the WHATWG during the chartering process. However, the WHATWG model has clearly had some influence on the creation of this group, and the charter says that the W3C will try to “actively pursue convergence with WHATWG”. Hopefully they will get in contact soon.

Well, actually the charter says more:

Web Hypertext Application Technology Working Group (WHATWG)
The HTML Working Group will actively pursue convergence with WHATWG, encouraging open participation within the bounds of the W3C patent policy and available resources.

Good enough. 

I'm only afraid that the W3C could kill the WHATWG and then bury HTML5 in endless meetings settling dependencies, IP issues, conflicting corporate interests and such. The W3C could easily spend 5-10 years on HTML5.

Google Launches Apps Premier Edition

| 1 Comment | 1 TrackBack |

Google has launched the Premier Edition of google.com/a - Google Apps. It includes:

  • Gmail (10Gb mailbox), Google Talk, Google Calendar, Docs & Spreadsheets, Page Creator and Start Page
  •  99.9% uptime guarantee for email (only for email?)
  • opt-out for ads in email - doh!
  • Shared calendar
  • Single sign-on
  • User provisioning and management
  • Support for email gateway
  • Email migration tools (Limited Release)
  • 24/7 assistance, including phone support
  • 3rd party applications and services

for a mere $50/year per user. Sounds tempting for microISVs and small businesses.

They say it's a cheap alternative to Microsoft Office. I'm not convinced. $50 x 4-5 years adds up to the price of Microsoft Office, and I'm not sure one can compare Google Apps and Microsoft Office feature-wise (yet).

What's more interesting is the /. gang's very cold reaction. Are sysadmins afraid Google is taking their jobs away?

R.I.P. GotDotNet

| No Comments | 1 TrackBack | ,

Microsoft has decided to shut down the GotDotNet site by July 2007. The official announcement goes like this:

Microsoft will be phasing out the GotDotNet site by July 2007.

Microsoft will phase out all GotDotNet functionality by July 2007. We will phase out features according to the schedule below. During the phase-out we will ensure that requests for features or pages that are no longer available will render enough information for you to understand what has changed. If you have any questions please don’t hesitate to contact the GotDotNet Support team.
We are phasing out GotDotNet for the following reasons:

  • Microsoft wants to eliminate redundant functionality between GotDotNet and other community resources provided by Microsoft
  • Traffic and usage of GotDotNet features has significantly decreased over the last six months
  • Microsoft wants to reinvest the resources currently used for GotDotNet in new and better community features for our customers
If you are still hosting anything at GotDotNet, here are your moving deadlines:

Phase Out Schedule
The GotDotNet phase out will be carried out in phases according to the following timetable:

Target Date - Areas to be Closed
February 20 - Partners, Resource Center, Microsoft Tools
March 20 - Private workspaces, Team pages, Message Boards
April 24 - GDN CodeGallery (projected date)
May 22 - GDN User Samples (projected date)
June 19 - GDN Workspaces (projected date)

Well, obviously that was inevitable. GotDotNet sucked big time despite any efforts made. It looks like Microsoft was learning how to do open source project hosting on the web, and GotDotNet was that first pancake that always comes out spoiled. CodePlex definitely tastes better.

There are a couple of projects still hosted at GotDotNet that I care about:

  • Chris Lovett's SgmlReader. An awesome tool for reading HTML via XmlReader. I suggested that Chris contribute SgmlReader to the Mvp.Xml project; let's see if he likes the idea.
  • XPathReader. A cool pull-based streaming XML parser supporting XPath queries. I'm actually an admin there, so I think we are going to move XPathReader under the Mvp.Xml project umbrella real soon.

Well, GotDotNet is dead. Long live CodePlex!

Google Reader reports subscriber counts

| No Comments | 1 TrackBack | ,

According to the Official Google Reader Blog, Google's feed crawler, Feedfetcher, has started to report subscriber counts. "The count includes subscribers from Google Reader and the Google Personalized Homepage, and in the future may include other Google products that support feeds."

What I found interesting is that they do it via the User-Agent string. That's a very simple and nice solution, and it's apparently nothing new: I just looked at my blog's log file and found subscriber info from a variety of feed crawlers:

    GET /blog/index.xml - x.x.x.x HTTP/1.1 Bloglines/3.1+(http://www.bloglines.com;+154+subscribers)
    GET /blog/index.xml - x.x.x.x HTTP/1.1 NewsGatorOnline/2.0+(http://www.newsgator.com;+99+subscribers)
    GET /blog/index.xml - x.x.x.x HTTP/1.1 Feedfetcher-Google;+(+http://www.google.com/feedfetcher.html;
    +167+subscribers;+feed-id=xxxxxxx)
    GET /blog/index.xml - x.x.x.x HTTP/1.1 Newshutch/1.0+(http://newshutch.com;+12+subscribers)
    

And even such a funny user agent as this one:

    GET /blog/index.xml - x.x.x.x HTTP/1.1 Mozilla/5.0+(X11;+U;+Linux+i686;+en-US;+rv:1.2.1;
    +Rojo+1.0;+http://www.rojo.com/corporate/help/agg/;
    +Aggregating+on+behalf+of+15+subscriber(s)+online+at+http://www.rojo.com/?feed-id=xxx)+Gecko/20021130
    

    One might claim that's user agent header abuse, but I don't think so. Here is what RFC 2616 (HTTP) has to say:

    14.43 User-Agent
    The User-Agent request-header field contains information about the user agent originating the request. This is for statistical purposes, the tracing of protocol violations, and automated recognition of user agents for the sake of tailoring responses to avoid particular user agent limitations. User agents SHOULD include this field with requests. The field can contain multiple product tokens (section 3.8) and comments identifying the agent and any subproducts which form a significant part of the user agent. By convention, the product tokens are listed in order of their significance for identifying the application.

Statistical purposes - that's it.
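If you want to scrape those counts out of your own raw logs, a few lines of code will do. A quick sketch based on the user agent samples above (the log file name is made up, and the oddball Rojo "subscriber(s)" format would need its own pattern):

using System;
using System.IO;
using System.Text.RegularExpressions;

class SubscriberCounts
{
    static void Main()
    {
        // matches the "...;+NNN+subscribers..." convention seen in the log samples
        Regex pattern = new Regex(@"(\d+)\+subscribers", RegexOptions.IgnoreCase);
        foreach (string line in File.ReadAllLines("blog-access.log"))
        {
            Match m = pattern.Match(line);
            if (m.Success)
                Console.WriteLine("{0} subscribers: {1}", m.Groups[1].Value, line);
        }
    }
}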

Oh, and while at it, I should admit I have switched to Google Reader completely and haven't run RSS Bandit for months now. RSS Bandit has tons of cool features, but I always knew I needed a lightweight Web-based feed reader. I tried Bloglines repeatedly, but only with Google Reader did I find myself really comfortable from the first minute. That's a great application.

Googlomania

| No Comments | No TrackBacks |

Google somehow seems to be inaccessible (down?) from my place for at least 15 minutes now, and I already feel uncomfortable, if not desperate. I want my mail, news and search back! Seriously, WTF? How come I am so dependent on Google? Ok, great, who else does search on the Web?

This is a disturbing story. An evil person doing phishing collected 56,000 MySpace user names and passwords and posted them to the "Full-Disclosure" mailing list, which is an open "unmoderated mailing list for the discussion of security issues" that anybody can subscribe to.

Now, of course, the mailing list is open and archived by dozens of sites, and of course MySpace could just change the passwords for the compromised users. But no, instead they decided to shut down one particular security site (seclists.org - why only this one?) that also happens to archive the "Full-Disclosure" mailing list.

And MySpace wanted it done real fast, so, not bothering with bullshit like contacting the seclists.org site owner or hosting company, they contacted the domain name registrar (!), which happens to be the well-respected (so far) Go-Daddy.com, and somehow convinced them to remove the whole seclists.org domain name from the DNS. Now that's cool.

The site is back up now, but Go-Daddy still defends the seclists.org takedown, which smells worse and worse. Go-Daddy used to be my favorite domain name registrar. Now I (and probably many others) am not so sure. It's amazing how Go-Daddy turned MySpace's problem into their own problem.

Looking for ASP.NET hosting recommendations

| 3 Comments | No TrackBacks |

I've finally decided to switch web hosting. I'm currently on webhost4life, but I'm really not up for that "4life" part. It's getting slower and slower, while people seem to be running away from them.

So I'm looking for ASP.NET hosting recommendations. I need to host at least 3 domains with DotNetNuke, CommunityServer, MS SQL and MySQL - nothing special.

I've heard both good and bad words about ASPNix, but what about HostingFest? Where do you host your Windows stuff?

Update: problem totally solved - I got hosting I couldn't even dream about. All you Microsoft MVPs who are as slow as me - subscribe to the private "3rd offers" newsgroup now, I mean NOW!

Planet Mobile Web

| No Comments | No TrackBacks |

The W3C presented the Planet Mobile Web site, which aggregates multiple blogs that discuss the Mobile Web. It's hosted by the W3C Mobile Web Initiative. Really interesting reading. (feed) Subscribed.

I skimmed some posts, and it looks like the million dollar question every mobile blogger is now thinking about is "how the hell can we get AJAX to work on mobile???". Mobile Web 2.0 is another revolution still waiting to happen.

The W3C has announced Mobile Web Best Practices 1.0 as a Proposed Recommendation:

    Written for designers of Web sites and content management systems, these guidelines describe how to author Web content that works well on mobile devices. Thirty organizations participating in the Mobile Web Initiative achieved consensus and encourage adoption and implementation of these guidelines to improve user experience and to achieve the goal of "one Web." Read about the Mobile Web Initiative.

That's actually a very interesting document. It's definitely a must-read for anybody targeting the Mobile Web, which is very different from the Web we know, and not only because of device limitations:

    Mobile users typically have different interests to users of fixed or desktop devices. They are likely to have more immediate and goal-directed intentions than desktop Web users. Their intentions are often to find out specific pieces of information that are relevant to their context. An example of such a goal-directed application might be the user requiring specific information about schedules for a journey they are currently undertaking.

    Equally, mobile users are typically less interested in lengthy documents or in browsing. The ergonomics of the device are frequently unsuitable for reading lengthy documents, and users will often only access such information from mobile devices as a last resort, because more convenient access is not available.

Still, there is a dream about "One Web":

    The recommendations in this document are intended to improve the experience of the Web on mobile devices. While the recommendations are not specifically addressed at the desktop browsing experience, it must be understood that they are made in the context of wishing to work towards "One Web".

    As discussed in the Scope document [Scope], One Web means making, as far as is reasonable, the same information and services available to users irrespective of the device they are using. However, it does not mean that exactly the same information is available in exactly the same representation across all devices. The context of mobile use, device capability variations, bandwidth issues and mobile network capabilities all affect the representation. Furthermore, some services and information are more suitable for and targeted at particular user contexts (see 5.1.1 Thematic Consistency of Resource Identified by a URI).

    Some services have a primarily mobile appeal (location based services, for example). Some have a primarily mobile appeal but have a complementary desktop aspect (for instance for complex configuration tasks). Still others have a primarily desktop appeal but a complementary mobile aspect (possibly for alerting). Finally there will remain some Web applications that have a primarily desktop appeal (lengthy reference material, rich images, for example).

    It is likely that application designers and service providers will wish to provide the best possible experience in the context in which their service has the most appeal. However, while services may be most appropriately experienced in one context or another, it is considered best practice to provide as reasonable experience as is possible given device limitations and not to exclude access from any particular class of device, except where this is necessary because of device limitations.

    From the perspective of this document this means that services should be available as some variant of HTML over HTTP.

    What about "Web 2.0"? Well,

    No support for client side scripting.

I recently got a Motorola RAZR V3X - a cool 3G phone (btw, 3G really rocks) - and all of a sudden I'm all about the Mobile Web. This is fascinating technology with a huge future. I've got lots of plans that are gonna make me millions... if only I had some more spare time :(

Wikipedia under fire

| No Comments | No TrackBacks |

Just one morning's worth of topics:

    Wikipedia Used To Spread Virus

    "The German Wikipedia has recently been used to launch a virus attack. Hackers posted a link to an all alleged fix for a new version of the blaster worm. Instead, it was a link to download malicious software. They then sent e-mails advising people to update their computers and directed them to the Wikipedia article. Since Wikipedia has been gaining more trust & credibility, I can see how this would work in some cases. The page has, of course, been fixed but this is nevertheless a valuable lesson for Wikipedia users."

    Wikipedia and Plagiarism

    Daniel Brandt found the examples of suspected plagiarism at Wikipedia using a program he created to run a few sentences from about 12,000 articles against Google Inc.'s search engine. He removed matches in which another site appeared to be copying from Wikipedia, rather than the other way around, and examples in which material is in the public domain and was properly attributed. Brandt ended with a list of 142 articles, which he brought to Wikipedia's attention.... 'They present it as an encyclopedia," Brandt said Friday. "They go around claiming it's almost as good as Britannica. They are trying to be mainstream respectable.'"

    Long-Term Wikipedia Vandalism Exposed

    "The accuracy of Wikipedia, the free online encyclopedia, came into question again when a long-standing article on 'NPA personality theory' was confirmed to be a hoax. Not only had the article survived at Wikipedia for the better part of a year, but it had even been listed as a 'Good Article,' supposedly placing it in the top 0.2-0.3% of all Wikipedia articles — despite being almost entirely written by the creator of the theory himself."

The good thing is that once discovered, all the problems were immediately cleaned up and the offenders banned. Wikipedia is really fast at fixing problems. The conclusion, of course: Wikipedia is a great free resource, but don't believe everything you read there.

SPI Dynamics has published a whitepaper, "Ajax Security Dangers":

    While Ajax can greatly improve the usability of a Web application, it can also create several opportunities for possible attack if the application is not designed with security in mind. Since Ajax Web applications exist on both the client and the server, they include the following security issues:

    • Create a larger attack surface with many more inputs to secure
    • Expose internal functions of the Web application server
    • Allow a client-side script to access third-party resources with no built-in security mechanisms

Of all the dangers, one sounds the most horrible - the authors claim that "Ajax Amplifies XSS". Ajax allows cross-site scripting (XSS) attacks to spread like a virus or worm. And these are not imaginary threats; the attacks are already happening.

The first widely known AJAX worm was the "Samy worm" (or "JS.Spacehero worm"), which hit 1,000,000+ MySpace users in less than 20 hours back in 2005, and then struck again.

    In 2006 "The Yamanner worm" infested Yahoo Mail and managed to capture thousands email addresses and uploaded them to a still unidentified Web site.

Mind you, the problem wasn't that Yahoo or MySpace staff are incompetent:

    "The problem isn't that Yahoo is incompetent. The problem is that filtering JavaScript to make it safe is very, very hard," said David Wagner, assistant professor of computer science at the University of California at Berkeley

It's surely just a matter of time before Google or Microsoft Ajax-based applications get hacked, not to mention vendors with less experienced developers, driven to Ajax by the hype and widely leveraging the "cut and paste" coding technique.

    "JavaScript was dangerous before Ajax came around," noted Billy Hoffman, lead R&D researcher at SPI Dynamics Inc., a computer security firm. With the addition of Ajax functionality in many other Web applications, the problem is going to get worse before it gets better, he said.

A pessimistic summary, but what would you expect in a "Worse is Better" world?