2007-05-15

AOL CSS Parser

Posted in Practical at 20:36:05 by streetprogramming

A little validation goes a long way …

A client called to mention that their site looked like it had been up all night snorting cocaine with our President when viewed in AOL. After complaining a good deal about AOL, and who uses that crap anymore, and so forth, I decided to check it out. As it turns out, AOL does seem to use Internet Explorer as a COM component. Unfortunately, it does not appear to use IE’s CSS parser.

After more time than I’m willing to admit, it was pointed out that the page did not appear to be using all of the CSS files specified. After validating the CSS files, it turned out that a single stray character at the end of one file had ruined the entire site: *.
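In hindsight, a quick look at the tail of the stylesheet would have made the stray character obvious. Something along these lines would do it (the URL here is made up, not the client’s):

$> curl --silent http://client-site.example.com/styles/site.css | tail -c 40

Either way, the moral stands: run your CSS through a validator before blaming the browser.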

2007-03-03

Unique URIs

Posted in Practical at 17:34:03 by streetprogramming

Here’s an obscurity … if you’re acquainted with perl’s URI package, you might have written code such as the following:

use URI;

my $base = URI->new( 'http://some.com/folder/' );
my $uri = URI->new_abs( '../file.xml', $base );

print "$uri\n";
http://some.com/file.xml

The output is the relative reference resolved against the base URI. This is very useful when following relative links on a site, as the resolution is handled for you automatically. But there is odd behavior afoot. Consider this:


use URI;

print URI->new("../../../foo")->abs("http://some.com/deep/folder/"), "\n";
http://some.com/../foo

Note how the .. is kept in the absolute URI. Now consider this example, taken right from the URI POD:


use URI;

print URI->new("../../../foo")->abs("http://some.com/deep/folder/"), "\n";

$URI::ABS_REMOTE_LEADING_DOTS = 1;

print URI->new("../../../foo")->abs("http://some.com/deep/folder/"), "\n";

http://some.com/../foo
http://some.com/foo

Cool! It took out the extra .., effectively normalizing the URI. But wait:

use URI;

$URI::ABS_REMOTE_LEADING_DOTS = 1;

my $uri = URI->new( 'http://some.com/folder/../file.xml' );

print "$uri\n";
http://some.com/folder/../file.xml

Why didn’t this normalize the URL? For this I have no answer, nor can I be bothered to dig into the URI.pm code to figure it out. I assume a URI instantiated directly is left mostly intact, in the hope that the user knows what they are doing.

Now why is this a problem? Imagine writing a robot that needs to traverse an entire site, but wants to visit each link only once. You and I can tell that http://some.com/folder/../file.xml and http://some.com/file.xml are the same, but how do we explain that to our storage mechanism (likely an associative array)?
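Here’s a minimal sketch of the problem (the URLs are made up):

use URI;

my %seen;
foreach my $link ( 'http://some.com/folder/../file.xml', 'http://some.com/file.xml' ) {
  my $uri = URI->new( $link );   # URI->new leaves the dot segments alone
  $seen{ "$uri" }++;             # so the two spellings become two different hash keys
}

print scalar( keys %seen ), " 'unique' URIs\n";   # prints 2, though both point at the same file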

Now usually this isn’t a problem, because most sites use relative links such as ../file.xml and other such normal-ness, but what if a site has http://some.com/folder/../file.xml as a link? Don’t say it won’t happen – I have found such a site, and it pains me to no end.

The solution then?


sub normalize_uri {
  my $uri = shift();

  # Drop the meaningless '.' segments, then reverse the path so that a '..'
  # is seen *before* the segment it cancels out.
  my @segments = reverse( grep { $_ ne '.' } $uri->path_segments );
  my @new_segments;

  my $skip_next = 0;
  for ( my $i = 0; $i < scalar( @segments ); $i++ ) {
    # The previous segment was '..', so this one gets swallowed.
    if ( $skip_next ) {
      $skip_next = 0;
      next;
    }

    # A '..' cancels the segment that follows it in the reversed list.
    if ( $segments[ $i ] eq '..' ) {
      $skip_next = 1;
      next;
    }

    # unshift instead of push keeps the surviving segments in original order.
    unshift( @new_segments, $segments[ $i ] );
  }

  $uri->path_segments( @new_segments );

  return $uri;
}

Notice that we start by stripping all . segments from the URI – they are effectively meaningless. Next, we reverse the entire path portion of the URI so that .. is easier to handle: the reversal lets us say ‘if the current segment is .., then the next segment must be skipped’. Finally, there is no need to reverse the array again if we use unshift instead of push – that is, we push onto the head rather than the tail, so the segments come out in their original order.
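To tie it back to the robot: key the ‘seen’ hash on the normalized string, and both spellings of a link collapse into a single visit. Roughly like this (fetch_page() is a stand-in for whatever your robot actually does, not a real routine):

use URI;

my %seen;
foreach my $link ( 'http://some.com/folder/../file.xml', 'http://some.com/file.xml' ) {
  my $key = normalize_uri( URI->new( $link ) )->as_string;
  next if $seen{ $key }++;   # already visited under another spelling
  # fetch_page( $key );      # hypothetical: do the real work here
}

print scalar( keys %seen ), " unique URI\n";   # just the one this time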

The Proof


#!/usr/bin/perl 

use strict;

use URI;
use Data::Dumper;

my %tests = (
  'http://toplevel.com/./index.php/folder/deep/../file.xml' => 'http://toplevel.com/index.php/folder/file.xml',
  'http://some.net/../folder/../deep/../file.xml' => 'http://some.net/file.xml',
  'http://www.place.com' => 'http://www.place.com',
  'http://www.place.com/to/rest/and/eat.html' => 'http://www.place.com/to/rest/and/eat.html'
);

foreach ( keys %tests ) {
  my ( $is, $should_be ) = ( $_, $tests{ $_ } );
  my $uri = URI->new( $is );
  my $was = normalize_uri( $uri );

  if ( $was ne $should_be ) {
    print STDERR "$was is not $should_be\n";
  }
  else {
    print "$was == $should_ben";
  }
}

sub normalize_uri {
  my $uri = shift();
  my @segments = reverse( grep { $_ ne '.' } $uri->path_segments );
  my @new_segments;

  my $skip_next = 0;
  for ( my $i = 0; $i < scalar( @segments ); $i++ ) {
    if ( $skip_next ) {
      $skip_next = 0;
      next;
    }

    if ( $segments[ $i ] eq '..' ) {
      $skip_next = 1;
      next;
    }

    unshift( @new_segments, $segments[ $i ] );
  }

  $uri->path_segments( @new_segments );

  return $uri;
}

Plugs & Shoutouts

A shameless plug for Carousel 30, who pays the bills.

A shout out to Red Tree Systems, LLC (we’ll explore why they have street cred in a later installment).

Ghetto Java keeps it real, and has a much stronger focus than I do.

2007-02-17

CURL Ups

Posted in Practical at 00:26:02 by streetprogramming

Let me put this bluntly: if you’re a web developer without CURL in your arsenal, you’re weak. You’ll get eaten alive out there, kid. Missing this part of your training probably means that you’re missing out on a lot of the lower-level details of HTTP, and possibly networking in general, even TCP/IP. Not that you need all of this information, of course, but it does make your understanding a lot deeper, and will therefore allow you to solve a much greater range of problems.

The website http://curl.haxx.se describes it with the following words:

curl is a command line tool for transferring files with URL syntax, supporting FTP, FTPS, HTTP, HTTPS, SCP, SFTP, TFTP, TELNET, DICT, FILE and LDAP. curl supports SSL certificates, HTTP POST, HTTP PUT, FTP uploading, HTTP form based upload, proxies, cookies, user+password authentication (Basic, Digest, NTLM, Negotiate, kerberos…), file transfer resume, proxy tunneling and a busload of other useful tricks.

That’s an understatement using entirely too many words. CURL is many things, but in this case it is our tool for testing and inspecting various low-level details such as headers and cookies. It can also be used wget-style to download remote files.
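For instance, the wget-style download looks like this (the first URL below is made up): -O saves the file under its remote name, while -o lets you pick a local name.

$> curl -O http://some.com/files/report.pdf
$> curl -o google.html http://www.google.com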

An easy example – Remote Viewing

You know about cat, right? RIGHT? Well, here’s a simple rcat, or remote cat:

$> curl http://www.google.com

CURL, with a minimal number of arguments, simply prints the body of the response. In this case, the HTML for Google’s home page is returned in an ugly format. But CURL can do so much more. Let’s see how we can check out the full response from Google:

$> curl --include http://www.google.com
HTTP/1.1 200 OK
Cache-Control: private
Content-Type: text/html
Set-Cookie: PREF=ID=54955a80f222999f:TM=1171683757:LM=1171683757:S=5inJ1k22Or-gt3sO; expires=Sun, 17-Jan-2038 19:14:07 GMT; path=/; domain=.google.com
Server: GWS/2.1
Transfer-Encoding: chunked
Date: Sat, 17 Feb 2007 03:42:37 GMT

Note that this time, instead of just the response body, we also got the response headers. We can see that Google’s server gives us the 200 OK response code, and is serving us text/html. They want to set a cookie that doesn’t expire for many years, and they’re giving us what they believe to be the current date, in GMT.
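If you want to play with that cookie yourself, CURL can save and replay cookies – the filename below is just one I picked:

$> curl --cookie-jar cookies.txt http://www.google.com >/dev/null
$> curl --cookie cookies.txt --include http://www.google.com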

Dig the reverse:

$> curl --verbose http://www.google.com
* About to connect() to www.google.com port 80
* Trying 216.239.37.99... * connected
* Connected to www.google.com (216.239.37.99) port 80
> GET / HTTP/1.1
User-Agent: curl/7.13.1 (powerpc-apple-darwin8.0) libcurl/7.13.1 OpenSSL/0.9.7l zlib/1.2.3
Host: www.google.com
Pragma: no-cache
Accept: */*
< HTTP/1.1 200 OK
< Cache-Control: private
< Content-Type: text/html
< Set-Cookie: PREF=ID=71dceb8afa870409:TM=1171685077:LM=1171685077:S=006IFHwoAhP5YKnt; expires=Sun, 17-Jan-2038 19:14:07 GMT; path=/; domain=.google.com
< Server: GWS/2.1
< Transfer-Encoding: chunked
< Date: Sat, 17 Feb 2007 04:04:37 GMT

This time we can see what we passed to the server as well as what it sent back, in addition to the response body. Note the information we send about ourselves – this is not unlike what your browser or other user agent passes along. User-Agent describes the agent used to make this request. The Host header identifies, by domain name, the site you wish to access. This is what allows “virtual” hosting – several names served from the same IP address.
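You can see virtual hosting in action by overriding the Host header yourself: point CURL at one address and ask for different names in turn (the names and IP below are made up):

$> curl --header 'Host: site-one.example.com' http://192.0.2.10/
$> curl --header 'Host: site-two.example.com' http://192.0.2.10/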

Wow. Cool. More stuff.

$> curl --trace TRACE.txt http://www.google.com >/dev/null

Forget about the output this time. Check out the bad ass TRACE.txt file. That’s showing you everything that CURL is doing, which becomes important when you start using libcurl in your apps. What I find especially interesting is the chunked reads.
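If the hex dump in TRACE.txt is more than you want, the same idea is available as plain text:

$> curl --trace-ascii TRACE.txt http://www.google.com >/dev/null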

How about a custom header?

$> curl --verbose --header 'X-MyApp-Token: 23fa3af3af3eda3efa3f' http://localhost/
* About to connect() to localhost port 80
* Trying ::1... * connected
* Connected to localhost (::1) port 80
> GET / HTTP/1.1
User-Agent: curl/7.13.1 (powerpc-apple-darwin8.0) libcurl/7.13.1 OpenSSL/0.9.7l zlib/1.2.3
Host: localhost
Pragma: no-cache
Accept: */*
X-MyApp-Token: 23fa3af3af3eda3efa3f

You can see that we’re passing a custom header in the form <name>: <value> to the server. The above example might be useful for communicating with a third party, passing along a token or some other identifying piece of information.

Well, there’s lots more, and we haven’t even scratched the surface, but your ignorance sickens me. I must go. Take your time on this. Marinate. Digest. One of these days I’ll show you how to hijack a session, presumably yours. It’s more than just childish pranks – it’s useful.