Ticket #2409 (new defect)

Opened 8 months ago

Last modified 3 months ago

IRI (RFC 3987) support

Reported by: pjd
Assigned to: pjd
Priority: normal
Milestone:
Component: Nevow
Severity: normal
Keywords:
Cc:
Author: pjd
Branch: branches/URL-IRI-2409

Description (Last modified by pjd)

(Context: #2408)

Nevow and Mantissa should properly support the IRI standard (RFC 3987) for Unicode URIs.

Subtickets: #2476, #2522

Change History

11/21/07 13:40:15 changed by pjd

  • description changed.

(link the wiki page)

01/17/08 10:20:11 changed by pjd

  • description changed.

(note #2476)

02/15/08 11:04:54 changed by pjd

  • keywords set to IRI.

02/15/08 11:05:33 changed by pjd

  • description changed.

02/24/08 14:10:22 changed by pjd

  • description changed.

(#2522, #2518)

04/22/08 09:30:39 changed by pjd

  • branch set to branches/URL-IRI-2409.
  • author set to pjd.

(In [15533]) Branching to 'URL-IRI-2409'

04/25/08 09:55:32 changed by pjd

  • keywords changed from IRI to review.
  • owner changed from pjd to exarkun.

This changes URL to always store unicode values internally, and should make parsing and serialization roughly RFC 3987 compliant.

It also fixes a bug that crept into URL.__repr__, and adds explicit text encoding/decoding to Mantissa's password reset mailing, which was affected by URL.netloc becoming unicode. (There might be more fallout lurking, which we'll have to fix as it becomes apparent.)
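To make the unicode-in, ASCII-out behaviour described in this comment concrete, here is a minimal sketch of the RFC 3987 mapping for a single path segment. It is written against Python 3's standard urllib.parse rather than Nevow's URL class, and the helper names segment_to_uri and segment_from_uri are hypothetical; they are not part of the branch.

```python
from urllib.parse import quote, unquote


def segment_to_uri(segment):
    """Serialize one unquoted unicode path segment as ASCII.

    Non-ASCII characters are encoded as UTF-8 and then percent-encoded,
    which is the IRI-to-URI mapping RFC 3987 describes for paths.
    """
    # safe='' so that a '/' inside a segment is escaped too.
    return quote(segment, safe='', encoding='utf-8')


def segment_from_uri(text):
    """Parse one percent-encoded segment back to unquoted unicode."""
    return unquote(text, encoding='utf-8')


print(segment_to_uri('caf\u00e9'))                      # caf%C3%A9
print(segment_from_uri('caf%C3%A9') == 'caf\u00e9')     # True
```

Storing the unquoted unicode form internally and quoting only at serialization time means each component is escaped exactly once.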

04/25/08 15:26:35 changed by exarkun

  • keywords deleted.
  • owner changed from exarkun to pjd.

Doc issues:

  • None of the URL attributes should be documented as flattenable. unicode is good and sufficient.
  • The documentation also doesn't explain whether the attributes are quoted or unquoted.
  • Nor is there documentation for __init__ explaining whether the parameters are supposed to be quoted or unquoted when passed in.

Compatibility issues:

  • not sure what to do about this:
    • trunk:
      >>> str(URL('sc/%2Fheme', 'ho/%2Fst', ['pa/%2Fth'], [('qu/%2Fery', 'ar/%2Fg')], 'frag/%2Fment'))
      'sc/%2Fheme://ho/%2Fst/pa%2F%252Fth?qu%2F%252Fery=ar%2F%252Fg#frag%2F%252Fment'
      
    • branch:
      >>> str(URL('sc/%2Fheme', 'ho/%2Fst', ['pa/%2Fth'], [('qu/%2Fery', 'ar/%2Fg')], 'frag/%2Fment'))
      'sc//heme://ho//st/pa%2F%2Fth?qu%2F%2Fery=ar%2F%2Fg#frag%2F%2Fment'
      

Generally, backwards compatibility is preferable. The quoting/unquoting is particularly worrying, since there are lots of potential security issues here. Also, the ideal API, one which takes components rather than a string, would take unquoted strings, which is the opposite of what __init__ is doing now.
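The difference between the trunk and branch outputs can be reproduced for a single component with the standard library alone. This is a sketch in Python 3 with hypothetical helper names (quote_as_is, requote), not the code from trunk or the branch: the first treats its input as unquoted and escapes it, the second unquotes first and then re-escapes.

```python
from urllib.parse import quote, unquote


def quote_as_is(component):
    # Treat the input as unquoted text: '%' itself is escaped, so an
    # already-quoted '%2F' becomes '%252F'.
    return quote(component, safe='')


def requote(component):
    # Treat the input as quoted text: unquote it first, then escape.
    # A literal '/' and a quoted '%2F' become indistinguishable.
    return quote(unquote(component), safe='')


print(quote_as_is('pa/%2Fth'))  # pa%2F%252Fth, matching the trunk path
print(requote('pa/%2Fth'))      # pa%2F%2Fth, matching the branch path
```

Neither helper is right for both kinds of input: quote_as_is double-escapes text that was already quoted, while requote silently decodes any percent sequence that was meant literally. That is why the documentation needs to state which form __init__ expects.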

Minor code points:

  • The basestring check doesn't bring much; since arbitrary flattenable things aren't supported, I think we can avoid it entirely.
  • URL.path shouldn't be expanded to support arbitrary serializables.
  • test_roundtrip is missing a docstring.
  • Some URL changes aren't tested (I think; I might have missed an existing test). A number of them are for things I questioned above, though, so we should figure out what the functionality will actually be before adding tests.