This Google Help article explains how to handle special characters in Sitemaps, which you can submit to Webmaster Tools to increase the number of indexed pages of your website.
The main point is that URLs must contain ASCII characters only.
This is achieved as follows:
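- (obvious) the ampersand, both quote characters and the < and > symbols must be escaped as XML entities,
- non-ASCII characters must be percent-encoded in the encoding your server expects, e.g. ü becomes %C3%BC in UTF-8 (or %FC in ISO-8859-1),
- the URLs that you submit must follow RFC 3986.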
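The rules above can be sketched in a few lines. Python is used here purely for illustration, and `sitemap_loc` is a hypothetical helper name, not something taken from the Google article. The sketch assumes the URL has not been escaped yet and that the server expects UTF-8.

```python
from urllib.parse import quote
from xml.sax.saxutils import escape


# Hypothetical helper: turn a raw URL into text that is safe inside <loc>.
def sitemap_loc(url: str) -> str:
    # Step 1: percent-encode everything outside ASCII (as UTF-8 bytes),
    # leaving the usual URL delimiters and existing %XX escapes alone.
    ascii_url = quote(url, safe=":/?#[]@!$&'()*+,;=%")
    # Step 2: escape the characters that are special in XML.
    return escape(ascii_url, {'"': "&quot;", "'": "&apos;"})


print(sitemap_loc("http://www.example.com/ümlat.html&q=name"))
# http://www.example.com/%C3%BCmlat.html&amp;q=name

# The same character under a single-byte server encoding:
print(quote("ü", encoding="latin-1"))  # %FC
```

Which of the two percent-encoded forms is right depends on the encoding your web server uses to resolve URLs, so check that before generating the Sitemap.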
If you use PHP, pay attention to one thing: rawurlencode should be used instead of the usual urlencode, since it follows RFC 3986, as stated in the PHP documentation.
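The practical difference is easiest to see on a space character. The sketch below is not PHP: Python's `quote` and `quote_plus` are shown as rough analogues of rawurlencode and urlencode. That pairing is an assumption made for illustration, and the functions are not identical in every detail.

```python
from urllib.parse import quote, quote_plus

segment = "my file ü.html"

# RFC 3986 style: a space becomes %20 (what rawurlencode does in PHP).
print(quote(segment, safe=""))   # my%20file%20%C3%BC.html

# Form style: a space becomes "+" (what urlencode does in PHP).
print(quote_plus(segment))       # my+file+%C3%BC.html
```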
You might be surprised, but the right choice of text encoding for a project can affect both the size of its data and the number of bugs.
To avoid text being displayed incorrectly on your pages, make sure that all database entities and the application server (PHP) use the same encoding. That lets you forget about issues with text presentation.
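A minimal sketch of what such a mismatch looks like, written in Python for illustration only: the text is stored by one component as UTF-8 and read back by another that assumes ISO-8859-1. The variable names are hypothetical.

```python
text = "ü"

# One side writes the text as UTF-8 bytes...
stored = text.encode("utf-8")        # b'\xc3\xbc'

# ...and the other side reads those bytes with a different encoding.
shown = stored.decode("latin-1")
print(shown)                         # Ã¼

# With the same encoding on both sides the text survives intact.
print(stored.decode("utf-8"))        # ü
```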
On the database side, make sure the correct encoding is set in:
- the parameters of import/export tools (a big source of encoding bugs),
- the corresponding SQL server variables.
On the application server, it is usually just one query:
SET NAMES utf8
Note that utf8 might not be the best choice for your project: in UTF-8 every non-ASCII character takes 2 to 4 bytes (2 for Cyrillic letters, for example), so if you are building a single-language (local) project with a lot of database data, consider a 1-byte encoding such as windows-1251 and save about half of the space on the server file system.
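The size difference is easy to check. The sketch below, in Python and with a hypothetical sample string, compares the byte length of the same Cyrillic text in UTF-8 and in windows-1251 (named cp1251 in Python).

```python
sample = "Привет, мир"  # 9 Cyrillic letters plus a comma and a space

utf8_size = len(sample.encode("utf-8"))      # 2 bytes per Cyrillic letter
cp1251_size = len(sample.encode("cp1251"))   # 1 byte per character

print(utf8_size, cp1251_size)  # 20 11
```

The saving approaches one half only for text that is almost entirely non-ASCII. ASCII characters take one byte in both encodings.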