
RV_DecodeURL replaces '+' with ' '

Posted: Tue Sep 27, 2016 8:04 am
by cmm
Hello!

I ran into an issue when pasting a hyperlink containing a '+' symbol: the '+' is replaced by ' ' (a space).

Reproduction:
1. Create a hyperlink with the address https://test.com/CE+/2 in any tool that supports hyperlinks (Outlook, etc.) and give it a name.

2. Right click the hyperlink and select "Copy hyperlink".

3. Paste into a RichView component.

4. '+' will be gone.

This is of course by design, since the code explicitly replaces '+' with ' ':

if Result[i] = '+' then
  Result[i] := ' ';

What is the rationale behind this? I ask since it is breaking my hyperlink.

Posted: Tue Sep 27, 2016 8:23 am
by Sergey Tkachenko
In encoded URLs, pluses represent spaces. In the new encoding rules, spaces are encoded as "%20", but some applications still use "+".
So, if you need "+" in your URL, write it as "%2B".
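
A minimal sketch of the decoding rule described above, assuming plain ASCII input and Delphi (or Free Pascal in Delphi mode). SketchDecodeURL and HexDigitValue are hypothetical names, and this is not TRichView's actual RV_DecodeURL. It only illustrates why a literal "+" is lost while "%2B" survives:

```pascal
// Hypothetical helper: numeric value of a hex digit, or -1 if C is not one.
function HexDigitValue(C: Char): Integer;
begin
  case C of
    '0'..'9': Result := Ord(C) - Ord('0');
    'A'..'F': Result := Ord(C) - Ord('A') + 10;
    'a'..'f': Result := Ord(C) - Ord('a') + 10;
  else
    Result := -1;
  end;
end;

// Hypothetical sketch of form-style decoding (not TRichView's code):
// '+' becomes a space, and a valid %XX sequence becomes the character XX.
function SketchDecodeURL(const S: string): string;
var
  I, Hi, Lo: Integer;
begin
  Result := '';
  I := 1;
  while I <= Length(S) do
  begin
    if S[I] = '+' then
      Result := Result + ' '  // this is the step that breaks a literal '+'
    else if (S[I] = '%') and (I + 2 <= Length(S)) then
    begin
      Hi := HexDigitValue(S[I + 1]);
      Lo := HexDigitValue(S[I + 2]);
      if (Hi >= 0) and (Lo >= 0) then
      begin
        Result := Result + Chr(Hi * 16 + Lo);
        Inc(I, 2);  // skip the two hex digits just consumed
      end
      else
        Result := Result + S[I];  // malformed escape: keep '%' as it is
    end
    else
      Result := Result + S[I];
    Inc(I);
  end;
end;
```

With this sketch, SketchDecodeURL('https://test.com/CE+/2') yields 'https://test.com/CE /2', while SketchDecodeURL('https://test.com/CE%2B/2') yields 'https://test.com/CE+/2'.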

Posted: Tue Sep 27, 2016 8:55 am
by cmm
I understand that "%2B" represents a "+" and that a space may be encoded as "+". However, I do not think it is true that a "+" should always be decoded as a space.

If you look at google.com for example and search "Hello+world" you will get "Hello%2Bworld" in the address field. However, if you type "Hello world" into google you will get "Hello+world".

This means that they retain the "+" and will not alter the meaning of the input.

My main issue is that the hyperlink with a "+" comes from an external tool and the hyperlink when pasted should keep working with that tool.

Thank you for taking your time to reply to my problem.

Posted: Tue Sep 27, 2016 12:41 pm
by Sergey Tkachenko
Well, it's interesting.

On one hand, your Google example confirms what I said: space becomes "+" after encoding. So, when Google decodes https://www.google.ru/?q=a+b, the search string contains "a b", not "a+b". So, Google works like TRichView.

On the other hand, I placed two files, "a b.html" and "a+b.html", in http://www.trichview.com/support/forumfiles/
As you can see,
http://www.trichview.com/support/forumfiles/a+b.html opens "a+b.html", not "a b.html".
So my own server does not decode "+".

So, the situation depends on the server.
I can add an option, but I am not sure how to implement it. A global variable?

Posted: Tue Sep 27, 2016 12:47 pm
by Sergey Tkachenko
There is an option to turn off any decoding and encoding in TRichView at all.
If you include rvoPercentEncodedURL in TRichView.Options, TRichView assumes that all link targets stored in it are already %-encoded. So it does not encode links on import and does not encode on export.

But be careful with this option: when you assign link targets (in tags) directly, you must be sure that they are encoded. For example, such links must not contain spaces, they must be "%20".
Users must enter encoded URLs in TrvActionInsertHyperlink's dialog as well.
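
As a fragment only (it needs a form with a TRichView component, so it is not runnable on its own), enabling the option could look like the line below. RichViewEdit1 is a placeholder component name, and placing the line in the form's OnCreate handler is an assumption:

```pascal
// RichViewEdit1 is a hypothetical component name; adjust it to your form.
// Adds rvoPercentEncodedURL to the existing set of options.
RichViewEdit1.Options := RichViewEdit1.Options + [rvoPercentEncodedURL];
```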

Posted: Fri Sep 30, 2016 7:08 am
by cmm
Hello!

Thank you again for your replies.

I have been experimenting with rvoPercentEncodedURL a bit, and it seems to behave differently depending on whether you paste a hyperlink into the document or use the Insert Hyperlink dialog.

If you paste, it changes spaces to %20 but retains the "+".
If you use the Insert Hyperlink dialog, the link is inserted with regular spaces instead of %20, and the "+" is also retained.

So it seems that paste still encodes spaces but Insert Hyperlink does not.

Sergey Tkachenko wrote:
"But be careful with this option: when you assign link targets (in tags) directly, you must be sure that they are encoded. For example, such links must not contain spaces, they must be "%20".
Users must enter encoded URLs in TrvActionInsertHyperlink's dialog as well."

I do not understand the implications of this. What happens if a link contains spaces? From what I have seen, spaces work fine, but I might be missing something.

I will have to do some more experimenting, but rvoPercentEncodedURL might be the solution. The main reason I am bothering you with this is that it feels like a text editor should not alter the addresses being inserted; that should be left to the web servers, unless it is somehow required in order to store the links properly.

Thanks again.

Posted: Fri Sep 30, 2016 12:44 pm
by Sergey Tkachenko
Yes, even in this mode, when URLs are not supposed to be encoded/decoded, TRichView may encode spaces when exporting to DocX (and maybe HTML, I do not remember right now).
It's a kind of emergency fix: valid encoded URLs must not contain space characters. If you save URLs containing space characters in DocX as they are, MS Word will consider the whole DocX file invalid and will not open it.

Posted: Thu Oct 27, 2016 7:02 pm
by Maxim Masiutin
I have a feeling that pluses may only be replaced with spaces (and vice versa) in application/x-www-form-urlencoded key-value pairs.

RFC-1866, paragraph 8.2.1, subparagraph 1, says: "The form field names and values are escaped: space characters are replaced by `+', and then reserved characters are escaped".

Here is an example of such a string in a URL: "http://example.com/over/there?name=foo+bar". So spaces can be replaced by "+" only after the "?". Everywhere else, they should be replaced with %20. BTW, RFC-1866 is obsolete, so it's better never to replace spaces with "+".
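
A minimal sketch of that rule, assuming Delphi (or Free Pascal in Delphi mode): "+" is turned into a space only between the "?" and an optional "#". PlusToSpaceInQueryOnly is a hypothetical name, not part of TRichView or any RFC. Percent sequences are deliberately left untouched, so in a full decoder this step would have to run before percent-decoding; otherwise a "+" produced from "%2B" would be turned into a space as well.

```pascal
// Hypothetical sketch: treat '+' as a space only inside the query part.
function PlusToSpaceInQueryOnly(const URL: string): string;
var
  I, QueryStart, FragStart: Integer;
begin
  Result := URL;
  QueryStart := Pos('?', URL);
  FragStart := Pos('#', URL);
  if FragStart = 0 then
    FragStart := Length(URL) + 1;  // no fragment: the query runs to the end
  // No query at all, or the '?' belongs to the fragment: nothing to do.
  if (QueryStart = 0) or (QueryStart > FragStart) then
    Exit;
  for I := QueryStart + 1 to FragStart - 1 do
    if Result[I] = '+' then
      Result[I] := ' ';
end;
```

For "http://example.com/over/there?name=foo+bar" this gives "...?name=foo bar", while "https://test.com/CE+/2" and "http://www.trichview.com/support/forumfiles/a+b.html" are returned unchanged.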

It's better to percent-encode all characters except the "unreserved" ones defined in RFC-3986, p.2.3.

Here is a code example:

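(* percent-encode every character except the unreserved ones defined in RFC-3986, p.2.3 *)
const
  // lookup table of hex digits (not shown in the original post; assumed to look like this)
  HexCharArrA: array[0..15] of AnsiChar = '0123456789ABCDEF';

function UrlEncodeRfcA(const S: AnsiString): AnsiString;
var
  I: Integer;
  c: AnsiChar;
begin
  // percent-encoding, see RFC-3986, p. 2.1
  Result := S;
  // go backwards, so that inserting characters does not shift the positions still to be visited
  for I := Length(S) downto 1 do
  begin
    c := S[I];
    case c of
      'A' .. 'Z', 'a' .. 'z', // alpha
      '0' .. '9', // digit
      '-', '.', '_', '~':; // rest of unreserved characters as defined in the RFC-3986, p.2.3
    else
      begin
        // replace the character with '%' followed by two hex digits
        Result[I] := '%';
        Insert('00', Result, I + 1);
        Result[I + 1] := HexCharArrA[((Byte(C) shr 4) and $F)];
        Result[I + 2] := HexCharArrA[(Byte(C) and $F)];
      end;
    end;
  end;
end;

function UrlEncodeRfcW(const S: UnicodeString): AnsiString;
begin
  // encode to UTF-8 first, then percent-encode the resulting bytes
  Result := UrlEncodeRfcA(Utf8Encode(S));
end;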

Posted: Fri Oct 28, 2016 8:48 am
by Sergey Tkachenko
TRichView itself never encodes spaces as pluses, it encodes them as %20.
The problem is not in encoding, but in decoding of pluses.

TRichView's URL encoding procedure encodes fewer characters than this code: it encodes only '%', #13, #10, ' ', '+', '"'. Other characters are left untouched, because they never cause problems.
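
A minimal sketch of an encoder restricted to that character list, assuming Delphi (or Free Pascal in Delphi mode). SketchEncodeMinimal and HexDigits are hypothetical names; this is an illustration of the description above, not TRichView's actual procedure:

```pascal
// Hypothetical sketch: percent-encode only '%', #13, #10, ' ', '+' and '"'.
function SketchEncodeMinimal(const S: string): string;
const
  HexDigits: array[0..15] of Char = '0123456789ABCDEF';
var
  I: Integer;
begin
  Result := '';
  for I := 1 to Length(S) do
    case S[I] of
      '%', #13, #10, ' ', '+', '"':
        // all listed characters are below #128, so two hex digits are enough
        Result := Result + '%' + HexDigits[Ord(S[I]) shr 4] +
          HexDigits[Ord(S[I]) and $F];
    else
      Result := Result + S[I];  // everything else is copied as it is
    end;
end;
```

Because "+" is written as "%2B" here, a literal plus survives a later decoding step that treats "+" as a space: SketchEncodeMinimal('a b+c') gives 'a%20b%2Bc'.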