Posted by exahz
www.codeguru.com/cpp/cpp/cpp_mfc/article.php/c4029/

URL Encoding

Chandrasekhar Vuppalapati
May 22, 2003
Environment: VC++, MFC

Introduction

The purpose of this article is to design a C++ class that does URL encoding. The motivation was a previous project in which I needed to post data from a VC++ 6.0 application, and that data had to be URL encoded. I searched MSDN for a class or API that returns a URL-encoded value for a given input string, but I did not find one. So, I had to come up with my own URLEncode C++ class.

URLEncoder.exe is an MFC dialog-based application that uses the URLEncode class.

Process

URL encoding is a special process that makes sure that all the characters are "safe" to transmit across the Internet. Some characters have special meaning to various programs involved in sending the data across the Internet.

For example, a carriage return has an ASCII value of 13. Programs involved in sending form data may interpret this character as the end of a line of data.

Traditionally, all Web applications transfer data between the client and server by using the HTTP or HTTPS protocols. There are basically two ways in which a server receives input from a client:

  1. Data can be passed in the HTTP request itself (in headers such as cookies, or in the body of a posted form), or
  2. It can be included in the query portion of the requested URL.

When data is included in a URL, it must be specially encoded to conform to proper URL syntax. On the Web server side, the data is automatically decoded. Consider the following URL, where data is posted as a query string parameter.

Example: http://WebSite/ResourceName?Data=Data

Where WebSite is the host name of the Web server,
ResourceName is the name of the ASP page or servlet, and
Data is the value to be posted to the Web server. It must be encoded if the MIME type is Content-Type: application/x-www-form-urlencoded.
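
The server-side decoding mentioned above can be sketched in standard C++. This is only an illustration, not the code of any particular Web server: the names hexDigitValue and urlDecode are hypothetical, and treating "+" as a space is the convention for application/x-www-form-urlencoded data.

```cpp
#include <string>

// Hypothetical helper: value of one hexadecimal digit, or -1 if c is not one.
static int hexDigitValue(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

// Hypothetical helper: decode "%XX" sequences and turn '+' into a space.
// Malformed sequences such as "%zz" or a trailing "%" are copied unchanged.
std::string urlDecode(const std::string& in)
{
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '%' && i + 2 < in.size()) {
            int hi = hexDigitValue(in[i + 1]);
            int lo = hexDigitValue(in[i + 2]);
            if (hi >= 0 && lo >= 0) {
                out += static_cast<char>(hi * 16 + lo);
                i += 2;  // skip the two hex digits just consumed
                continue;
            }
        }
        out += (in[i] == '+') ? ' ' : in[i];
    }
    return out;
}
```

With this sketch, urlDecode("a%20b") gives "a b", and urlDecode("x+y%2Bz") gives "x y+z".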

RFC 1738

The RFC 1738 specification, which defines Uniform Resource Locators (URLs), restricts the characters allowed in a URL to a subset of the US-ASCII character set. This is a limitation because HTML, on the other hand, allows the entire range of the ISO-8859-1 (ISO-Latin) character set to be used in documents. As a result, if HTML data is uploaded in a form post (or as part of a query string), all of that data has to be encoded.

ISO-8859-1 (ISO-Latin) Character Set

The ISO-8859-1 (ISO-Latin) character set has 256 entries. A complete table of the set lists, for each character, its ISO-8859-1 position (its decimal code), description, entity number, hexadecimal value, and HTML result. Broadly, the range can be divided into safe and unsafe characters as follows.

Range (decimal)  Type                       Values                   Safe/Unsafe
0-31             ASCII control characters   (not printable)          Unsafe
32-47            Reserved characters        (space)!"#$%&'()*+,-./   Unsafe
48-57            ASCII digits               0-9                      Safe
58-64            Reserved characters        :;<=>?@                  Unsafe
65-90            ASCII uppercase letters    A-Z                      Safe
91-96            Reserved characters        [\]^_`                   Unsafe
97-122           ASCII lowercase letters    a-z                      Safe
123-126          Reserved characters        {|}~                     Unsafe
127              Control character          (DEL)                    Unsafe
128-255          Non-ASCII characters       -                        Unsafe

All unsafe ASCII characters must be encoded; for example, the ranges 32-47, 58-64, 91-96, and 123-126.
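
The range table can be turned into a small check. This is a minimal sketch of the coarse classification in the table only, and isSafeByRange is a hypothetical name. It is stricter than necessary: RFC 3986, for instance, also treats "-", ".", "_", and "~" as unreserved.

```cpp
// Hypothetical helper: classify a byte by the decimal ranges in the table.
// Only digits and ASCII letters are reported as safe.
bool isSafeByRange(unsigned char c)
{
    return (c >= 48 && c <= 57)     // 0-9
        || (c >= 65 && c <= 90)     // A-Z
        || (c >= 97 && c <= 122);   // a-z
}
```

Any byte for which isSafeByRange returns false would be percent-encoded under this classification.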

Below is the table that describes why these characters are not safe.

Character  Encoded  Unsafe reason
<          %3C      Delimiter around URLs in free text
>          %3E      Delimiter around URLs in free text
"          %22      Delimits URLs in some systems
#          %23      Used in the World Wide Web and in other systems to delimit a URL from a fragment/anchor identifier that might follow it
{          %7B      Gateways and other transport agents are known to sometimes modify such characters
}          %7D      Gateways and other transport agents are known to sometimes modify such characters
|          %7C      Gateways and other transport agents are known to sometimes modify such characters
\          %5C      Gateways and other transport agents are known to sometimes modify such characters
^          %5E      Gateways and other transport agents are known to sometimes modify such characters
~          %7E      Gateways and other transport agents are known to sometimes modify such characters
[          %5B      Gateways and other transport agents are known to sometimes modify such characters
]          %5D      Gateways and other transport agents are known to sometimes modify such characters
`          %60      Gateways and other transport agents are known to sometimes modify such characters
+          %2B      Indicates a space (spaces cannot be used in a URL; a space itself is encoded as %20)
/          %2F      Separates directories and subdirectories
?          %3F      Separates the actual URL and the parameters
&          %26      Separator between parameters specified in the URL

How It Is Done

URL encoding of a character is done by taking the two-digit hexadecimal value of the character's 8-bit code and prefixing it with a percent sign ("%"). For example, the US-ASCII character set represents a space with decimal code 32, or hexadecimal 20. Thus, its URL-encoded representation is %20.
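
As a minimal sketch of that rule in standard C++ (percentEncodeByte is a hypothetical name, not part of the article's class), the two hexadecimal digits can be read from the high and low four bits of the byte:

```cpp
#include <string>

// Hypothetical helper: return the "%XX" form of one byte.
std::string percentEncodeByte(unsigned char byte)
{
    static const char kHexDigits[] = "0123456789ABCDEF";
    std::string out = "%";
    out += kHexDigits[byte >> 4];    // high four bits
    out += kHexDigits[byte & 0x0F];  // low four bits
    return out;
}
```

For a space (decimal 32) this returns "%20", and for a carriage return (decimal 13) it returns "%0D".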

URLEncode: CURLEncode is a C++ class that URL-encodes a given string of data. The class has the following member functions.

  • isUnsafe
  • decToHex
  • convert
  • URLEncode

The URLEncode() method does the encoding. It checks each character in the string to see whether the character is safe or unsafe (isUnsafe). If the character is unsafe, it is replaced with "%" followed by its hexadecimal value (convert). The result is appended to the output string.

Code Snippet

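class CURLEncode
{
private:
  // Characters that are always treated as unsafe
  static CString csUnsafeString;
  // Converts a character code to a two-digit string in the given radix
  CString decToHex(char num, int radix);
  bool isUnsafe(char compareChar);
  // Returns "%" followed by the hexadecimal value of val
  CString convert(char val);

public:
  CURLEncode() { }
  virtual ~CURLEncode() { }
  CString URLEncode(CString vData);
};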

bool CURLEncode::isUnsafe(char compareChar)
{
  // Characters listed in csUnsafeString are always unsafe.
  bool bcharfound = false;
  int m_strLen = csUnsafeString.GetLength();
  for (int ichar_pos = 0; ichar_pos < m_strLen; ichar_pos++)
  {
    if (csUnsafeString.GetAt(ichar_pos) == compareChar)
    {
      bcharfound = true;
      break;
    }
  }

  // Otherwise, only codes 33 through 122 are treated as safe. Anything else,
  // including bytes with the high bit set, is reported as unsafe.
  int char_ascii_value = (int) compareChar;
  if (bcharfound == false && char_ascii_value > 32 && char_ascii_value < 123)
  {
    return false;  // safe: no encoding needed
  }
  return true;     // unsafe: must be encoded
}

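CString CURLEncode::decToHex(char num, int radix)
{
  // hexVals is a lookup table of hexadecimal digit characters that is
  // defined elsewhere in the source; it is not shown in this excerpt.
  int temp = 0;
  CString csTmp;
  int num_char = (int) num;

  // Map a negative (signed) char to its unsigned value, 128-255.
  if (num_char < 0)
    num_char = 256 + num_char;

  // Collect digits from least significant to most significant.
  while (num_char >= radix)
  {
    temp = num_char % radix;
    num_char = num_char / radix;  // integer division; floor() is not needed
    csTmp += hexVals[temp];
  }

  csTmp += hexVals[num_char];

  // Pad to two digits; the padding ends up in front after the reversal.
  if (csTmp.GetLength() < 2)
  {
    csTmp += '0';
  }

  // The digits were collected in reverse order, so reverse the string.
  csTmp.MakeReverse();
  return csTmp;
}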

CString CURLEncode::convert(char val)
{
  CString csRet;
  csRet += "%";
  csRet += decToHex(val, 16);
  return  csRet;
}
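
The snippet above does not include the body of URLEncode(). The following is a portable sketch of the loop described in the text, written with std::string instead of CString. It is not the author's implementation: urlEncode, isUnsafeChar, and kUnsafeChars are hypothetical names, and the contents of kUnsafeChars are an assumption based on the table of unsafe characters plus "%", because the article does not show what csUnsafeString contains.

```cpp
#include <string>

// Assumed unsafe set: the characters from the unsafe-character table plus '%'.
// The real contents of csUnsafeString are not shown in the article.
static const std::string kUnsafeChars = "\"<>#%{}|\\^~[]`+/?&";

// Mirrors the rule in isUnsafe: listed characters and anything outside 33..122.
bool isUnsafeChar(unsigned char c)
{
    if (kUnsafeChars.find(static_cast<char>(c)) != std::string::npos)
        return true;
    return c <= 32 || c >= 123;
}

// Copy safe characters as-is and replace unsafe ones with "%XX".
std::string urlEncode(const std::string& data)
{
    static const char kHexDigits[] = "0123456789ABCDEF";
    std::string out;
    out.reserve(data.size() * 3);  // worst case: every byte is encoded
    for (unsigned char c : data) {
        if (isUnsafeChar(c)) {
            out += '%';
            out += kHexDigits[c >> 4];
            out += kHexDigits[c & 0x0F];
        } else {
            out += static_cast<char>(c);
        }
    }
    return out;
}
```

Under these assumptions, urlEncode("a b") returns "a%20b" and urlEncode("a/b?c") returns "a%2Fb%3Fc". Characters such as "=" pass through unchanged because they are neither in the assumed set nor outside the 33-122 range.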

URLEncoder

References

URL Encoding: http://www.blooberry.com/indexdot/html/topics/urlencoding.htm

RFC 1866: The HTML 2.0 specification (plain text). The appendix contains the Character Entity table: http://www.rfc-editor.org/rfc/rfc1866.txt

The Web version of the HTML 2.0 (RFC 1866) Character Entity table: http://www.w3.org/MarkUp/html-spec/html-spec_13.html

The HTML 3.2 (Wilbur) recommendation [This includes all character entities listed in HTML 2.0, plus new named entities covering the ISO 8859-1 160-191 range.]: http://www.w3.org/MarkUp/Wilbur/

The HTML 4.0 Recommendation [Includes new Unicode character entities]: http://www.w3.org/TR/REC-html40/

The W3C HTML Internationalization area: http://www.w3.org/International/O-HTML.html



BSTR (BASIC string)

A BSTR is essentially an array of Unicode characters.

Using BSTR in C++
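#include <windows.h>
#include <oleauto.h>  // SysAllocString and related functions; link with OleAut32.lib
#include <stdio.h>

int main(void)
{
  BSTR b1;

  // Allocate a new BSTR that contains some text
  b1 = SysAllocString(L"Testing BSTs");
  if (b1 == NULL)
    return 1;  // allocation failed

  // Display the BSTR
  wprintf(L"%ls\n", b1);

  // Display the number of bytes (two bytes per character)
  wprintf(L"%u bytes\n", SysStringByteLen(b1));

  // Display the number of characters
  wprintf(L"%u characters\n", SysStringLen(b1));

  // Free the BSTR
  SysFreeString(b1);
  return 0;
}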

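Function names

SysAllocString : Allocates a BSTR and copies a string into it.
SysAllocStringByteLen : Takes an ANSI string as input and returns a BSTR that contains it.
SysAllocStringLen : Allocates a new BSTR, copies the specified number of characters into it, and appends a null character.
SysFreeString : Frees a BSTR.
SysReAllocString : Allocates a new BSTR, copies the passed string into it, and frees the old BSTR.
SysReAllocStringLen : Creates a new BSTR that contains the specified number of characters from the old BSTR, and frees the old BSTR.
SysStringByteLen : Returns the length of a BSTR in bytes.
SysStringLen : Returns the length of a BSTR in characters.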
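
SysStringByteLen and SysStringLen can return the length without scanning the string because a BSTR stores a four-byte length prefix (the byte count, excluding the terminating null) immediately before the first character. The sketch below imitates that layout in portable C++ so it can be tried without Windows. Every name in it (FakeBstr, fakeAllocStringLen, and so on) is hypothetical, and it uses malloc, whereas real BSTRs are allocated by the COM allocator through the Sys* functions.

```cpp
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical stand-in for BSTR; this is not the Windows API.
using FakeBstr = char16_t*;

// Allocate [4-byte byte length][characters][terminating null] and return
// a pointer to the first character, which is what a BSTR points to.
FakeBstr fakeAllocStringLen(const char16_t* src, std::uint32_t charCount)
{
    const std::uint32_t byteLen =
        static_cast<std::uint32_t>(charCount * sizeof(char16_t));
    unsigned char* block = static_cast<unsigned char*>(
        std::malloc(sizeof(byteLen) + byteLen + sizeof(char16_t)));
    if (block == nullptr) return nullptr;
    std::memcpy(block, &byteLen, sizeof(byteLen));                // length prefix
    std::memcpy(block + sizeof(byteLen), src, byteLen);           // characters
    std::memset(block + sizeof(byteLen) + byteLen, 0, sizeof(char16_t));  // null
    return reinterpret_cast<FakeBstr>(block + sizeof(byteLen));
}

// Read the byte length from the prefix stored just before the characters.
std::uint32_t fakeStringByteLen(FakeBstr s)
{
    if (s == nullptr) return 0;
    std::uint32_t byteLen = 0;
    std::memcpy(&byteLen,
                reinterpret_cast<unsigned char*>(s) - sizeof(byteLen),
                sizeof(byteLen));
    return byteLen;
}

// Character count is the byte length divided by the character size.
std::uint32_t fakeStringLen(FakeBstr s)
{
    return fakeStringByteLen(s) / sizeof(char16_t);
}

// Free the whole block, starting at the length prefix.
void fakeFreeString(FakeBstr s)
{
    if (s != nullptr)
        std::free(reinterpret_cast<unsigned char*>(s) - sizeof(std::uint32_t));
}
```

For a 7-character string, fakeStringLen reports 7 and fakeStringByteLen reports 14, matching the two-bytes-per-character relationship shown in the example above.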

