Understanding Ruby: String Encoding, ASCII & Unicode

How can strings of characters exist in a world where computers only understand ones & zeros?

Well…

Just like we can map a domain name to an IP address.

Or a barcode to a specific product.

We can…

Map numbers to characters!

Like 97 to "a".

Or 122 to "z".

That’s exactly how we can have characters in a world of numbers.

But what numbers go with what characters?

To answer that question we have invented different character mapping systems.

Starting with ASCII.

ASCII stands for “American Standard Code for Information Interchange”.

You can find an ASCII table, or you can ask Ruby to convert characters to their ASCII value.

Like this:

"a".ord
# 97

For multiple characters:

"abc".bytes
# [97, 98, 99]

If you have an integer you can get the associated character.

Like this:

97.chr
# "a"

ASCII encoding includes:

Control characters (like newlines, tabs, null)
Symbols (like parentheses, equals signs, question marks)
Digits (0-9)
Letters (a-z, A-Z)
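
To make those groups concrete, here is a minimal sketch that classifies a character by its ASCII code. The ascii_category method is made up for this example (it isn’t part of Ruby), and counting the space character as a symbol is a simplification.

```ruby
# Hypothetical helper (not part of Ruby): classify a one-character
# string by the ASCII range its code falls into.
def ascii_category(char)
  case char.ord
  when 0..31, 127      then :control   # newline, tab, null, delete...
  when 48..57          then :digit     # 0-9
  when 65..90, 97..122 then :letter    # A-Z, a-z
  when 32..126         then :symbol    # every printable code that is left
  else                      :non_ascii # codes above 127 are not ASCII
  end
end

ascii_category("\n") # :control
ascii_category("7")  # :digit
ascii_category("a")  # :letter
ascii_category("?")  # :symbol
ascii_category("Σ")  # :non_ascii
```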

As we’ll see later in this article, this range of characters is limited.

Why?

Because it doesn’t include characters & symbols from other languages, like Chinese or Japanese.

ASCII in The Real World

This whole mapping numbers to strings thing happens behind the scenes for you.

But there are some practical uses!

For example:

The URI specification only allows a limited set of characters inside URLs, and some allowed characters (like + or &) have a special meaning.

But you can percent-encode these characters using their ASCII codes & web servers will decode them correctly.

example.com/a+++ => example.com/a%2B%2B%2B

What is %2B?

It’s the character + in percent-encoded form: a % sign followed by 2B, the ASCII code of + (43) written in hexadecimal.
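
Here’s a small sketch of where %2B comes from. It builds the escape by hand from the ASCII code, then does the same with URI.encode_www_form_component from Ruby’s standard library. That method is meant for form & query string values, so treat it as one possible way to escape a URL part, not the only one.

```ruby
require "uri"

# Build the percent-encoded form by hand: "%" plus the ASCII code in hex.
code    = "+".ord                  # 43
escaped = format("%%%02X", code)   # "%2B"

# The standard library can do the same for a whole string.
encoded = URI.encode_www_form_component("a+++")   # "a%2B%2B%2B"
decoded = URI.decode_www_form_component(encoded)  # "a+++"
```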

You can also use this knowledge to transform characters.

For example:

If you look at the ASCII table, you’ll notice that you can convert a lowercase letter into uppercase by subtracting 32 from its ASCII code.

("a".ord - 32).chr
# "A"

That also works the other way around.

("A".ord + 32).chr
# "a"

Yes.

In Ruby, we have the upcase & downcase methods.

But this could be helpful to you in some kind of interview question, coding challenge, or similar situation.
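
For that kind of exercise, here’s a minimal sketch of an “upcase” built on the subtract-32 trick. The ascii_upcase name is made up for this example. The range check matters: without it, a character like "!" (code 33) would turn into a control character.

```ruby
# Hypothetical helper: upcase only the ASCII letters a-z by subtracting 32
# from their codes. Every other character is left untouched.
def ascii_upcase(str)
  str.codepoints
     .map { |code| code.between?(97, 122) ? code - 32 : code }
     .pack("U*") # turn the list of code points back into a UTF-8 string
end

ascii_upcase("hello, World!") # "HELLO, WORLD!"
ascii_upcase("{a}")           # "{A}"
```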

What is Unicode?

ASCII can only encode 128 different characters (256 with the various 8-bit “extended ASCII” encodings), which limits what characters we can represent.

The solution?

Unicode.

Unicode is a character set with room for over a million code points (1,114,112 to be exact), and encodings like UTF-8 define how those code points are stored as bytes.

That’s a lot more space than ASCII!

Now we can include characters from all sorts of languages, new symbols & even emojis.

Here’s some Unicode:

ɑΩϕβΣπ

These are characters from the Greek alphabet which can’t be displayed using ASCII.
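
You can ask Ruby about these characters the same way we did with ASCII. This sketch assumes the file is saved as UTF-8 (Ruby’s default source encoding), where one character can take more than one byte.

```ruby
sigma = "Σ"
sigma.ord          # 931 (the Unicode code point U+03A3)
sigma.bytes        # [206, 163] (two bytes in UTF-8)
sigma.ascii_only?  # false

emoji = "😃"
emoji.ord          # 128515
emoji.length       # 1 (one character)
emoji.bytesize     # 4 (stored as four bytes)

# Going back from a code point needs an explicit encoding.
931.chr(Encoding::UTF_8) # "Σ"
```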

How to Use Unicode in Ruby

Ruby strings have been encoding-aware since Ruby 1.9, and UTF-8 has been the default source encoding since Ruby 2.0.

So you can do this:

π = 3.141592

Or this:

def ★★★
  puts "You get 3 stars, great job!"
end

★★★
# "You get 3 stars, great job!"

Pretty fun!

But it’s probably not that practical to define methods & variables using these symbols because they aren’t on most keyboards.

In fact, there are valid, invisible Unicode characters.

Example:

def 
  puts "Invisible method"
end

This looks like a method without a name, which normally isn’t allowed.

But it works because of that invisible Unicode character!

String Encoding Methods

Ruby has methods for working with different encoding systems.

For example:

"abc".encoding.name
# "UTF-8"

Sometimes the encoding Ruby reports (encoding.name) doesn’t match the bytes actually stored in the string.

This can happen when reading data from a website, file, database or another external source.

Converting such a string can raise an Encoding::InvalidByteSequenceError.
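
Here’s a small sketch that reproduces the problem. It assumes a UTF-8 source file and uses the byte \xFF, which can never appear in valid UTF-8. The valid_encoding? & scrub methods (scrub needs Ruby 2.1 or later) are two ways to detect & patch the broken bytes.

```ruby
# Labeled UTF-8, but the last byte is not valid UTF-8.
broken = "abc\xFF"

broken.encoding.name    # "UTF-8"
broken.valid_encoding?  # false

begin
  broken.encode("UTF-16")
rescue Encoding::InvalidByteSequenceError => e
  e.class # Encoding::InvalidByteSequenceError
end

# scrub swaps the invalid bytes for a replacement string.
broken.scrub("?")       # "abc?"
```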

If that happens you’ll need to change the encoding.

How?

Using the encode method:

"abcΣΣΣ".encode("ASCII", "UTF-8", undef: :replace)

# "abc???"

I’m converting from UTF-8 (Unicode) to ASCII, and because the Σ character is not available in ASCII, we tell Ruby to replace it.

By default, this replaces undefined characters with question marks.

But you can change that.

Like this:

"abcΣΣΣ".encode("ASCII", "UTF-8", invalid: :replace, undef: :replace, replace: "")

# "abc"

Or using the “fallback” option:

"abcΣΣΣ".encode("ASCII", "UTF-8", fallback: {"Σ" => "E"})

# "abcEEE"

This is saying:

“Convert all characters from UTF-8 (Unicode) to ASCII, and use the fallback hash to translate characters that don’t exist in ASCII”.
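
Two more details, shown as a sketch: without any options encode raises an error for characters the target encoding doesn’t have, and the fallback can be a lambda instead of a hash. The `<U+...>` format used here is just an illustration.

```ruby
text = "abcΣ"

# Without options, a character missing from ASCII raises an error.
begin
  text.encode("US-ASCII")
rescue Encoding::UndefinedConversionError
  # Σ has no ASCII equivalent
end

# A lambda fallback receives each unknown character & returns its replacement.
text.encode("US-ASCII", fallback: ->(char) { format("<U+%04X>", char.ord) })
# "abc<U+03A3>"
```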

Another method, force_encoding, changes the encoding label without this translation step, so the bytes themselves stay the same.

Example:

"abc½½½".force_encoding("iso-8859-1")

You can get a list of available encoding names with the Encoding.name_list method, and Encoding.aliases returns a hash that maps each alias to its canonical name.
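
To see how force_encoding & encode work together, here’s a sketch with made-up input: the bytes of "café" as ISO-8859-1 would store them, sitting in a string that Ruby labels as UTF-8 (assuming a UTF-8 source file).

```ruby
# "\xE9" is "é" in ISO-8859-1, but on its own it is not valid UTF-8.
# dup gives us a copy we can modify even with frozen string literals.
raw = "caf\xE9".dup

raw.valid_encoding?              # false
raw.force_encoding("ISO-8859-1") # same bytes, new label
raw.valid_encoding?              # true
raw.bytes                        # [99, 97, 102, 233]

utf8 = raw.encode("UTF-8")       # now translate the bytes into UTF-8
utf8                             # "café"
utf8.bytes                       # [99, 97, 102, 195, 169]

Encoding.name_list.include?("ISO-8859-1") # true
```

First fix the label with force_encoding, then translate with encode. Calling encode on the mislabeled string would not give you the right characters.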

Summary

You have learned how computers create characters from numbers by using encoding tables! You’ve also learned about ASCII & Unicode in Ruby.

Now open your editor & have some fun practicing 😃

Thanks for reading.

