
Hunting Down Memory Issues In Ruby: A Definitive Guide


I’m sure there are some lucky Ruby developers out there who will never run into issues with memory, but for the rest of us, it’s incredibly challenging to hunt down where memory usage is getting out of hand and fix it. Fortunately, if you’re using a modern Ruby (2.1+), there are some great tools and techniques available for dealing with common issues. It could also be said that memory optimization can be fun and rewarding, although I may be alone in that sentiment.

If you thought bugs were pesky, wait until you hunt for memory issues.

As with all forms of optimization, odds are that it will add code complexity, so it’s not worth doing unless there are measurable and significant gains.

Everything described here is done using the canonical MRI Ruby, version 2.2.4, although other 2.1+ versions should behave similarly.

It’s Not a Memory Leak!

When a memory issue is discovered, it’s easy to jump to the conclusion that there’s a memory leak. For example, in a web application, you may see that after you spin up your server, repeated calls to the same endpoint keep driving memory usage up higher with each request. There are certainly cases where legitimate memory leaks happen, but I’d wager they are vastly outnumbered by memory issues with this same appearance that aren’t actually leaks.

As a (contrived) example, let’s look at a bit of Ruby code that repeatedly builds a big array of hashes and discards it. First, here’s some code that’ll be shared throughout the examples in this post:

# common.rb
require "active_record"
require "active_support/all"
require "get_process_mem"
require "sqlite3"

ActiveRecord::Base.establish_connection(
  adapter: "sqlite3",
  database: "people.sqlite3"
)

class Person < ActiveRecord::Base; end

def print_usage(description)
  mb = GetProcessMem.new.mb
  puts "#{ description } - MEMORY USAGE(MB): #{ mb.round }"
end

def print_usage_before_and_after
  print_usage("Before")
  yield
  print_usage("After")
end

def random_name
  (0...20).map { (97 + rand(26)).chr }.join
end

And the array builder:

# build_arrays.rb
require_relative "./common"

ARRAY_SIZE = 1_000_000

times = ARGV.first.to_i

print_usage(0)
(1..times).each do |n|
  foo = []
  ARRAY_SIZE.times { foo << {some: "stuff"} }

  print_usage(n)
end

The get_process_mem gem is just a convenient way to get the memory being used by the current Ruby process. What we see is the same behavior that was described above, a continual increase in memory usage.

$ ruby build_arrays.rb 10
0 - MEMORY USAGE(MB): 17
1 - MEMORY USAGE(MB): 330
2 - MEMORY USAGE(MB): 481
3 - MEMORY USAGE(MB): 492
4 - MEMORY USAGE(MB): 559
5 - MEMORY USAGE(MB): 584
6 - MEMORY USAGE(MB): 588
7 - MEMORY USAGE(MB): 591
8 - MEMORY USAGE(MB): 603
9 - MEMORY USAGE(MB): 613
10 - MEMORY USAGE(MB): 621

However, if we run more iterations, we’ll eventually plateau.

$ ruby build_arrays.rb 40
0 - MEMORY USAGE(MB): 9
1 - MEMORY USAGE(MB): 323
...
32 - MEMORY USAGE(MB): 700
33 - MEMORY USAGE(MB): 699
34 - MEMORY USAGE(MB): 698
35 - MEMORY USAGE(MB): 698
36 - MEMORY USAGE(MB): 696
37 - MEMORY USAGE(MB): 696
38 - MEMORY USAGE(MB): 696
39 - MEMORY USAGE(MB): 701
40 - MEMORY USAGE(MB): 697

Hitting this plateau is the hallmark of something that isn’t an actual memory leak, or of a leak so small that it’s not visible compared to the rest of the memory usage. What may not be intuitive is why memory usage continues to grow after the first iteration. After all, it built a big array, but then promptly discarded it and started building a new one of the same size. Can’t it just use the space freed up by the previous array? The answer, which explains our problem, is no. Aside from tuning the garbage collector, you don’t have control over when it runs, and what we’re seeing in the build_arrays.rb example is new memory allocations being made prior to garbage collection of our old, discarded objects.
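
If you want to convince yourself that this memory really is reclaimable, you can force a collection between iterations. Here’s a minimal sketch (for demonstration only; calling GC.start by hand is rarely a good idea in production code), reusing common.rb from above. With the discarded array collected before the next one is built, the reported usage should stay much closer to the first iteration’s level:

# build_arrays_gc.rb
require_relative "./common"

ARRAY_SIZE = 1_000_000

times = ARGV.first.to_i

print_usage(0)
(1..times).each do |n|
  foo = []
  ARRAY_SIZE.times { foo << {some: "stuff"} }

  foo = nil # drop the only reference so the array becomes collectable
  GC.start  # force a collection before the next allocation burst

  print_usage(n)
end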

Do not panic if you see a sudden rise in the memory usage of your app. Apps can run out of memory for all sorts of reasons – not just memory leaks.

I should point out that this isn’t some sort of horrible memory management issue specific to Ruby, but is generally applicable to garbage-collected languages. Just to reassure myself of this, I reproduced essentially the same example with Go and saw similar results. However, there are Ruby libraries that make it easy to create this sort of memory issue.

Divide and Conquer

So if we need to work with large chunks of data, are we doomed to just throw lots of RAM at our problem? Thankfully, that’s not the case. If we take the build_arrays.rb example and decrease the array size, we’ll see a decrease in the point where memory usage plateaus that’s roughly proportional to the array size.

This means that if we can break our work into smaller pieces to process and avoid having too many objects existing at one time, we can dramatically reduce the memory footprint. Unfortunately, that often means taking nice, clean code and turning it into more code that does the same thing, just in a more memory-efficient way.
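As a rough sketch of what that restructuring can look like, here’s a chunked variant of the build_arrays.rb idea. The process_chunk method is a hypothetical stand-in for whatever work actually needs doing; the point is that only the current chunk ever needs to stay reachable, so memory usage should plateau far lower than in the all-at-once version:

# divide_and_conquer.rb
require_relative "./common"

TOTAL_SIZE = 1_000_000
CHUNK_SIZE = 10_000

def process_chunk(chunk)
  # Hypothetical placeholder: in real code this might write the chunk to a
  # file, insert it into a database, send it over the network, etc.
end

print_usage_before_and_after do
  (0...TOTAL_SIZE).each_slice(CHUNK_SIZE) do |indexes|
    chunk = indexes.map { {some: "stuff"} }
    process_chunk(chunk)
    # chunk goes out of scope here, so it can be garbage collected while the
    # next chunk is being built
  end
end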

Isolating Memory Usage Hotspots

In a real codebase, the source of a memory issue will likely not be as obvious as in the build_arrays.rb example. Isolating a memory issue before trying to actually dig in and fix it is essential because it’s easy to make incorrect assumptions about what’s causing the problem.

I generally use two approaches, often in combination, to track down memory issues: leaving the code intact and wrapping a profiler around it, and monitoring memory usage of the process while disabling/enabling different parts of the code I suspect could be problematic. I’ll be using memory_profiler here for profiling, but ruby-prof is another popular option, and derailed_benchmarks has some great Rails-specific capabilities.

Here’s some code that’ll use a bunch of memory, where it may not be immediately clear which step is pushing up memory usage the most:

# people.rb
require_relative "./common"

def run(number)
  Person.delete_all

  names = number.times.map { random_name }

  names.each do |name|
    Person.create(name: name)
  end

  records = Person.all.to_a

  File.open("people.txt", "w") { |out| out << records.to_json }
end

Using get_process_mem, we can quickly verify that it does use a lot of memory when there are a lot of Person records being created.

# before_and_after.rb
require_relative "./people"

print_usage_before_and_after do
  run(ARGV.shift.to_i)
end

Result:

$ ruby before_and_after.rb 10000
Before - MEMORY USAGE(MB): 37
After - MEMORY USAGE(MB): 96

Looking through the code, there are multiple steps that seem like good candidates for using a lot of memory: building a big array of strings, calling #to_a on an Active Record relation to make a big array of Active Record objects (not a great idea, but done for demonstration purposes), and serializing the array of Active Record objects.

We can then profile this code to see where memory allocations are happening:

# profile.rb
require "memory_profiler"
require_relative "./people"

report = MemoryProfiler.report do
  run(1000)
end
report.pretty_print(to_file: "profile.txt")

Note that the number being fed to run here is 1/10 of the previous example, since the profiler itself uses a lot of memory, and can actually lead to memory exhaustion when profiling code that already causes high memory usage.

The results file is rather lengthy and includes memory and object allocation and retention at the gem, file, and location levels. There’s a wealth of information to explore, but here are a couple of interesting snippets:

allocated memory by gem
-----------------------------------
  17520444  activerecord-4.2.6
   7305511  activesupport-4.2.6
   2551797  activemodel-4.2.6
   2171660  arel-6.0.3
   2002249  sqlite3-1.3.11

...

allocated memory by file
-----------------------------------
   2840000  /Users/bruz/.rvm/gems/ruby-2.2.4/gems/activesupport-4.2.6/lib/active_support/hash_with_indifferent_access.rb
   2006169  /Users/bruz/.rvm/gems/ruby-2.2.4/gems/activerecord-4.2.6/lib/active_record/type/time_value.rb
   2001914  /Users/bruz/code/mem_test/people.rb
   1655493  /Users/bruz/.rvm/gems/ruby-2.2.4/gems/activerecord-4.2.6/lib/active_record/connection_adapters/sqlite3_adapter.rb
   1628392  /Users/bruz/.rvm/gems/ruby-2.2.4/gems/activesupport-4.2.6/lib/active_support/json/encoding.rb

We see the most allocations happening inside Active Record, which would seem to point at either instantiating all the objects in the records array, or serialization with #to_json. Next, we can test our memory usage without the profiler while disabling these suspects. We can’t disable retrieving records and still be able to do the serialization step, so let’s try disabling serialization first.

  # File.open("people.txt", "w") { |out| out << records.to_json }

Result:

$ ruby before_and_after.rb 10000
Before: 36 MB
After: 47 MB

That does indeed seem to be where most of the memory is going, with the before/after memory delta dropping 81% (from 59 MB down to 11 MB) by skipping it. We can also see what happens if we stop forcing the big array of records to be created.

  # records = Person.all.to_a
  records = Person.all

  # File.open("people.txt", "w") { |out| out << records.to_json }

Result:

$ ruby before_and_after.rb 10000
Before: 36 MB
After: 40 MB

This reduces memory usage as well, although it’s an order of magnitude less reduction than disabling serialization. So at this point, we know our biggest culprits, and can make a decision about what to optimize based on this data.

Although the example here was contrived, the approaches are generally applicable. Profiler results may not point you at the exact spot in your code where the problem lies, and can also be misinterpreted, so it’s a good idea to follow up by looking at actual memory usage while turning sections of code on and off. Next, we’ll look at some common cases where memory usage becomes an issue and how to optimize them.

Deserialization

A common source of memory issues is deserializing large amounts of data from XML, JSON or some other data serialization format. Using methods like JSON.parse or Active Support’s Hash.from_xml is incredibly convenient, but when the data you’re loading is large, the resulting data structure that’s loaded in memory can be enormous.

If you have control over the source of the data, you can do things to limit the amount of data you’re receiving, like adding filtering or pagination support. But if it’s an external source or one you can’t control, another option is to use a streaming deserializer. For XML, Ox is one option, and for JSON, yajl-ruby appears to operate similarly, although I don’t have much experience with it.

Just because you have limited memory doesn’t mean you cannot parse large XML or JSON documents safely. Streaming deserializers allow you to incrementally extract whatever you need from these documents and still keep the memory footprint low.

Here’s an example of parsing a 1.7MB XML file, using Hash.from_xml.

# parse_with_from_xml.rb
require_relative "./common"

print_usage_before_and_after do
  # From http://www.cs.washington.edu/research/xmldatasets/data/mondial/mondial-3.0.xml
  file = File.open(File.expand_path("../mondial-3.0.xml", __FILE__))
  hash = Hash.from_xml(file)["mondial"]["continent"]
  puts hash.map { |c| c["name"] }.join(", ")
end
$ ruby parse_with_from_xml.rb
Before - MEMORY USAGE(MB): 37
Europe, Asia, America, Australia/Oceania, Africa
After - MEMORY USAGE(MB): 164

127MB for a 1.7MB file! This clearly is not going to scale up well. Here’s the streaming parser version.

# parse_with_ox.rb
require_relative "./common"
require "ox"

class Handler < ::Ox::Sax
  def initialize(&block)
    @yield_to = block
  end

  def start_element(name)
    case name
    when :continent
      @in_continent = true
    end
  end

  def end_element(name)
    case name
    when :continent
      @yield_to.call(@name) if @name
      @in_continent = false
      @name = nil
    end
  end

  def attr(name, value)
    case name
    when :name
      @name = value if @in_continent
    end
  end
end

print_usage_before_and_after do
  # From http://www.cs.washington.edu/research/xmldatasets/data/mondial/mondial-3.0.xml
  file = File.open(File.expand_path("../mondial-3.0.xml", __FILE__))
  continents = []
  handler = Handler.new do |continent|
    continents << continent
  end
  Ox.sax_parse(handler, file)

  puts continents.join(", ")
end
$ ruby parse_with_ox.rb
Before - MEMORY USAGE(MB): 37
Europe, Asia, America, Australia/Oceania, Africa
After - MEMORY USAGE(MB): 37

This brings us down to a negligible memory increase and should be able to handle vastly larger files. However, the tradeoff is that we now have 28 lines of handler code we didn’t need before, which seems like it’d be error prone, and for production use it should have some tests around it.
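
For what it’s worth, such a test doesn’t need the 1.7MB file; a few inline elements are enough to exercise the handler. Here’s a minimal sketch using Minitest, assuming the Handler class has been extracted into its own handler.rb so it can be required without running the parsing script:

# handler_test.rb
require "minitest/autorun"
require "stringio"
require "ox"
require_relative "./handler" # assumes Handler has been moved out of parse_with_ox.rb

class HandlerTest < Minitest::Test
  XML = %(<mondial><continent name="Europe"/><continent name="Asia"/></mondial>)

  def test_yields_each_continent_name
    names = []
    handler = Handler.new { |name| names << name }
    Ox.sax_parse(handler, StringIO.new(XML))

    assert_equal ["Europe", "Asia"], names
  end
end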

Serialization

As we saw in the section about isolating memory usage hotspots, serialization can have high memory costs. Here’s the key part of people.rb from earlier.

# to_json.rb
require_relative "./common"

print_usage_before_and_after do
  File.open("people.txt", "w") { |out| out << Person.all.to_json }
end

Running this with 100,000 records in the database, we get:

$ ruby to_json.rb
Before: 36 MB
After: 505 MB

The issue with calling #to_json here is that it instantiates an object for every record, and then encodes to JSON. Generating the JSON record by record, so that only one record object needs to exist at a time, reduces the memory usage significantly. None of the popular Ruby JSON libraries appear to handle this, but a commonly recommended approach is to build the JSON string manually. There is a json-write-stream gem that provides a nice API for doing this, and converting our example to use it looks like this:

# json_stream.rb
require_relative "./common"
require "json-write-stream"

print_usage_before_and_after do
  file = File.open("people.txt", "w")
  JsonWriteStream.from_stream(file) do |writer|
    writer.write_object do |obj_writer|
      obj_writer.write_array("people") do |arr_writer|
        Person.find_each do |person|
          arr_writer.write_element person.as_json
        end
      end
    end
  end
end

Once again, we see optimization has given us more code, but the result seems worth it:

$ ruby json_stream.rb
Before: 36 MB
After: 56 MB

Being Lazy

A great feature added to Ruby starting with 2.0 is the ability to make enumerators lazy. This is great for improving memory usage when chaining methods on an enumerator. Let’s start with some code that isn’t lazy:

# not_lazy.rb
require_relative "./common"

number = ARGV.shift.to_i

print_usage_before_and_after do
  names = number.times
                .map { random_name }
                .map { |name| name.capitalize }
                .map { |name| "#{ name } Jr." }
                .select { |name| name[0] == "X" }
                .to_a
end

Result:

$ ruby not_lazy.rb 1_000_000
Before: 36 MB
After: 546 MB

What happens here is that at each step in the chain, every element in the enumerator is iterated over, producing an intermediate array that the next method in the chain is then invoked on, and so forth. Let’s see what happens when we make this lazy, which just requires adding a call to lazy on the enumerator we get from times:

# lazy.rb
require_relative "./common"

number = ARGV.shift.to_i

print_usage_before_and_after do
  names = number.times.lazy
                .map { random_name }
                .map { |name| name.capitalize }
                .map { |name| "#{ name } Jr." }
                .select { |name| name[0] == "X" }
                .to_a
end

Result:

$ ruby lazy.rb 1_000_000
Before: 36 MB
After: 52 MB

Finally, an example that gives us a huge memory usage win, without adding a lot of extra code! Note that if we didn’t need to accumulate any results at the end, for instance, if each item was saved to the database and could then be forgotten, there would be even less memory usage. To make a lazy enumerable evaluate at the end of the chain, just add a final call to force.
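
As a small sketch of that difference (reusing the chain from lazy.rb), iterating with each lets every matching name be handled and then forgotten (here it’s saved as a Person record, as in the earlier examples), while force, which is an alias for to_a on lazy enumerators, is what you’d call when you actually want the array back at the end:

# lazy_consume.rb
require_relative "./common"

number = ARGV.shift.to_i

print_usage_before_and_after do
  pipeline = number.times.lazy
                   .map { random_name }
                   .map { |name| name.capitalize }
                   .select { |name| name[0] == "X" }

  # Consume one element at a time for side effects only; nothing accumulates.
  pipeline.each { |name| Person.create(name: name) }

  # Or, if the full array is actually needed at the end:
  # names = pipeline.force
end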

Another thing to note about the example is that the chain starts with a call to times prior to lazy, which uses very little memory since it just returns an enumerator that will generate an integer each time it’s invoked. So if an enumerable can be used instead of a big array at the beginning of the chain, that will help.

Keeping everything in huge arrays and maps is convenient, but in real world scenarios, you rarely need to do that.

One real-world application of building an enumerable to lazily feed into some sort of processing pipeline is processing paginated data. So rather than requesting all the pages and putting them into one big array, they could be exposed through an enumerator that nicely hides all the pagination details. This could look something like:

def records
  Enumerator.new do |yielder|
    has_more = true
    page = 1

    while has_more
      response = fetch(page)
      response.records.each { |record| yielder << record }

      page += 1
      has_more = response.has_more
    end
  end
end
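
Consuming it might then look something like the following sketch, where the record fields (and the fetch method above) are hypothetical stand-ins for a real paginated API client. Because first on a lazy enumerator stops as soon as it has enough results, only as many pages as are actually needed get fetched:

# Hypothetical usage of the records enumerator above.
active_names = records
  .lazy
  .select { |record| record[:active] }
  .map    { |record| record[:name] }
  .first(100)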

Conclusion

We’ve done some characterization of memory usage in Ruby, and looked at some general tools for tracking down memory issues, as well as some common cases and ways to improve them. The common cases we explored are by no means comprehensive and are highly biased by the sort of issues I personally have encountered. However, the biggest gain may just be getting in the mindset of thinking about how the code will impact memory usage.

This article was written by Bruz Marzolf, a Toptal Ruby developer.

Celebrating 25 Years of Linux Kernel Development


Linux is now 25 years old, but it’s no hipster. It’s not chasing around Pokemon, and it’s not moving back in with its parents due to crippling student debt. In fact, Linux is still growing and evolving, but the core ideas of the Linux State of Mind remain the same.

You see, Linux is much more than an operating system; it’s a mindset. Even if you don’t agree with its philosophy, you can’t afford to ignore it.

That’s why we decided to pay homage to this iconic operating system and the ever-growing community of developers who keep it going.

25 years of Linux: Honoring the great penguin coup

To mark the occasion, the Linux Foundation recently published the seventh edition of its Linux Kernel Development Report, which offers a detailed recap of all the work done over the past couple of decades. The adoption of Git, 10 years ago, made tracking easier (not that we’re looking for exact numbers here). It’s estimated that more than 14,000 developers have invested time and effort in Linux kernel development since 2005. This army of talent comes from more than 1,300 companies, and the report lists a number of industry heavyweights as the main sponsors of Linux kernel development: Intel, Samsung, Red Hat, AMD, Google, ARM, Texas Instruments and more.

While it’s the epitome of open-source, Linux kernel development is not a hobby. Not anymore. So, as we wish Linux a happy birthday, let’s take a quick look at some kernel development highlights:

  • 25 years of development
  • Contributions from 14,000 developers since 2005
  • 5,000 new developers joined the effort in the past 30 months
  • ~22 million lines of code currently constitute the Linux Kernel
  • More than 4,500 lines of new code added each day
  • Development is speeding up

Linux State of Mind

When it was first released in August 1991, few could have imagined the long-term impact of Linus Torvalds’ open-source OS on the software industry. At the time, the tech landscape was dominated by a handful of big players, the likes of Microsoft, Apple, and IBM. The nineties were an era of rapid technological progress, and new technologies – most notably the Internet – made remote, distributed development a possibility.

Developers halfway around the globe could finally collaborate on immensely complex software projects. It goes without saying that Toptal, and indeed every freelancer, owes a debt of gratitude to Linux pioneers who validated the concept of remote software development in an era of dial-up internet. They made it work, without Git, Skype, broadband, and a bunch of other technologies and tools we take for granted today. In fact, most of these tools were in part made possible by Linux-based servers and many are open-source.

But what drove the industry to adopt Linux? Well, to put it bluntly, the simple fact of not being Microsoft was a big part of it. A lot of UNIX people simply had an issue with proprietary operating systems; diehards couldn’t come to terms with the fact that mainstream operating systems were proprietary walled gardens. Their vision was to create an open-source alternative, something that everyone could use free of charge, something they could modify and redistribute at will.

Idealism and business rarely cross paths, but when they do, we often end up with novel ideas backed by passionate proponents and criticized by equally passionate detractors. The idea of an open-source software ecosystem is as powerful today as it was in the early nineties, and with a quarter century of Linux development behind us, we can get a better idea of its profound impact on industry.

Open-Sourcing and Democratising The Internet

But wait, most of us are reading this on non-Linux systems: Windows and Mac rigs, smartphones and tablets running UNIX-like operating systems. So why aren’t we on Linux? Well, we are, at least sort of. How many LAMP servers sprang into action today to serve you your daily dose of emails, social feed updates, useless ads and (mis)information?

Personally, I think this is the biggest contribution to mankind made by the Linux community: Linux-based servers helped our industry take off and legitimized the open-source concept.

It was no longer about UNIX enthusiasts trying to create an open-source alternative to fight The Empire; Linux took on big brands on their home turf and emerged victorious. The concept was vindicated and mainstreamed, proving once and for all that open-source isn’t just a heartwarming notion; it’s good for business.

What did we get out of it?

Linux helped lower the bar for developers and entrepreneurs entering the industry. Successful Linux distros grabbed a sizeable market share in the hosting industry, generating pressure on competing platforms. In this war of attrition, Linux servers prevailed thanks to a number of factors. In the end, they came to dominate many market segments. Today, anyone can get a reasonably powerful hosting plan for peanuts, and if they’re looking for the cheapest possible solution, they’re bound to end up with a flavor of Linux. The rest of the stack is usually as free and open as Linux itself.

That’s what our side of the industry got out of Linux: The ability to quickly deploy products on low-cost, open-source infrastructure.

How many pet projects, started on the cheap, turned into multi-billion enterprises? How many would have failed had it not been for Linux?

Where’s the Money, Linuxowski?

A common misconception about Linux development is that it’s handled solely by enthusiasts and that it’s not a niche for people looking to cash in. While Linux is a labor of love, it’s also big business in its own way.

As I highlighted earlier, development is speeding up, and more Linux developers from more companies are choosing to contribute. They’re not simply choosing to set aside their precious time because they are good Linux folk; the latest report states that the number of unpaid developers working on the kernel has dropped to 7.7 percent, dipping into single-digit territory for the first time.

While some might not agree, I see this as a very positive trend. Enthusiasm doesn’t pay the bills, and it’s hard to keep any project going on enthusiasm alone for more than a few years, let alone a gargantuan project like Linux that came into being a generation ago.

It doesn’t end there. According to numerous surveys, demand for Linux talent remains robust, and is actually increasing, and so is the Linux server market share. A few years ago, it would have been much easier to tally up the number of shipped servers, motherboards, and other hardware, and figure out the number of Linux boxes in the wild.

This is no longer the case.

Linux in The Cloud

A dark Cloud came along and made this process more difficult, much to the dismay of analysts. When your job is to look at numbers and market trends, any lack of data or ambiguity is bad for business, and for a while analysts expressed concerns about the future of Linux in the post-cloud era. These concerns made a lot of sense (and, to some extent, still do) because the cloud ecosystem was an oligopoly from the get-go, dominated by the Amazons and Googles of the world.

Does the Cloud spell doom for cheap Linux servers and is there a silver lining?

The Cloud did not kill off small Linux servers, but it hasn’t been kind to them either:

  • At one end of the spectrum, you’ll find people who believe the cloud will transform the server market, and through consolidation, will forever change the hosting industry. This economy of scale argument is tempting because it’s logical to assume cloud industry leaders will offer superior pricing by virtue of their size. You don’t get sweetheart hardware deals if you have a small, regional datacenter and need a couple of hundred fresh boxes every year; you get them if you have a massive cloud infrastructure and need dozens of new servers on a weekly basis. However, I find this argument overly simplistic.
  • The opposing camp espouses equally simplistic views, but it tends to be more optimistic. A lot of Linux veterans have high hopes for cloud development; they believe CloudStack and OpenStack will help turn the tide, and they think there will always be room for smaller players.

As usual, the truth is somewhere in the middle, but let’s not weigh in on this; it’s beyond the scope of this article. Suffice it to say that both options could work for Linux in the long run. Even if the hosting industry is forever transformed and consolidated, that doesn’t mean demand for Linux talent will evaporate. On the contrary, it’s likely to increase regardless of what happens, although demand will evolve to meet new requirements.

The Next 25 Years

What do the next 25 years have in store for Linux?

It’s hard to say, but I have a feeling Linux isn’t going anywhere, at least not in the foreseeable future:

  • The server industry is evolving, but it’s been doing so forever. Linux has a habit of seizing server market share, although the cloud could transform the industry in ways we’re just beginning to realize. Either way, Linux servers aren’t going anywhere just yet.
  • Linux still has a relatively low market share in consumer markets, dwarfed by Windows and OS X. This will not change anytime soon.
  • Linux distributions do not have a significant share in mobile, although Android, which is built on the Linux kernel, currently dominates the space. Mobile is becoming an Android/iOS duopoly, and the market is oversaturated; there are too many software and hardware platforms out there, so it’s doubtful standalone Linux will ever take off in this market.
  • Gaming is a potentially huge, untapped market for Linux. This market is dominated by Windows in the desktop segment, proprietary operating systems in the console space, and Android and iOS in mobile. Valve’s SteamOS is the latest attempt to get Linux on gaming rigs, and it’s a promising concept. Unfortunately, demand for Steam Machines has been soft and Linux still has a negligible market share in the gaming industry.
  • Emerging segments include the Internet of Things (IoT), wearables, smart home devices, and more. Due to its open-source nature and the potential for a very small OS footprint, Linux-based operating systems could find their way into a range of connected devices, from our homes and cars to our places of business.
  • High-performance computing has a good chance of becoming a Linux-only space. Linux has practically replaced UNIX and other operating systems in current-generation supercomputers.

It’s hard to make Linux-related predictions due to the nature of the OS and the Linux community. Evolution doesn’t necessarily have to be a straight line, and Linux developers have proven this time and again. Linux could morph into something completely different over the next couple of decades and become the OS of choice for various products and services we can’t even imagine today.

This article was written by Nermin Hajdarbegovic, a Toptal Technical Editor.

11 Essential Linux Interview Questions


1. How would you swap the stdout and stderr of a command?

$ command 3>&2 2>&1 1>&3

To swap stdout and stderr of a command, a third file descriptor is created (in this case 3), which is assigned to the same target that stderr points to (referenced by &2). Then stderr is pointed to the same target stdout points to (&1). Finally, stdout is pointed back to where the newly created file descriptor points (which is the same target stderr originally pointed to).

2. How would you count every occurrence of the term “potato” in all the files appearing under the current directory, and its subdirectories, recursively?

$ grep -orI potato . | wc -l

To list every occurrence of the term “potato” on a separate line, one must run grep -o potato <path>. Adding the r flag to the command makes the search recursively process every file under the given path, and the I flag ensures that matches in binary files are ignored. In addition, the w flag can be included to match the exact term only, and ignore superstrings such as “potatoes”, and to make the search case-insensitive, the i flag can be added as well:

$ grep -iworI potato . | wc -l

The number of lines yielded by this grep command is the number of occurrences of the desired term, which can then be counted by piping it into the wc -l command.

3. How would you write a shell script that prints all the additional arguments passed to it in reverse order?

for (( i = ${#}; i > 0; i-- )); do
        echo ${!i}
done

The arguments are available as $<n>, where n is the position of the argument. For example, $0 would give the name of the script, $1 would give the first additional argument, $2 the second, and so on. The total number of additional arguments is found in $#.

A loop that starts at $# and ends at 1 can be used to print each additional argument in reverse order.

4. How would you write a shell script and ensure that only one instance of it may run at a time for each user? Strong atomicity is not required.

In Bash:

LOCKFILE=/tmp/lock-`whoami`
if [ -e ${LOCKFILE} ] && kill -0 `cat ${LOCKFILE}`; then
    echo "Already running!"
    exit 1
fi
trap "rm -f ${LOCKFILE}; exit" INT TERM EXIT
echo $$ > ${LOCKFILE}

Start by determining a name for the lock file. In this case, the lock file is generated by suffixing a common name with the username of the current user.

Then, check if the lock file exists and if the PID contained within the lock file is running. If it is, exit with a message.

Create a trap to remove the lock file on a clean exit, or unclean exits (any exit with the signal INT or TERM).

Finally, if the script has not exited yet, create the lock file, and store the PID of the current process ($$) in it.

5. What are shared, slave, private, and unbindable mountpoints? 

A mount point that is shared may be replicated as many times as needed, and each copy will continue to be the exact same. Other mount points that appear under a shared mount point in some subdirectory will appear in all the other replicated mount points as well.

A slave mount point is similar to a shared mount point, with the small exception that the “sharing” of mount point information happens in only one direction. A mount point that is a slave will only receive mount and unmount events; anything that is mounted under this replicated mount point will not propagate back to the original mount point.

A private mount point is exactly what the name implies: private. Mount points that appear under a private mount point will not be shown elsewhere in the other replicated mount points unless they are explicitly mounted there as well.

An unbindable mount point, which by definition is also private, cannot be replicated elsewhere through the use of the bind flag of the mount system call or command.

6. What are some basic measures that you would take to harden a server’s SSH service?

There are some very simple steps that can be taken to initially harden the SSH service, such as:

  • Forcing the service to use only version 2 of the protocol will improve both security and functionality.
  • Disabling root login, and even password-based logins, will further reinforce the security of the server.
  • A whitelist approach can be taken, where only users that belong to a certain list can log in via SSH to the server.
  • Disabling password-based login will require you to allow key-based logins, which is secure, but can be taken further by restricting their use to only certain IP addresses.
  • Changing the port to something other than 22 significantly decreases random brute-force attempts from the internet.

Sometimes the only use for an SSH service on a server is transferring files to and from it (typically using tools like scp). In such a case, it is possible to change the shell of the user to something restrictive, such as rssh.

Finally, it is often desirable to know exactly what is going on while you are not logged into the server. The logging verbosity may be increased if needed. Often, it is the logs that allow one to figure out if a key has indeed been stolen and is being abused.

 

7. What is a Unix shell? Is Bash the only Unix shell?

A Unix shell is a program that provides a user interface for the underlying operating system. Unix shells typically provide a textual user interface – a command line interpreter – that can be used for entering and running commands, or for creating scripts that run a series of commands and can be used to express more advanced behavior.

Bash is not the only Unix shell, but just one of many. Short for Bourne-Again Shell, it is also one of the many Bourne-compatible shells. However, Bash is arguably one of the most popular shells around. There are other, modern shells available that often retain backwards compatibility with Bash but provide more functionality and features, such as the Z Shell (zsh).

8. Where is the target path of a symlink stored? How are permission settings for symlinks handled?

The target path of a symlink is stored in an inode – the data structure used to store file information on disk.

Typically, the permission settings of the symlink itself only control renaming and removal operations performed on the symlink. Any operation that deals with the contents of the linked-to file is controlled by the permission settings of the target file.

 

9. What are terminal multiplexers? What are some of their key features? What are some of the more popular ones currently available?

Terminal multiplexers enable several terminals to be created and controlled from a single screen or a single remote session. The terminals and sessions can be detached and left running even after the user logs off.

Two of the more common ones available today are GNU Screen and tmux.

Screen enables you to connect to multiple remote servers without needing to open multiple terminal shells. Work can be preserved and a session detached, for example, to wait for the output of a long-running command. On subsequent reconnection, users can reattach to existing sessions or run new sessions. Sessions can also be shared among different users, which may be useful in audit or training scenarios.

Both Screen and tmux support split-screen functionality (to be more precise, tmux supports this and Screen supports it via a plugin). This allows, for example, running tail on a service’s log file in one part of the screen, and editing the configuration of that service, and restarting it if necessary, in another.

 

10. What would be a simple way to continuously monitor the log file for a service that is running?

Probably the simplest and most common way to do this would be by using the command:

tail -F $LOGFILE

where $LOGFILE is an environment variable corresponding to the path to the log file to be monitored.

By default, the Linux tail command prints the last 10 lines of a given file to standard output. The -F option causes additional file content to be displayed in realtime as the file continues to grow. This yields a simple mechanism for monitoring services via their log files in close to realtime.

Two other specific command line options of interest in this context are:

  • The -s option causes tail to sleep for a specified number of seconds between updates (e.g., tail -F -s 10 will update the displayed file contents roughly every 10 seconds rather than in close to realtime as the file is updated).
  • The -n option can be used to specify a number of lines other than 10 to initially display (e.g., tail -n 20 -F will first display the last 20 lines of the file and will then continue updating the output in realtime).

 

11. What is a Linux null (or Blackhole) route? How can it be used to mitigate unwanted incoming connections?

A Linux null (or Blackhole) route is a type of routing table entry which, upon matching a packet, discards it without forwarding the packet any further or sending any ICMP response.

Using this technique, it is possible to block an IP (or range of IP addresses) by running a simple command. For example, blocking 192.168.0.1 can simply be done with the following command:

# ip route add blackhole 192.168.0.1/32 

This article is from Toptal.