How to solve C/C++ memory leaking (in Linux)?

My hobby project Med is written in C++. A lot of implementations need to use dynamic memory allocation and instantiation. Because complex data is impractical to be passed by value, like the case of JavaScript object and array. Since C++ doesn’t have garbage collection, it is possible that the developer doesn’t free/delete the dynamically created memory properly.

As in my project Med, the program will retrieve the memory from other process. That means, it needs to make a copy of the scanned memory. And this will involve creating dynamic memory (using new operator). When using the program to filter the memory to get the target result, it needs to get a new copy of memory with the updated values, then compare with the previous copy. The unmatched will need to be discarded (free/delete); the matched will need to replace the old one (also free/delete the old one, because the new one is dynamically allocated).

Though it can be easily expressed in verbal form, writing the algorithm is usually inducing some human mistakes, such as using wrong variable name. As a result, we may access the memory that is not dynamically allocated, or freeing the memory twice. These will cause segmentation fault, and C++ compiler will not tell you where the error comes from. Debugging is hellish in this case. As a result, I used several solutions to fix memory leaking.

Adopt TDD

I have covered this in my previous post. Smaller function is easier to be tested. Just make sure your functions or methods are testable.

Valgrind (Linux only)

Compile your application with debugging information (g++ with -g option). Then run your test suite with valgrind.

valgrind --leak-check=yes ./testSnapshot

It will tell you which function access the non-allocated memory, how many bytes are untracked, and so on.

Valgrind is super useful to check memory leaking.

Memory Manager

Because I wrote the code with some mistakes, so the memory is freed twice and causes the error. But I failed to find where is the cause (which finally I found it).

Therefore, I wrote a memory manager to make sure the memory will not be freed more than once. However, this is actually not a good solution to avoid memory leaking. It just prevents the program to free the memory twice.

ByteManager::ByteManager() {}

ByteManager::~ByteManager() {
  clear();
}

// Singleton
ByteManager& ByteManager::getInstance() {
  static ByteManager instance;
  return instance;
}

Byte* ByteManager::newByte(int size) {
  Byte* byte = new Byte[size];
  recordedBytes[byte] = byte;
  return byte;
}

void ByteManager::deleteByte(Byte* byte) {
  auto found = recordedBytes.find(byte);
  if (found != recordedBytes.end()) {
    delete[] recordedBytes[byte];
    recordedBytes.erase(found);
  }
}

void ByteManager::clear() {
  for (auto it = recordedBytes.begin(); it != recordedBytes.end(); ++it) {
    delete[] it->second;
  }
  recordedBytes.clear();
}

map ByteManager::getRecordedBytes() {
  return recordedBytes;
}

Then, I refactored both new operator and delete operator to be replaced by newByte() and deleteByte() methods. And I use a std::map to store all the created memory.

Smart pointer

C++11 introduced smart pointers shared_ptr and unique_ptr. unique_ptr can be replaced by auto_ptr before C++11. By using smart pointer, we can use new operator to instantiate an object, then we need not to delete it. Because it will be deleted automatically when it lost the reference. For example,

typedef std::shared_ptr<SnapshotScan> SnapshotScanPtr;

// some where else
SnapshotScanPtr matched = SnapshotScanPtr(new SnapshotScan(curr.getAddress() + i + currOffset, scanType));

So, program will instantiate a new SnapshotScan object as the shared_ptr. Then it will be stored in a std::vector. When it lost the reference, such as removed from the std::vector, it will be deleted automatically.

In my opinion, it is a better solution than the Memory Manager above. However, my existing project can be hardly refactored to use smart pointer.

C++ future

Recently updating my hobby project Med, memory editor for Linux, still under heavy development with various bugs.

In this project, I use several C++1x features (compiled with C++14 standard). Most recent notable feature is multi-threading scanning. In memory scanning, scan through the accessible memory blocks sequentially is slow. Therefore, I need to scan the memory blocks in parallel. To implement this, I have to create multiple threads to scan through the memory blocks.

How many threads I need? I make it into variable n, default to 4. Meaning, when scanning is started, the n threads will start scanning asynchronously. When one of the thread finish scanning, the next (n+1) thread will start scanning the next (n+1) memory block, until the end.

I design the solution top-down, and implement it bottom-up. In order to design the solution for the requirement above, I created a ThreadManager (header file here). So the ThreadManager basically will let me queue the tasks that I am going to launch in parallel with n threads. After queuing all the tasks, I just need to start, then they will run in parallel as multi-threading. This is what ThreadManager is doing. If mutex is needed, it is the task need to handle with, not the ThreadManager to handle. ThreadManager just make sure the tasks are run in parallel.

This is the simple test that uses the ThreadManager.

Technically, there are several important C++ standard libraries used, vector, functional, future, mutex, and condition_variable. Vector is an STL that allows me to store a list of items as vector (just like array or list).

Since C++11, it supports lambda expression. Then using functional, I can use std::function template to create any function object.

std::function<void()> fn = []() {
  for (int i = 0; i < 4; i++) {
    this_thread::sleep_for(chrono::milliseconds(300));
    cout << "thread1: " << i << endl;
  }
};

The code above initialize a variable fn which stores an anonymous function. Previously, this can be done using callback function, which makes the code difficult to manage. By using std::function and std::vector, I can store all the anonymous functions to the vector.

Future is a very interesting library. If we are familiar with JavaScript promise or C# async, then it is similar to these (futures and promises). Whenever a task is start, it will return a future. Because we don’t know when the task will be ended. We can also do something like using a loop to check the condition of a task whether is ended, but this will be over complicated. Because future will let you handle what should be done when the task is ended.

Using future, I need not to create thread directly (though it is called ThreadManager). I can use async function to run the callback function asynchronously. It is the async that returns future. And this async function allows lambda expression as function argument. Great C++11.

C++11 supports mutex (mutual exclusion) and condition variable. Mutex can prevent race condition. When we are using multi-threading, most of the time the threads are using some shared resource. Read the empty data may crash the program. Therefore, we need to make sure when reading or writing, the resource is not accessible by other threads. This can be done by locking the mutex, so that other threads cannot continue. Then after the operation, unlock the mutex, and the other threads can lock the mutex and continue. Hence, only a single thread can access the resource.

Condition variable is used together with mutex. We can use condition variable to wait if a condition is fulfilled. When a wait is performed, the mutex will be locked (by unique lock). As a result, the thread will be blocked and cannot continue. The thread will wait until the condition variable notifies the thread to perform a condition check. If it is fulfilled, then the mutex will be unlocked and the thread will continue.

In ThreadManager, my previous code uses a loop to check the condition, if the condition doesn’t allows to run the next thread, then it will sleep a while and check again. This method is wasting the CPU resources. Because it keeps checking the condition. By using condition variable and mutex, I can just stop the thread, until it is notified to continue.

Yeah. Modern C++ is cool!

Closure

In JavaScript, we can create closure like this,

var foo = (() => {
  let x = 0;
  return () => {
    return x++;
  }
})();

for (var i = 0; i < 10; i++) {
  var x = foo();
  console.log(x);
}

But translating above code into C++, it cannot work as expected for the variable, unless the variable is outside the lambda function.

// In a function
int x; // variable here
auto foo = [&x]() {
  x = 0;
  return [&x]() {
    return x++;
  };
}();
for (int i = 0; i < 10; i++) {
  int x = foo();
  cout << x << endl;
}

This is because C++ variable can be only accessed in the function scope. After the function return the value, the variable is not accessible anymore.

Similarly, Python can also return the function, but it cannot work as closure like JavaScript. However, Python can create non-primitive variable such as object or list so that the variable will be accessed by reference.

def _foo():
    x = [0]
    def increment():
        x[0] += 1
        return x[0]
    return increment

foo = _foo()
for i in range(10):
    x = foo()
    print(x)

C++ revisit

I liked C programming, as it is low level and the compiler is widely available. Using C language can demonstrate the understanding of pointer and memory. You can implement a linked list or a hash table by using C. So that you can understand better how the linked list and hash table work.

But as long as you want to build some end user applications, C language is never a good choice. Choosing a language for our product is important. We do not develop a web application using low level programming language. It is totally impractical.

C and C++

Do not think that C language is an old language. It can survive until today because it has some powerful features (yet I cannot name them out, because I am not a C developer). There are new features on C99 and C11. But practically to solve higher level problem like game development, C++ is much more better than C. (But there is a better choice like Lua.)

Syntax

Why is C++ better than C? From the syntax level, it is easier to read and easier to write. For example,

strtolower(trim(myString));

where myString is a C string, then trim() is a function that returns trimmed string, strtolower() is another function that returns the string with lower case. (The function names I used here is based on PHP.) So, we can see, when we read or write, we need to read from right to left: trim then strtolower. And the parentheses needs to be matched properly.

On the other hand for C++,

myString->trim()->strtolower();

Let’s say myString is a customized object, namely MyString, that manipulates the string. So, the syntax is much more intuitive. myString calls the trim() method and returns the MyString object, so that we can continue with strtolower() that returns MyString object. This is called method chaining.

Object oriented

Object oriented is a wondrous invention! It is so powerful that most programming languages today support object oriented. Inheritance, polymorphoism, and interface, these features reduce a lot of tremendous works. You can implement object oriented in C (using library such as GObject), but C language is not designed for object oriented.

Because of syntax and the object oriented, I prefer C++ than C.

String

Another advantage of C++ is the string. C string is an array of character, with the fixed size. It is not dynamic as C++ string. For example, you want to perform strcat(), you need to have a larger buffer to store the strcat() result. And if you want to do it with dynamic memory allocation, then you have to free the memory manually. This is exhaustive.

Object method

Related to method chaining, in C++ we can write the methods. In C, because it is using struct, to implement something like method is troublesome. For example,

#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int a;
    int (*b)(int a);
} MyType;

int myTypeMethod(int a) {
    return a+20;
}

MyType* myTypeNew() {
    MyType* myType = (MyType*)malloc(sizeof(MyType));
    myType->a = 20;
    myType->b = myTypeMethod; //There is no method in struct, so we assign the callback function
    return myType;
}

void myTypeDelete(MyType** myType) {
    free(*myType);
    *myType = NULL;
}

int main(int argc, char** argv) {
    MyType* myType = myTypeNew();
    printf("%d\n", myType->a);
    printf("%d\n", myType->b(20));
    myTypeDelete(&myType);
    return 0;
}

Because C doesn’t have namespace, but we can do something like namespace by using prefix.

C++11, C++14, C++1z

If you are not C++ lover, you may think that C++ is an archaic programming language. Just like HTML and JavaScript, there are ongoing proposals for the new features for C++. That is why, C++ is not dying yet. There are some notable features I need to state here.

New syntax

“auto” data type

#include <iostream>
#include <typeinfo>
using namespace std;

int main(int argc, char** argv) {
  int a = 10;
  auto b = a;
  cout << b << endl;
  cout << typeid(b).name() << endl; //"i" as integer
  float c = 20.0f;
  b = c;
  cout << b << endl;
  cout << typeid(b).name() << endl; //Still "i"
  return 0;
}

“auto” is just like “var” in C#.

foreach

Most modern programming languages allows foreach keyword, like JavaScript, PHP, and even Python. Though Python uses “for” keyword, it works actually like foreach in PHP.

#include <iostream>
using namespace std;
int main(int argc, char** argv) {
  int a[] = { 4, 5, 3, 2, 1 };
  for(auto item : a) {
    cout << item << endl;
  }
  return 0;
}

Another foreach,

#include <iostream>
#include <vector>
#include <algorithm>
using namespace std;

int main(int argc, char** argv) {
  vector<int> a{ 4, 5, 3, 2, 1 };
  for_each(a.begin(), a.end(), [](auto &item) {
      cout << item << endl;
    });
  return 0;
}

The above code can only be compiled with c++14 standard. If using c++11 standard, we must use “int” instead of “auto”.

Lambda expression

Lambda expression, a.k.a anonymous function. If you are working with JavaScript, such as

var a = [ 4, 5, 3, 2, 1 ];
var found = a.find(function(item) {
    if(item == 2)
        return item;
});

So, the above example shows that the parameter of the find() is a function, without a function name. This is a common syntax in functional programming. Functional programming is a trend of most modern programming language. The anonymous function is also supported by PHP. The advantage of functional programming is the elimination of side effects and avoiding changing state of the data. Therefore, it is a good paradigm for parallel programming.

Therefore, looking at the example above, there is some syntax like,

  ([](int b) {
    cout << b << endl;
  })(12);

Which is similar to JavaScript,

(function(b) {
    console.log(b);
})(12);

In C++, the squared bracket is used to capture the variable of the scope. (I think it is similar to PHP “using” keyword.)

Previously, we can use the callback function like,

#include <iostream>
using namespace std;

int foo(int (*cb)(int, void*), int size, void* data) {
  return cb(size, data);
}

int callback(int size, void* data) {
  int sum = 0;
  int* intData = (int*)data; //typecast
  for(int i=0;i<size;i++) {
    sum += intData[i];
    cout << intData[i] << endl;
  }
  return sum;
}

int main(int argc, char** argv) {
  int a[] = { 4, 5, 3, 2, 1 };
  int sum = foo(callback, 5, a);
  cout << sum << endl;
  return 0;
}

Since we can use anonymous function, we need not to create a callback function like above. This can be written as,

#include <iostream>
using namespace std;

int foo(int (*cb)(int, void*), int size, void* data) {
  return cb(size, data);
}

int main(int argc, char** argv) {
  int a[] = { 4, 5, 3, 2, 1 };
  int result = foo([](int size, void* data) {
      int sum = 0;
      int* intData = (int*)data;
      for(int i=0;i<size;i++) {
        sum+= intData[i];
        cout << intData[i] << endl;
      }
      return sum;
    }, 5, a);
  cout << result << endl;
  return 0;
}

Multi-threading, asynchronous programming, promise

We can implement multi-threading using other libraries like pthread. But with C++11, you can create thread without third party libraries, perform sleep, sleep by milliseconds or even nanoseconds. There is also library for mutex (mutually exclusive).

C++11 also supports asynchronous programming like C# and promise like JavaScript. With the lambda expression, the syntax shares some similarities with C# and JavaScript. With async, we can do something at the background (using thread), and use the result later. With promise, you can promise to return some value, else throw an exception.

Regular expression

Regular expression is so important for text processing. C++11 supports regular expression. There are different regular expression syntax: BRE (basic regular expression) is the default syntax used by “grep” command, ERE (extended regular expression, and Perl regular expression. Perl regular expression is very powerful. If you use Python, JavaScript, Ruby, and C#, you will find that they have different syntax.

However, interestingly C++ uses ECMAScript regular expression syntax by default (though we can choose BRE, ERE, or awk). That means, if you know JavaScript RegExp, then you will have no problem with C++ regex.

(I see the converging features of the different programming languages, which means these features are useful, intuitive, and influential.)

Garbage collection

So far, there is no good garbage collection for C++.

Another thing I have to state. Because of the networking and big data, parallel programming is a demand and the trend. To make sure the data is immutable, functional programming becomes handy.