Tracking Down a File Descriptor Leak in AWS Lambda

Back in August last year we started getting an odd error from our couch-backup Lambda. Out of nowhere, error messages like the one below started appearing.

{
  "message": "EMFILE: too many open files, open '/tmp/couchbackup/couchbackup-f..._2022-08-22T17:30:33.642Z.log'",
  "stack": "Error: EMFILE: too many open files, open '/tmp/couchbackup/couchbackup-f..._2022-08-22T17:30:33.642Z.log'",
  "errno": -24,
  "code": "EMFILE",
  "syscall": "open",
  "path": "/tmp/couchbackup/couchbackup-f..._2022-08-22T17:30:33.642Z.log"
}

Looking through our code, we don't directly open any files or sockets that would cause this type of issue.
Using Lambda Insights we could see that file descriptors were being left open.

(Lambda Insights graph: file descriptor count before the fix)

So, we Googled to see if anyone else had this issue and discovered a blog post that suggested adding a Node package called wtfnode (wtfnode - npm, npmjs.com). Unfortunately, this didn't seem to work; the author of the package does say that it uses undocumented Node APIs and could break without warning.
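For completeness, wiring wtfnode in looks roughly like this (a minimal sketch using its documented dump() call; the handler body here is just a placeholder, not our real code):

const wtf = require('wtfnode');

exports.handler = async (event) => {
  try {
    // ... run the backup (placeholder) ...
  } finally {
    wtf.dump(); // prints the sockets, timers and other handles it can identify
  }
};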

After a few more weeks of Googling, I came across samswen/lambda-emfiles: solve node AWS lambda EMFILE issue (github.com). This package handles the EMFILE issue in a different way: it causes the Lambda process to exit with a status code of 1, which means Lambda throws away the current execution environment and starts a new one. While it didn't fix the underlying issue, it meant we no longer got the retries when a process ran out of file descriptors.
I did a little poking around in the source code of this package to see how it was working out the number of open file descriptors. What I found was:

return new Promise((resolve) => {
  exec(`ls /proc/${process.pid}/fd`, (err, stdout, stderr) => {
    if (err || stderr) {
      this.is_ok = false;
      resolve(this.is_ok);
    } else {
      const parts = stdout.split("\n");
      this.emfiles_count = parts.length - 1;
      resolve(this.is_ok);
    }
  });
});
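So it simply counts the entries in /proc/<pid>/fd. Combined with the exit-with-status-1 trick described above, the overall shape is something like this (a simplified sketch of the idea, not the package's actual API; countOpenFds, doBackup and the threshold are placeholders):

const { execSync } = require('child_process');

// Hypothetical helper: count this process's open file descriptors by
// listing /proc/<pid>/fd, the same way lambda-emfiles does.
function countOpenFds() {
  const out = execSync(`ls /proc/${process.pid}/fd`).toString();
  return out.split('\n').filter(Boolean).length;
}

exports.handler = async (event) => {
  // If a previous invocation leaked descriptors, bail out now: exiting with a
  // non-zero status makes Lambda discard this execution environment and run
  // the event on a fresh one.
  if (countOpenFds() > 900) { // threshold is an arbitrary example value
    console.error('Too many open file descriptors, recycling this environment');
    process.exit(1);
  }
  return doBackup(event); // placeholder for the real backup work
};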

The thing that surprised me was that the Lambda runtime has access to the /proc virtual filesystem. I guess that makes sense, as it is running on Linux.
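Because /proc is there, you don't even need to shell out to ls; Node can read the directory itself. A small sketch (not taken from the package):

const fs = require('fs');

// Each entry in /proc/self/fd is a symlink whose target shows what the
// descriptor points at (a file, socket, pipe, ...).
const fdDir = '/proc/self/fd';
for (const fd of fs.readdirSync(fdDir)) {
  try {
    console.log(`${fd} -> ${fs.readlinkSync(`${fdDir}/${fd}`)}`);
  } catch (e) {
    // the descriptor used to read the directory may already be closed
  }
}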

Having found this, I had a play around on my machine to see what information you could get. I found that ls -l on the fd directory returned:

lr-x------ 1 sbx_user1051 990 64 Nov 29 17:46 0 -> /dev/null
l-wx------ 1 sbx_user1051 990 64 Nov 29 17:46 1 -> pipe:[3890]
lrwx------ 1 sbx_user1051 990 64 Nov 29 17:46 10 -> anon_inode:[eventpoll]
lrwx------ 1 sbx_user1051 990 64 Nov 29 17:47 100 -> /tmp/tmp-33-k3p5q3R0D4Fi
lrwx------ 1 sbx_user1051 990 64 Nov 29 17:47 101 -> /tmp/tmp-33-7uD5r8I2FL2J
lrwx------ 1 sbx_user1051 990 64 Nov 29 17:47 102 -> /tmp/tmp-33-x5ZwfUpaOOSp
lrwx------ 1 sbx_user1051 990 64 Nov 29 17:47 103 -> socket:[13171]
lrwx------ 1 sbx_user1051 990 64 Nov 29 17:47 104 -> /tmp/tmp-33-JNd7W44OhPj8
lrwx------ 1 sbx_user1051 990 64 Nov 29 17:47 106 -> socket:[13173]
lrwx------ 1 sbx_user1051 990 64 Nov 29 17:47 108 -> /tmp/tmp-33-I7NKSkv4kw8P
lrwx------ 1 sbx_user1051 990 64 Nov 29 17:47 109 -> /tmp/tmp-33-qymMaVIo1Zga

After finding this we added some more logging to the Lambda to record this listing before the process finished, and what we found was that the number of tmp files kept increasing. We now had our smoking gun.
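The extra logging was nothing sophisticated; it was roughly along these lines (a sketch, not our exact code):

const { execSync } = require('child_process');

// Dump the fd listing to CloudWatch just before the handler finishes,
// so we can compare invocations and see whether the count keeps growing.
function logOpenFds(label) {
  const listing = execSync(`ls -l /proc/${process.pid}/fd`).toString();
  console.log(`[fd-check] ${label}\n${listing}`);
}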

Now that we knew what we were looking for, I checked that our own code wasn't creating temp files, and it wasn't.

Then I started looking at the Cloudant couchbackup library that we use. The first thing I noticed was that their package.json references a package called tmp, so it looks like they use temp files. Next, I looked for references to that package and found:

function apiDefaults() {
  return {
    parallelism: 5,
    bufferSize: 500,
    requestTimeout: 120000,
    log: tmp.fileSync().name,
    resume: false,
    mode: "full",
  };
}

OK, so firstly, does this get called when we use their backup function, and secondly, what does it do? More digging and I found it is aliased (as defaults) in app.js and called in the backup function:

backup: function(srcUrl, targetStream, opts, callback) {
  const listenerErrorIndicator = { errored: false };
  if (typeof callback === 'undefined' && typeof opts === 'function') {
    callback = opts;
    opts = {};
  }
  if (!validateArgs(srcUrl, opts, callback)) {
    // bad args, bail
    return;
  }

  // if there is an error writing to the stream, call the completion
  // callback with the error set
  addEventListener(listenerErrorIndicator, targetStream, 'error', function(err) {
    debug('Error ' + JSON.stringify(err));
    if (callback) callback(err);
  });

  opts = Object.assign({}, defaults(), opts);
Right, so it is called in the path we are using. What does it do? Next, I dug into the tmp package.

/**
 * Synchronous version of file.
 *
 * @param {Options} options
 * @returns {FileSyncObject} object consists of name, fd and removeCallback
 * @throws {Error} if cannot create a file
 */
function fileSync(options) {
  const args = _parseArguments(options),
    opts = args[0];

  const discardOrDetachDescriptor =
    opts.discardDescriptor || opts.detachDescriptor;
  const name = tmpNameSync(opts);
  var fd = fs.openSync(name, CREATE_FLAGS, opts.mode || FILE_MODE);
  /* istanbul ignore else */
  if (opts.discardDescriptor) {
    fs.closeSync(fd);
    fd = undefined;
  }

  return {
    name: name,
    fd: fd,
    removeCallback: _prepareTmpFileRemoveCallback(
      name,
      discardOrDetachDescriptor ? -1 : fd,
      opts,
      true
    ),
  };
}

The first thing I noticed was that it opens the file with fs.openSync, and it only closes that descriptor (fs.closeSync) if you pass in opts.discardDescriptor === true. couchbackup calls fileSync() with no options, so the descriptor stays open. That looks like the issue!!
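You can reproduce the leak with tmp on its own (a quick demonstration on a Linux box; the loop mirrors what apiDefaults() does by keeping only the name):

const fs = require('fs');
const tmp = require('tmp');

const before = fs.readdirSync('/proc/self/fd').length;
for (let i = 0; i < 10; i++) {
  const name = tmp.fileSync().name; // opens the file but never closes the fd
}
const after = fs.readdirSync('/proc/self/fd').length;
console.log(`open fds: ${before} -> ${after}`); // roughly 10 higher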

Lastly, before raising an issue with Cloudant I wanted to test it, so in our couch-backup Lambda I went into the node_modules folder, found the Cloudant code, changed tmp.fileSync().name to tmp.fileSync({ discardDescriptor: true }).name and pushed those changes to production.
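The change in node_modules was literally the one default shown earlier:

// before
log: tmp.fileSync().name,

// after: close the descriptor immediately, we only need the file name
log: tmp.fileSync({ discardDescriptor: true }).name,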

The next day this was the graph from Lambda Insights:

(Lambda Insights graph: file descriptor count after the fix)

Yay!!! We have found the problem.

The last thing was raising an issue with Cloudant (File descriptor leak · Issue #535 · cloudant/couchbackup, github.com), and they are implementing what I suggested (fix file descriptor leak in backup by emlaver · Pull Request #536 · cloudant/couchbackup, github.com).