The code above uses an extra vector<bool> to track which elements have already been used. An in-place approach avoids this and saves the memory that vector would otherwise take.

We also only need to iterate over indices from the current index onward:

    for (int i = begin; i < num.size(); i++) {
        dfs(num, begin + 1, now);
    }

So the final code is simpler and performs better:

            for (int i = begin; i < num.size(); i++) {
                // in place: move the current element to position begin
                swap(num[begin], num[i]);
                dfs(num, begin + 1, now);
                // reset back
                swap(num[begin], num[i]);
            }
        }
    };
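Since the snippet above only shows the loop, here is a minimal self-contained sketch of how the swap-based DFS could fit together. The class name Solution, the entry point permute, the base case, and the result parameter (standing in for the third argument, called now above) are assumptions made for illustration, not the original code.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Sketch only: Solution, permute and result are hypothetical names.
class Solution {
public:
    // Takes num by value so the caller's vector is left untouched.
    std::vector<std::vector<int>> permute(std::vector<int> num) {
        std::vector<std::vector<int>> result;
        dfs(num, 0, result);
        return result;
    }

private:
    void dfs(std::vector<int>& num, std::size_t begin,
             std::vector<std::vector<int>>& result) {
        // Assumed base case: every position is fixed, so num holds
        // one complete permutation.
        if (begin >= num.size()) {
            result.push_back(num);
            return;
        }
        for (std::size_t i = begin; i < num.size(); i++) {
            // Move element i into position begin, in place.
            std::swap(num[begin], num[i]);
            dfs(num, begin + 1, result);
            // Reset back so the next iteration sees the original order.
            std::swap(num[begin], num[i]);
        }
    }
};
```

Calling Solution().permute({1, 2, 3}) should produce all six orderings. Positions before begin are already fixed, so swapping only with indices i >= begin is what replaces the vector<bool> bookkeeping. Note that this sketch assumes distinct input values; duplicates would yield repeated permutations.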