Inquiry About Limitation Of File System

I have a website with millions of pages. We usually deploy our websites on CentOS with Nginx and Apache HTTP Server as the web server. I am worried about web server performance when the number of pages is very large, and I wonder whether performance will suffer if there are too many files and directories on the server. Our budget is also limited: we do not want to deploy many servers or switch to another file system such as GFS. So I would like to know the limits of the CentOS file systems and how I should handle this issue.

8 thoughts on “Inquiry About Limitation Of File System”

  • does ‘millions of pages’ also mean ‘millions of files on the file system’?

    just a hint – this has nothing to do with any particular file system, as it is universal:
    e.g. when you have 10,000 files, don’t store them all in one folder; create 100 folders with 100 files in each (a rough sketch follows below);

    there is no file system that handles millions of files in a single folder well, especially with limited resources (e.g. RAM)
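
    A rough sketch of that idea in Python (the folder count and the CRC-based bucket choice are just placeholders, not anything CentOS-specific):

    import os
    import zlib

    def bucket_path(base_dir, filename, buckets=100):
        """Pick a subfolder for a file so that no single directory grows too large."""
        # crc32 is stable across runs (unlike Python's built-in hash(), which is salted)
        bucket = zlib.crc32(filename.encode("utf-8")) % buckets
        subdir = os.path.join(base_dir, "%03d" % bucket)   # e.g. data/042/
        os.makedirs(subdir, exist_ok=True)
        return os.path.join(subdir, filename)

    # 10,000 files named this way spread over ~100 subfolders of roughly 100 files each
    print(bucket_path("data", "page-12345.html"))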

  • Thank you for your hint. I do mean that I am planning to store millions of files on the file system. May I ask, then, what is the maximum number of files that can be stored in one directory without affecting the performance of the web server?

    At 2018-11-03 16:03:56, “Walter H.” wrote:

  • There is no simple answer to that. It will depend on everything from the physical drives used, the hardware that connects the motherboard to the drives, the size of the cache and the type of CPU in the system, any low-level storage choices (software/hardware RAID, RAID type, RAID redundancy, etc.), the type of file system, the size of the files, the layout of the directory structure, and the metadata attached to those files that needs to be checked.

    Any one of those can partially affect the performance of the web server, and combinations of them can severely affect it. This means a lot of benchmarking of the hardware and OS is needed to get an idea of whether tuning the number of files per directory will make things better or not. I have seen many systems where the hardware worked better with a certain type of RAID and it didn’t matter whether you had 10,000 or 100 files in each directory: the changes in performance were minimal, while moving from RAID10 to RAID6 (or vice versa), or adding more cache to the hardware controller, sped things up much more.

    Assuming you have tuned all of that, then the number of files per directory comes down to a ‘gut’ check. I have seen people use some power of 2 per directory, but rarely go over 1024. If you do a 3-level double-hex tree <[0-f][0-f]>/<[0-f][0-f]>/<[0-f][0-f]>/ and lay the files out using some sort of file-hash method, you can easily fit 256 files in each directory and hold 2^32 files in total. You will probably end up with some hot spots depending on the hash method, so it would be good to test that first.
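
    A rough sketch of that layout in Python (MD5 here is only an example; any reasonably uniform hash would do, and the paths are made up):

    import hashlib
    import os

    def hashed_path(base_dir, name):
        """Map a name onto a 3-level <hh>/<hh>/<hh>/ tree using the first 6 hex digits of its hash."""
        # 256^3 directories at 256 files each gives room for 2^32 files
        digest = hashlib.md5(name.encode("utf-8")).hexdigest()
        subdir = os.path.join(base_dir, digest[0:2], digest[2:4], digest[4:6])
        os.makedirs(subdir, exist_ok=True)
        return os.path.join(subdir, digest)

    print(hashed_path("pages", "some-article-title.html"))
    # -> pages/xx/yy/zz/<full hash>, where xx/yy/zz are the first 6 hex digits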

  • There are hard limits in each file system.

    For ext4, there is no per-directory limit, but there is an upper limit on the total number of files (inodes, really) per file system: 2^32 – 1 (4,294,967,295). XFS also has no per-directory limit, and has an inode limit of 2^64 (18,446,744,073,709,551,616).

    If you are using ext2 or ext3, I think the limit per directory is around 10,000, and you start seeing heavy performance issues beyond that. Don’t use them. Now, file system limits aside, software that tries to read directories containing huge numbers of files is going to have performance issues. I/O operations, memory limitations and time are going to be bottlenecks for web operations. You really need to reconsider how you want to serve these pages.
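
    If you want to see how close a file system is to its inode limit, running df -i reports it; the same numbers are available from a short Python check (the path below is only an example):

    import os

    # Total and free inode counts for the file system holding /var/www
    st = os.statvfs("/var/www")
    used = st.f_files - st.f_ffree
    print("inodes used: {:,} of {:,} ({:.1f}%)".format(
        used, st.f_files, 100.0 * used / max(st.f_files, 1)))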

    Jonathan Billings

  • Thank you for your advice. I understand the issue depends on a lot of factors. Would you please give me some detailed information about how to tune these parameters, such as the size of the cache and the type of CPU? I am not very familiar with these details.

    At 2018-11-03 22:39:55, “Stephen John Smoogen” wrote:

  • Depending on the nature of these “millions of files”, you may want to consider a database-backed application rather than simply dumping the files into a directory tree of some kind. I assume that you’ll have to index your files in some way to make all of these web pages useful, so a database might be what you want instead of a simple heap o’ HTML files.

    Then you won’t be dealing with millions of files. A properly constructed database can be very efficient; a lot of very smart people have put a lot of thought into making it so.
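
    A minimal sketch of that idea (SQLite is used only as a stand-in here, and the table and column names are invented for illustration): pages become indexed rows instead of individual files.

    import sqlite3

    conn = sqlite3.connect("pages.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS pages (
            path  TEXT PRIMARY KEY,   -- the URL path the web server will look up
            title TEXT,
            body  BLOB                -- the rendered HTML (or source content)
        )
    """)

    # Store a page
    conn.execute("INSERT OR REPLACE INTO pages (path, title, body) VALUES (?, ?, ?)",
                 ("/articles/example", "Example article", b"<html>...</html>"))
    conn.commit()

    # Serve a page: one indexed lookup instead of walking a huge directory tree
    row = conn.execute("SELECT body FROM pages WHERE path = ?",
                       ("/articles/example",)).fetchone()
    print(row[0] if row else "404")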

  • Just to be pedantic, it’s only the situation Jonathan described that would be a performance problem. Typically, a web server doesn’t need to read the whole directory in order to retrieve a file and send it back to a client, so that wouldn’t necessarily be a performance issue. But having too many files in one directory would impact other operations that might be important, like backups, finding files, or most other bulk file operations, and those would in turn affect other processes, including the web server. (And if the web server is generating directory listings on the fly, that would be a huge performance problem.)

    And as others have mentioned, this issue isn’t filesystem-specific. There are ways to work around some of these issues, but in general it’s better to avoid them in the first place.

    The typical ways of working around this issue are storing the files in a hashed directory tree, or storing the files as blobs in a database. There are lots of tools to help with either approach.

    –keith

  • With XFS on modern CentOS systems, you probably don’t need to worry:
    https://www.youtube.com/watch?v=FegjLbCnoBw

    For older systems, as best I understand it: As the directory tree grows, the answer to your question depends on how many entries are in the directories, how deep the directory structure is, and how random the access pattern is.  Ultimately, you want to minimize the number of individual disk reads required.

    Directories with lots of entries are one situation where you may see performance degrade. Typically, around the time the directory grows larger than the maximum size of the direct block list [1] (48k – with 4 KiB blocks, that is the 12 direct pointers × 4 KiB), reading the directory starts to take a little longer. After the maximum size of the single indirect block list (4MB, i.e. 1024 block pointers × 4 KiB), it will tend to get slower again.
    File names impact directory size, so average filename length factors in, as well as the number of files.

    A given file lookup will need to read each of the parent directories to locate the next item in the path. If your path is very deep, then your directories are likely to be smaller on average, but you’re increasing the number of lookups required for the parent directories in order to reduce the length of each block list. It might make your worst case better, but your best case is probably worse.

    The system’s cache means that accessing a few files in a large structure is not as expensive as accessing random files in a large structure. If you have a large structure, but users tend to access mostly the same files at any given time, then the system won’t be reading the disk for every lookup.
    If accesses aren’t random, then structure size becomes less important.

    Hashed-name directory structures have been mentioned, and those can be useful if you have a very large number of objects to store and they all have the same permission set. A hashed-name structure typically requires that you store in a database a map between the original names (that users see) and the names’ hashes; a rough sketch of such a map appears after the footnote below. You could hash each name at lookup time, but that doesn’t give you a good mechanism for dealing with collisions. Hashed-name directory structures typically have worse best-case performance due to the extra lookup, but they offer predictable and even growth in lookup times. Where a free-form directory structure might have a large difference between the best-case and worst-case lookup, a hashed-name directory structure should give roughly the same access time for all files.

    1: https://en.wikipedia.org/wiki/Inode_pointer_structure
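
    To make that last point concrete, here is a rough sketch of such a name map (SQLite and the column names are assumptions for illustration, not anything from the thread). Keeping the original name next to the stored hash, with a UNIQUE constraint on the hash, is what lets a collision surface as an error instead of two names silently sharing one file:

    import hashlib
    import os
    import sqlite3

    conn = sqlite3.connect("namemap.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS name_map (
            original_name TEXT PRIMARY KEY,      -- the name users see
            stored_name   TEXT NOT NULL UNIQUE   -- hashed name used on disk; UNIQUE flags collisions
        )
    """)

    def stored_path(base_dir, original_name):
        """Return the on-disk path for a user-visible name, recording the mapping."""
        row = conn.execute("SELECT stored_name FROM name_map WHERE original_name = ?",
                           (original_name,)).fetchone()
        if row:
            stored = row[0]
        else:
            stored = hashlib.sha256(original_name.encode("utf-8")).hexdigest()
            # Raises sqlite3.IntegrityError if a different name already hashed to this value
            conn.execute("INSERT INTO name_map (original_name, stored_name) VALUES (?, ?)",
                         (original_name, stored))
            conn.commit()
        # Reuse the small 2-char/2-char/2-char tree so each directory stays small
        return os.path.join(base_dir, stored[0:2], stored[2:4], stored[4:6], stored)

    print(stored_path("files", "report-2018-11.html"))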