Wednesday, December 10, 2008

Disk data - a summary

Here is a summary of Disk Data tables in MySQL Cluster.

Data node configuration:
  • SharedGlobalMemory=384M
    RAM from the shared global memory is used for the UNDO_BUFFER when you create the log file group.
    In the configuration generated by severalnines.com/config, you have to uncomment SharedGlobalMemory in mysqlcluster-XYZ/cluster/config/config.ini before you start the cluster.
  • DiskPageBufferMemory=3072MB
    If you rely heavily on disk data, we recommend setting this as high as possible.
    In the configuration generated by severalnines.com/config, you have to uncomment DiskPageBufferMemory in mysqlcluster-63/cluster/config/config.ini before you start the cluster.
    The DiskPageBufferMemory should be set to:
    DiskPageBufferMemory=TOTAL_RAM - 1GB (OS) - 1200MB (approx size used for buffers etc in the data nodes) - DataMemory - IndexMemory
    Expect to do some trial and error before getting this right.
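As a sanity check, the sizing rule above can be expressed as a small shell calculation. The numbers below are example values for an 8GB machine, not recommendations; the 1GB (OS) and 1200MB (data node buffers) figures are the approximations from the formula above:

```shell
#!/bin/sh
# Sketch of the DiskPageBufferMemory sizing rule above.
# Example values only -- adjust to your machine.
TOTAL_RAM_MB=8192      # 8GB machine
DATA_MEMORY_MB=4096    # DataMemory
INDEX_MEMORY_MB=512    # IndexMemory

# TOTAL_RAM - 1GB (OS) - 1200MB (data node buffers) - DataMemory - IndexMemory
DISK_PAGE_BUFFER_MB=$((TOTAL_RAM_MB - 1024 - 1200 - DATA_MEMORY_MB - INDEX_MEMORY_MB))
echo "DiskPageBufferMemory=${DISK_PAGE_BUFFER_MB}M"
```

If the result comes out near zero or negative (as one commenter below found), you have no RAM left for the page buffer and need to either add RAM or shrink DataMemory.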
Creating the storage:
  • LOGFILE GROUP
    CREATE LOGFILE GROUP lg ADD UNDOFILE 'undo1.dat' INITIAL_SIZE=20G UNDO_BUFFER_SIZE=128M ENGINE=NDB;

    You can only have one LOGFILE GROUP. A 20GB undo log is just an example, but don't be shy with it if you have the space. Theoretically, you could need as much UNDO log as you have table space. If you have a 10GB table space, then start with an UNDO log that is 10% of that, and add more UNDO LOG FILES if you need to, i.e., if you run into problems. UNDO_BUFFER_SIZE=128M should be sufficient for most applications (but you must have set SharedGlobalMemory=384M). This buffer cannot grow dynamically or be extended later :(
  • UNDO LOG FILES
    You can add more undo files (online) by doing:
    ALTER LOGFILE GROUP lg ADD UNDOFILE 'undo_2.dat' INITIAL_SIZE=256M ENGINE=NDB;
  • TABLESPACE
    CREATE TABLESPACE ts_1 ADD DATAFILE 'data1.dat' USE LOGFILE GROUP lg INITIAL_SIZE=256M ENGINE=NDB;
  • DATAFILES
    In 7.0 you can use DiskIOThreadPool, which makes it possible to have X threads operating on one file, so you don't have to create many data files. Of course, for some applications it may still be good to have more files if you want to spread the I/O out over several physical disks.
    In 6.3 it is recommended to have many but smaller data files. Here is a script that generates SQL (ALTER TABLESPACE ... ADD DATAFILE ...) to create a number of data files. Run it as:
    sh ts-gen.sh 100 > alter_add_ts.sql
    and you will get a file with 99 SQL statements that add data files.
    From the mysql client you can then do:
    mysql> source alter_add_ts.sql
    and a lot of datafiles will be added to the ts_1 tablespace.
    Please note that you can change the file size and path if you wish.
  • Having more than 50-60 data files does not really gain you anything. The reason for having several data files instead of one is that more internal file system threads can operate in "parallel" on the data files, instead of having to be synchronized on one.
  • You grow the table space by adding data files.
  • Adding data files is an online operation if you wish to extend your table space.
  • There is no auto-extend yet of table spaces.
  • You can create many tablespaces, each having many data files.
  • Leave EXTENT_SIZE as it is. We have found no reason yet to change it.
  • Example CREATE TABLE:
    CREATE TABLE t1 (a INTEGER PRIMARY KEY, b VARBINARY(2048)) ENGINE=NDB TABLESPACE ts_1 STORAGE DISK;
    The column 'b' will be stored on disk. The column 'a' will be stored in DataMemory together with the Ordered Index on 'a' (which comes from the PRIMARY KEY). The PRIMARY KEY hash index on 'a' will use IndexMemory.
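The ts-gen.sh script mentioned in the DATAFILES item above was linked from the original post and is not reproduced here. A minimal sketch of what such a generator could look like follows; the tablespace name ts_1, the data_N.dat file names, and the 256M size are assumptions you can change, as noted above:

```shell
#!/bin/sh
# ts-gen.sh -- generate ALTER TABLESPACE statements that add data files
# to the ts_1 tablespace.
# Usage: sh ts-gen.sh 100 > alter_add_ts.sql
# Assumes the first data file was created together with the tablespace,
# so for an argument of N it emits N-1 statements (data_2.dat .. data_N.dat).
COUNT=${1:-100}
i=2
while [ "$i" -le "$COUNT" ]; do
  echo "ALTER TABLESPACE ts_1 ADD DATAFILE 'data_$i.dat' INITIAL_SIZE=256M ENGINE=NDB;"
  i=$((i + 1))
done
```

Piping the output through the mysql client (mysql> source alter_add_ts.sql) then adds the data files one by one, which is an online operation.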

Disk space and disk utilization:
  • Indexed attributes are stored in RAM (DataMemory). For non-indexed attributes, however, you can specify whether they should be stored in RAM or on disk when you create the table (see the reference manual on the column attributes STORAGE DISK and STORAGE MEMORY).
  • In MySQL Cluster 6.2, 6.3, and 7.0, disk data attributes are always stored as fixed size (e.g., a VARCHAR(256) will always use 256B on disk, even if it is NULL - this will change in 7.1, hopefully).
  • BLOB/TEXTs are special and stored in segments. The first 256B are stored in RAM; the rest is stored in segments. The segment size depends on whether it is a BLOB, MEDIUMBLOB, or LONGBLOB and varies from 2048B up to 8192B, iirc.
  • NULLs stored in a BLOB/TEXT occupy 0B on disk, but have a 2B overhead in DataMemory.
  • The data files are divided into extents. Each extent is 1024K (1MB). There is no reason to change the extent size. Many records may reside in one extent.
  • If you write something to a disk data table, the data is written to as many replicas as you have NoOfReplicas set to. Thus, typically you will have two copies of the data (one on each node in a node group).
  • The information_schema contains information about how much of the table space and undo files is used, e.g.:
    select free_extents, total_extents from information_schema.files where file_type='datafile';
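To turn those extent counts into actual sizes, you can multiply by EXTENT_SIZE, which information_schema.FILES reports in bytes. A sketch:

```sql
-- Free and total size per data file, in MB.
SELECT FILE_NAME,
       FREE_EXTENTS  * EXTENT_SIZE / 1024 / 1024 AS free_mb,
       TOTAL_EXTENTS * EXTENT_SIZE / 1024 / 1024 AS total_mb
  FROM information_schema.FILES
 WHERE FILE_TYPE = 'DATAFILE';
```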

Short on disk data internals:

Thanks jonas for your input on this!
  • The DiskPageBufferMemory is a LIRS cache (variant of the LRU cache)
  • Dirty pages are flushed to the table space during an LCP (local checkpoint), but also if they have been inactive for some time (I have no info yet on the inactivity time).
  • Changes to disk data are REDO logged (the same way as for in-memory tables).
  • Changes to disk data are also written to a circular UNDO log (write-ahead logging, WAL). This is so that we can undo all the way back to the start of the last completed LCP. This way, we can get to a state where memory and disk are in sync (i.e. references between disk data and in-memory storage are correct) before we start running the REDO log.
  • If a record exists in the DiskPageBufferMemory, it will be returned from there to the application.
  • If a record does not exist in the DiskPageBufferMemory, it has to be fetched from disk. In case there are no free pages in the DiskPageBufferMemory, the LRU algorithm will sync dirty pages to disk in order to free up pages.
Performance

What to use disk data for?:
  • Storing e.g. inactive subscribers in a subscriber database (when activated, you can move them into RAM). If an activated user has been inactive for too long, it can be moved back to disk data storage. This is something you have to write in the application logic.
  • Storing e.g. SMS/MMS messages
  • Logging data
  • Large feeds coming in that you need to persist, but not necessarily access in real-time (i.e., <3ms).
  • Probably more..
Suitable access patterns:
  • You should not use data stored in disk data tables for realtime applications. If you do random I/O, your application will only be as fast as the disk.
  • Use simple access patterns - avoid joins on disk data. See it as an extension to RAM where you store data that you don't need frequent access to.
  • If you want frequent access to the data, you should have a BIG DiskPageBufferMemory (just as in InnoDB, where the innodb_buffer_pool is very important to performance). It is the same here.
  • Range searches will be expensive, as disk data is not stored according to the order of a particular index. It is more or less random where data is stored. I will get back with numbers on this.
  • If you use SSD for disk data, then the above might be more or less true. Disk data has not been tested with SSDs yet.
Caveats:
  • Check my earlier blog post for caveats regarding deallocation/allocation of extents.
  • If you restart the cluster with --initial, you have to delete all the data files and undo log files manually; they are not cleared by --initial.

29 comments:

Unknown said...

Hi Johan,

I was searching on disk based tables in MySQL cluster and came across your blog.

You have mentioned that-

"In MySQL Cluster 6.2 and 6.3 disk data attributes are always stored as fixed size (e.g a VARCHAR(256) will always use 256B on disk, even if it is NULL - this will change in 6.4)"

so the current version of mysql cluster 7.0 can store variable sized data on disk and thus saving a lot of disk space?

Thanks,
Tarandeep

Unknown said...

Very useful blog.
I have a question about the DiskPageBufferMemory.

I have 8GB RAM and in the config.ini (created with configurator), I have
DataMemory=5400M
IndexMemory=675M

If I apply this
DiskPageBufferMemory=TOTAL_RAM - 1GB (OS) - 1200MB (approx size used for buffers etc in the data nodes) - DataMemory - IndexMemory

the result is practically equal to 0.

Could you please explain this situation?
Thank you very much.
Regards.

Johan Andersson said...

@tarandeep - unfortunately, it did not make it into 7.0, so it is coming later. Presumably 7.1. Thanks for pointing this out. I will update the blog post.

Johan Andersson said...

@antonio - If you are using all of the DataMemory (5400MB), and are relying on disk data, then you should get more RAM in the box. The DiskPageBufferMemory is very important for read/write performance, and having a big one helps a lot (if you rely on disk data a lot).
If you haven't used all the DataMemory and cannot buy more RAM, then you should "move" DataMemory -> DiskPageBufferMemory.

Unknown said...

Hi Johan,
to successfully restart the node I had to decrease DataMemory=5400M to DataMemory=4000M and DiskPageBufferMemory=3072M to DiskPageBufferMemory=1400M.
If DiskPageBufferMemory is really so important, probably I'll "move" another 1GB to it, also because it is easier for me to monitor Data usage and Index usage than DiskPageBufferMemory.
Thanks,
Regards.

Antonio

Unknown said...

Hi,

I am trying to store huge amount of data (around TB) in mysql cluster 7.0.6

My data is actually a key,value pair. Initially I was thinking of using disk based tables with BLOB data type for the value. However, BLOB uses 256 bytes in memory and remaining is stored on disk.

So to store 300M records, I would need- 256*300*1000000 bytes = 72G
Plus due to replication factor (=2), the RAM requirement is doubled = 154G

I have 4 data nodes (each having 32G) but as computed above, I would need around 43G of RAM. And additional RAM would be required for internal buffers, INDEX memory etc.

So is there any alternative to BLOB, so mysql does not use 256 bytes per record?

I would like to just have my primary key columns to be in memory and everything on disk. I am not that worried about fetching and serving the data from the disk. I am more worried about if it is possible to use MySql cluster for this kind of job or not.

Thanks,
Tarandeep

Johan Andersson said...

Hi Tarandeep!

How big are your BLOBs?

If the BLOBs are small (< ~8000B) then you can do like this:

1) The maximum record size in cluster is 8052B (BLOB/TEXTs are not in this limit, except the first 256B).

2) If the PK is a BIGINT, it will cost 8B (storage size of a BIGINT).

Then you can use 8052B - 8B = 8044B that you can store for the blob data.

So, if your BLOBs are <=8044B, then you can use a VARBINARY instead of a BLOB.

BR
johan

Unknown said...

Thanks Johan for your reply!
My blobs are bigger than 8000 bytes. So I guess for the time being I have to live with this limitation of MySql Cluster.

When can we expect next release of MySql Cluster?

Thanks,
Tarandeep

Matthew Montgomery said...

Johan,

How did you come up with that 10% of tablespace size for logfile size? Is it just an educated guess?

Ronald said...

i have 4 nodes/sql each have 8gb of ram, i get Node 2 to 5: Data usage increased to 94%(2409 32K pages of total 2560)

is this right?

i have set 5154M DataMemory
128M SharedGlobalMemory
64M DiskPageBuffer
427M IndexMemory

is this right?

Johan Andersson said...

Hi Ronald,

did you do a rolling restart after increasing the datamemory in config.ini?

It seems that you are still on the default DataMemory which is 80MB.


BR
johan

Rajnish Kumar said...

Hi All,
i wanna configure all mysql clustering data i.e ndb tables store on disk , is it possible ? , so that we can store all data into disk .so that when i restart the cluster we can recover the data.

Johan Andersson said...

Mr Kumar,

It is possible.
All tables in MySQL Cluster are checkpointed and redo logged. This means you can restart cluster and the data is there.

You can also have disk data tables, if the data set is larger than the amount of RAM. Then you should create a tablespace as outlined in the blog post.

BR
johan

Unknown said...

Hi Johan,

I have question regarding Disk Based Custer.

We have 2 Data Nodes with 24 GB RAM on each 2x4 Core Processor.

I want to know that can we use this 2 Data Nodes with Data Disk Cluster for 60 GB Database. If yes, can you please provide me config.ini for the same. I will be very thankful to you.

Thanks
Mohammed Munazir

Johan Andersson said...

Hi Mr Munazir,

Please contact me on johan@severalnines.com and we can sort this out.


Thanks,
johan

Amit Gera said...

Hi John,
Please help me out to understand issue listed below:
i have mysql cluster 7.1.18 version and i have 2 data node
i am using datafile to store data on disk but when i fire command all report memory on ndb_mgm client
then the output shows data usage is 50% on one data node and 5% on the second data node. Why is there such a difference?

Johan Andersson said...

Hi Amit,
Sorry, I missed this comment. Did you resolve it?
I would check with
ndb_desc -p
and check how the tables are being partitioned.

Did you use any partitioning function for the tables?

-j

burak said...

Hi,

I have 48GB of RAM for the data node. Despite the following parameters in the config.ini it gets swapped to about 11GB.

[ndbd default]
### REDO log params ###
FragmentLogFileSize=256M
InitFragmentLogFiles=SPARSE
NoOfFragmentLogFiles=152
RedoBuffer=128M
RedoOverCommitCounter=3
RedoOverCommitLimit=20

[ndbd]
DataMemory=27648M
IndexMemory=9216M
BackupMemory=1024M

SharedGlobalMemory=384M
DiskPageBufferMemory=4096M

Anonymous said...

Hi Johan, I had asked this question in mysql cluster forums but didn't see any reply. However got to your blog, perhaps you could provide me some advice.

I'm asked to setup MySQL Cluster for an application with read vs. write ratio of 80:20. Set-up should be highly available and read intensive.

We have the CGE of mysql cluster with 6 data nodes, planning to setup in 2 node groups with 3 replicas for higher redundancy and 2 Mysql nodes which will house the management nodes as well. Each data nodes comes with 96G memory and storage for all data nodes comes from SAN with 96 disks mounted via FC. We haven't done any RAID yet though it is obvious to setup RAID 1+0 but wanted to get some help and not overlook some thing very obvious.

I have already read the Evaluation Guide for 7.2GA release, in that High-End setup for disk based tables meets our setup, as we plan on housing table in both memory and disk.

As per the evaluation guide, i see mysql has recommended to house undolog, redolog/LCP, tablespaces(n) all in separate RAID sets. With 96 SAS 15K RPM disks available in Storage, I'm leaning towards creating following raid sets. Any comments or improvements is appreciated.

1. Undolog - 6+6 RAID 1+0
2. Redolog/LCP - 6+6 RAID 1+0
3. Tablespace1 - 6+6 RAID 1+0, x 6 (one raid set for each data node)

Thanks,
Stalin

nick said...

I have cluster 7.2.7 and i was shocked when my 1GB tablespace filled up way before it should have. Then I did the math and realized that disk data is STILL fixed size. I have a VARCHAR(3000) column that rarely reaches 500 characters but still needs to accommodate up to 3000 characters. This goes for many other fields. This is such a counterproductive way of storing data. When can we expect dynamic sized fields?

Johan Andersson said...

Hi Nick,
yes, it is quite sad. I don't know, but perhaps you can email cluster@lists.mysql.com and ask there.

Best regards,
Johan

Johan Andersson said...

Stalin,

That sounds like a good setup. Not sure what your app is doing really, but often you can get by with fewer disks; the separation of LCP/REDO/UNDO/tablespace is fine.

Best regards,
j

Johan Andersson said...

burak,

what do you mean swapped?

Thanks,
Johan

Anonymous said...

Hello Johan,

I have the following doubt about disk data based tables in MySQL-Cluster 7.2.8 and I dont know if you could help me.

I got the following tables:

Table 1: event (all in memory)

id int PK
timestamp date INDEX
...


Table 2: data (some columns in memory, some of them on disk)

event_id int PK
payload varchar ST MEMORY
extra1 varchar ST DISK
extra2 varchar ST DISK
extra3 varchar ST DISK
...


When I launch the following query it lasts 20 seconds and is ok (pushed join@1):

SELECT SQL_CALC_FOUND_ROWS event.*,data.payload FROM event,data WHERE ( event.id=data.event_id AND (((data.payload LIKE '%foo%' OR data.payload LIKE '%666f6f%'))) ) AND ( timestamp >='2012-11-10 00:00:00' ) ORDER BY timestamp DESC LIMIT 0, 11;

If I change the resultset from "data.payload" to "data.*" to retrieve all columns from data table the query never ends, so I guess that before checking the where clause it retrieve all columns from disk from all rows that matches the join. I dont understand this behaviour because all columns involved in the where clause are stored in memory, and maybe these ones should be used before fetching disk columns that are only used in the resultset.

Is this the normal behaviour?

Thanks,

Johan Andersson said...

Hi
It sounds excessive.

If you can paste the EXPLAIN of the query here for both cases that would be great, or you can email the EXPLAIN to johan@severalnines.com and I can have a look.

How long does the query take in case 1, and does it really never terminate in case 2?


BR
Johan

Unknown said...

Hello,

thanks for your answer.

EXPLAIN is the same for both queries:

+----+-------------+------------+--------+-------------------+-----------+---------+-------------------------------+---------+---------------------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+--------+-------------------+-----------+---------+-------------------------------+---------+---------------------------------------------------------------------------+
| 1 | SIMPLE | event | range | PRIMARY,timestamp | timestamp | 8 | NULL | 2035246 | Parent of 2 pushed join@1; Using where with pushed condition |
| 1 | SIMPLE | data | eq_ref | PRIMARY | PRIMARY | 16 | event.id | 1 | Child of 'event' in pushed join@1; Using where with pushed condition |
+----+-------------+------------+--------+-------------------+-----------+---------+-------------------------------+---------+---------------------------------------------------------------------------+
2 rows in set (0.01 sec)

With never ends I mean that the query is stopped with the message "ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction" after a lot of time (approx. 40 min later). I tried using LEFT JOIN but the result is the same. The parameter DiskPageBuffer is set to 3GB.

Thanks,

Johan Andersson said...

If you project any non-indexed columns, then you have to hit the disk to fetch the data.

"data.payload" is stored in-memory.
"data.*" --> some columns are on disk.

So with data.*, and the fact that your query hits 2M records (according to the explain), then since data.* contains columns that are on disk, this is expected.

Better is to first execute a query that only hits in-memory columns, and then issue new queries to fetch the records you are really interested in (those that you get with the LIMIT).

Best regards
Johan