Very slow "Integrat...
 
Share:
Notifications
Clear all

7th December 2023:  added payment option Alipay to purchase Astro Pixel Processor from China, Hong Kong, Macau, Taiwan, Korea, Japan and other countries where Alipay is used.

Very slow "Integrating pixels" during Integration

23 Posts
4 Users
1 Like
207 Views
(@digug)
White Dwarf
Joined: 2 years ago
Posts: 14
Topic starter  

APP 2.0.0-beta22

I have a 13900K and 64 GB of DDR5 memory on Windows 11. I loaded 419 light frames. Everything seems to run fine until it reaches the pixel integration step. My data is stored on a 10GbE NAS, but I am only seeing around 380 Mbps received in Task Manager, even though during previous steps, when APP was reading the files, it would exceed 1 GB/s. My NAS is a TrueNAS server, so once files are read from it they are cached in RAM on the server, which is how I can get such high transfer rates on a 10Gb link. It is still running after about 2 hours. I reset all settings to their defaults before starting the session this morning. I don't think putting the data on an SSD will help much with this issue, since APP is reading data from my NAS at less than 1 Gbps (~120 MB/s).
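For reference, since Task Manager reports bits per second while disk speeds are usually quoted in bytes per second, here is the simple conversion (a trivial sketch; the 10GbE figure ignores protocol overhead):

```python
def mbps_to_mb_per_s(mbps):
    """Convert megabits per second to megabytes per second."""
    return mbps / 8

print(mbps_to_mb_per_s(380))     # ~47.5 MB/s seen during pixel integration
print(mbps_to_mb_per_s(10_000))  # ~1250 MB/s theoretical ceiling for a 10GbE link
```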


   
(@vincent-mod)
Universe Admin
Joined: 6 years ago
Posts: 5708
 

Reading from an external source is likely not very efficient for APP; it's very disk-intensive during processing and not really optimized to work over a LAN.


   
(@mabula-admin)
Universe Admin
Joined: 7 years ago
Posts: 4120
 

Hi @digug,

Okay, I think you need to have a look at how many GB the stack needs; 400+ lights will probably be many GB. APP reports this at the top of the image viewer when integrating.

Then look at the actual cache size of your NAS. I suspect the cache is much smaller than that?

APP integrates pixel by pixel using pixelstacks (it can't be done differently), and each time it reads part of the images stored in the file mapper, which in this case is on your NAS. So for each batch of pixels that APP processes, all images need to be consulted again and again.

So most likely the connection with your NAS is a bottleneck, but the main bottleneck will be the cache itself being too small, and then the I/O read/write speed of the HDDs in your NAS. I agree that using SSDs in your NAS will not help, because of the 2 other bottlenecks.
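To make that access pattern concrete, here is a minimal sketch (Python, with a made-up on-disk layout and function name, not APP's actual code) of a batched pixel-stack integration. The point is that every batch of pixels re-reads every frame, so the per-call I/O latency is paid roughly frames × batches times:

```python
import numpy as np

def integrate_in_batches(frame_paths, n_pixels, batch=1000):
    """Median-combine registered, normalized frames one pixel batch at a time.

    Illustrative only: assumes each file holds one frame as a flat float32 array.
    Every batch re-reads every frame at that batch's offset, so the number of
    I/O calls is about len(frame_paths) * n_pixels / batch, which is why
    per-call latency (local SSD vs. NAS) dominates the run time.
    """
    result = np.empty(n_pixels, dtype=np.float32)
    for start in range(0, n_pixels, batch):
        stop = min(start + batch, n_pixels)
        stack = np.empty((len(frame_paths), stop - start), dtype=np.float32)
        for i, path in enumerate(frame_paths):
            with open(path, "rb") as f:   # one seek + read per frame, per batch
                f.seek(start * 4)         # 4 bytes per float32 pixel
                stack[i] = np.frombuffer(f.read((stop - start) * 4), dtype=np.float32)
        result[start:stop] = np.median(stack, axis=0)  # the pixel-stack integration
    return result
```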

Mabula


   
(@digug)
White Dwarf
Joined: 2 years ago
Posts: 14
Topic starter  

@mabula-admin I forgot to load that 419-sub stack and instead loaded a different stack of 120 subs. The space required shown at the top says 55.6 GB, and my TrueNAS's ZFS cache has 54.6/64 GB used. I'm waiting to see how long it takes. It's kind of hard to believe 10GbE would be a bottleneck, considering it is capable of 1200 MB/s. I have a 4x1TB stripe array of 10k RPM WD Raptors. Perhaps the network protocols would be a bottleneck? Obviously local SSD storage would be the best choice, but 4TB SSDs aren't exactly cheap. Plus, considering the amount of writes generated by APP, I'd be wary of the SSD's lifespan being shortened over time. Newer SSDs are supposedly rated for 600TB written before failure, but again, they are quite expensive at the moment. I will report back once this stack finishes.


   
(@digug)
White Dwarf
Joined: 2 years ago
Posts: 14
Topic starter  

It finished pretty quickly, but it was only 120 subs. I'll try the 419-sub stack later and time it.


   
(@mabula-admin)
Universe Admin
Joined: 7 years ago
Posts: 4120
 

Dear George,

Okay, that is some nice cache indeed, that is great.

I would suspect that if the stack size is bigger than your cache, the performance will start to degrade. Yes, a 10Gb network is fast, but no match for local SSDs with the newest interfaces, of course. It should still work at a decent speed, I would think.

Another factor to consider is how the 4x1TB stripe array is set up on the NAS; normally it would have a fail-safe setup that tolerates at least one of the HDDs failing, and this always reduces performance to keep your data safe.

Anyway, maybe you can share how the 419-sub stack went and how it compared to the 120-sub stack in time?

Mabula

 


   
(@digug)
White Dwarf
Joined: 2 years ago
Posts: 14
Topic starter  

@mabula-admin I began the 417-sub stack (I mistakenly said 419 previously) at 6:17 PM EST and it is still going, 25% complete with 369/417 integrated as of posting this at 6:58 PM EST. Disk space required: 201.9 GB. TrueNAS shows 52 GB of ZFS cache used. As for the RAID type, I used a simple stripe for maximum performance; data integrity is not much of an issue, since I replicate nightly to an external drive. Considering that APP is definitely more accurate than DSS, it doesn't surprise me that it takes much longer. The same stack in DSS took about 17 minutes. While APP is integrating, I see an average of about 100 W used by the CPU, whereas I've seen this CPU use more than 320 W and maintain its turbo boost clocks (water cooled). Is there room for APP to use more power to finish quicker?

 

Update:

42%, integrating pixels... 7:55 PM EST

 

Meanwhile, I've been researching how to set up a cache vdev in TrueNAS to see if it will help speed things up a bit. I have a 512 GB NVMe M.2 drive in my NAS server that can be repurposed.


   
(@digug)
White Dwarf
Joined: 2 years ago
Posts: 14
Topic starter  

The stack finally finished at 8:51 PM EST.


   
(@digug)
White Dwarf
Joined: 2 years ago
Posts: 14
Topic starter  

I added the cache drive, but it seems that APP still works at the same data rate as before during the integrating pixels step. I've seen this before in other software: games loading at around 500 MB/s from an SSD capable of 3000 MB/s. I believe it is just the way the software works. Single-threaded file transfers will use the full bandwidth of the bus, whereas complex calculations may not. So even if I ran APP on a stack of files from the fastest PCIe NVMe RAID array, for example, this integrating pixels step may not run much faster.


   
(@mabula-admin)
Universe Admin
Joined: 7 years ago
Posts: 4120
 

Hi George,

Okay, thank you for your testing and feedback. I actually think your stacking over the network is a latency issue then. APP makes quite a few I/O calls when doing the pixel-stack integrations. It reads pixels 0-1000 from each layer of the stack (each layer being a registered and normalized image), integrates those 1000 pixelstacks while reading the next pixels 1000-2000, and so on; for every 1000 pixels, new I/O reads are needed.

If the data is directly available from an SSD, this works really well, because SSDs have really low latency. The latency over the network interface is probably much higher; can you compare the latency numbers of an SSD drive and a network connection?
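One rough way to put numbers on that comparison (a hypothetical micro-benchmark, not an APP feature) is to time many small random reads against the same large file stored on the local SSD and on the NAS share; note that the OS page cache can flatter the local result, so use a file larger than RAM or drop caches first:

```python
import os, random, time

def avg_read_latency(path, reads=2000, chunk=4096):
    """Average seconds per small random read -- a stand-in for the per-call
    latency paid on each pixel-batch read during integration."""
    size = os.path.getsize(path)
    with open(path, "rb", buffering=0) as f:  # unbuffered: every read() goes to the OS
        t0 = time.perf_counter()
        for _ in range(reads):
            f.seek(random.randrange(0, max(size - chunk, 1)))
            f.read(chunk)
        elapsed = time.perf_counter() - t0
    return elapsed / reads

# Hypothetical paths: the same large file on the local SSD and on the NAS share.
# print(avg_read_latency(r"C:\app_work\test_frame.fits"))
# print(avg_read_latency(r"\\truenas\astro\test_frame.fits"))
```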

Mabula


   
(@digug)
White Dwarf
Joined: 2 years ago
Posts: 14
Topic starter  

@mabula-admin Well, what transfer rate are you seeing on your SSD during pixel integration? Yesterday I was testing some of my flats and only stacked 40 subs to make it go quickly, and during the pixel integration I saw transfer rates in the hundreds of Kbps in Task Manager. It finished quickly because, I'm guessing, it had much less to process?


   
(@stastro)
Black Hole
Joined: 4 years ago
Posts: 183
 

What are you setting as your working directory?

I have a local Optane SSD as my working directory (drive X:), while all of my lights and master calibration frames are on my NAS, which is on a 10G network. I do not have any issue with slow integrating pixels.

Integrating pixels writes data to your working directory, so if your working directory is on your TrueNAS then this will be slow; 10G Ethernet is nowhere near as fast as a local SSD, for example.


   
(@digug)
White Dwarf
Joined: 2 years ago
Posts: 14
Topic starter  

@stastro I've seen both reads and writes during pixel integration. What data rates do you see during the integrating pixels step?


   
(@stastro)
Black Hole
Joined: 4 years ago
Posts: 183
 

During normalization etc. it is reading from the network, and the most I peak at is around 3 Gbps; during the integration and creating of the pixels, I get around 100 MB/s written to disk.

My drive X: (working directory) is 2x 800GB P5800Xs in RAID 0.

The integrating pixels task for 70x OIII frames takes just under 45 seconds to complete.

And then it writes the completed stack to my NAS in a few seconds.


   
(@digug)
White Dwarf
Joined: 2 years ago
Posts: 14
Topic starter  

@stastro Thanks for the benchmark and info. Only 100 MB/s? I'll try setting my working directory to a local folder on my main SSD to see if it helps, and report back.


   
(@stastro)
Black Hole
Joined: 4 years ago
Posts: 183
 

@digug Basically this is what I do:

1. All light frames are located on the NAS
2. All master calibration frames are located on the NAS
3. My save location for the integration is located on the NAS
4. My working directory is a local disk (SSD); this is really important, as you want your working location to be as close to the CPU and memory as possible, which is where the term data locality comes from 🙂

For my integration, I have 31x Ha, 70x OIII and 29x SII frames to create the master lights.

Started at 20:30


20:33:18 Step 4: Registration started
20:33:35 Step 5: Normalize Started
20:34:38 Step 6: Integration Started
20:35:16 Started creating Ha Master Light (31 frames)
20:37:14 Started creating OIII Master Light (70 frames)
20:38:56 Started creating SII Master Light (29 frames)
20:39:22 Finished

So in less than 10 minutes I have loaded all frames, calibrated, registered, normalized, integrated and created 3 master frames from 138 frames of data, with each frame being 62 MP and 119 MB, so I would say it's quick enough 😀

Simon


   
(@digug)
White Dwarf
Joined: 2 years ago
Posts: 14
Topic starter  

@stastro Major improvement since I set the working directory to my local SSD! I don't know why I thought that the working directory and all the light/calibration frames had to be together. This 235-frame stack finished from calibration to integration in 7 minutes! Many thanks!


   
(@stastro)
Black Hole
Joined: 4 years ago
Posts: 183
 

@digug Awesome news, George. I use the same location for my working directory as well as for my PixInsight temp location.

Data locality is important with your working directory.


   
(@mabula-admin)
Universe Admin
Joined: 7 years ago
Posts: 4120
 

Hi Simon @stastro and George @digug,

Yes indeed, the problem must be the latency of each I/O call that reads/writes on the NAS while integrating, which is what happens when the working directory is on the NAS. And indeed, the working directory can be completely different from where your dataset lives, as Simon indicates. At least this shows that APP will integrate fast when the working directory is on an internal SSD with a fast interface to the CPU and RAM.

We could improve the performance for reads and writes to/from the NAS; it would need bigger read/write buffers for all the I/O operations, and then the latency would matter a bit less. The downside is that if you integrate hundreds of frames, the total read/write buffer in the integration step can become very big, because we give each CPU thread its own memory buffer to speed things up and keep multi-core processing efficient. The buffers per CPU thread are currently fixed at 4 KB, which is very efficient for a local SSD and hardly needs any RAM. If we increased that buffer to 256 KB, for instance, I think you would see much better performance over the network, but for a big stack it would require much more RAM.
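For a rough sense of scale (my own back-of-envelope arithmetic, assuming one buffer per frame per CPU thread, which may not match APP's internals exactly):

```python
def buffer_ram_gb(frames, threads, buffer_kb):
    """Worst-case RAM used by per-thread, per-frame read buffers, in GB."""
    return frames * threads * buffer_kb / 1024 / 1024

# 417 frames on a 13900K (32 threads):
print(buffer_ram_gb(417, 32, 4))    # ~0.05 GB with the current 4 KB buffers
print(buffer_ram_gb(417, 32, 256))  # ~3.3 GB with 256 KB buffers
```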

Would it be an idea for the user to be able to set this integration buffer size through the CFG menu?

I could also give both of you a test version that uses bigger buffers, and then you can test how much of a difference in performance it makes over the 10Gb network. Would you like to test that?

Mabula


   
(@stastro)
Black Hole
Joined: 4 years ago
Posts: 183
 

Posted by: @mabula-admin

I could also give both of you a test version that uses bigger buffers, and then you can test how much of a difference in performance it makes over the 10Gb network. Would you like to test that?

Hi Mabula

Increasing the buffers from 4 KB to 256 KB might introduce other issues for network traffic, especially if jumbo frames are not in use; it may result in more packet fragmentation. A standard Ethernet frame carries a payload of 1500 bytes, while jumbo frames increase this to 9000 bytes plus overheads, but not everybody has a network that supports jumbo frames, and they need to be configured end to end (client + switch + NAS).
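As a rough illustration of the difference (simple arithmetic using typical TCP payload sizes; exact numbers depend on header options):

```python
import math

def packets_per_buffer(buffer_bytes, payload_bytes):
    """How many Ethernet frames it takes to carry one read buffer."""
    return math.ceil(buffer_bytes / payload_bytes)

# A 256 KB buffer over standard 1500-byte frames (~1460 B of TCP payload)
# versus 9000-byte jumbo frames (~8960 B of TCP payload):
print(packets_per_buffer(256 * 1024, 1460))   # ~180 frames
print(packets_per_buffer(256 * 1024, 8960))   # ~30 frames
```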

Happy to do some testing though if you think it is worth it

Simon

 


   
(@digug)
White Dwarf
Joined: 2 years ago
Posts: 14
Topic starter  

@mabula-admin @stastro As satisfied as I am now that I'm using local storage for the working directory, it would be a neat option to be able to set the buffer size. I have 64 GB of RAM, so I have room to experiment with that. I also have my MTU set to 9000 on both the client and server adapters. The slowest throughput (file/data transfer) I've seen to/from my NAS over this 10GbE fiber connection was around 300 MB/s, and the fastest was 1.3 GB/s. Adding an SSD cache to my server's RAID array helps a lot too. Would increasing this buffer also help other steps in the process, like reading from the source directories?


   
(@stastro)
Black Hole
Joined: 4 years ago
Posts: 183
 

Posted by: @digug

and fastest was 1.3GB/s. 

You're breaking the speed barrier there; 10Gbit is only capable of 1250 MB/s at a maximum, and that's with absolutely no overheads whatsoever. Screenshot or it never happened 🤣

Try sending some smaller files, like 2K in size; I guarantee it'll fall below 300 MB/s then 🤣

One thing to avoid is using any form of active/active LACP/bonding; it causes nothing but problems and too much overhead. Desktop adapters also perform differently from server adapters, which is why I only use server adapters.

I generally get pretty good speeds to my NAS because my 10G network is limited to my office environment rather than the rest of the house. I just wish I could do the same with my broadband, as I'm stuck on 67 Mbit until fibre gets enabled later this year (hopefully).

My NAS is pure SSD, and NVMe at that; I share it with my Chia farm, which is all spinning disk.

 


   
(@mabula-admin)
Universe Admin
Joined: 7 years ago
Posts: 4120
 

Hi Simon @stastro and George @digug,

This buffer size only affects the data integration steps, so it affects creating the master calibration frames and the light frame integrations. I know that increasing the buffer size is very beneficial on conventional hard disks, and less so on SSDs because of their much lower latency. So if latency over the network is what is slowing things down, the increased buffers will probably help a bit 😉

I have opened an issue to make the buffer size controllable through the CFG menu 😉 I will try to get it into the next release, and then you can experiment a bit more with this.

Mabula


   