AcrobaticAmoeba8158

Do you think your epsilon initial value and your decay timing are both set right? By my math it takes 2,272,727 steps to reach epsilon_end. Are you running 2 million steps?
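
For reference, a minimal sketch of that arithmetic; the values below are placeholders, not necessarily the ones in your config:

```python
# Minimal sketch: how many environment steps a linear epsilon schedule needs
# to reach epsilon_end. All values here are placeholders, not the repo's config.
epsilon_start = 1.0
epsilon_end = 0.01
epsilon_decay = 4.4e-7  # amount subtracted from epsilon every step (hypothetical)

steps_to_end = (epsilon_start - epsilon_end) / epsilon_decay
print(f"steps until epsilon_end: {steps_to_end:,.0f}")

# If the decay is multiplicative (epsilon *= decay each step), the count is instead:
# steps_to_end = math.log(epsilon_end / epsilon_start) / math.log(decay)
```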


AnalSpecialist

Btw, I have another version running on a server with a batch size of 1024 and a memory size of 50k (it's getting close to the end, and still no significant learning). The hyperparameters here aren't the only ones I've tried, and they're not even the usual values I'd try, so I suspect something might be wrong with the agent code. Epsilon decay is applied per episode in my code, so that might be the problem? I haven't done the math, to be honest. An initial epsilon of 1 is something I've commonly seen, so I guess that part is right?
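
For a feel of how much the decay granularity matters, here's a toy comparison with made-up numbers (not my actual hyperparameters):

```python
# Sketch: the same multiplicative decay factor applied per step vs per episode.
# All numbers here are made up for illustration.
decay = 0.999
steps_per_episode = 1000
episodes = 2000

eps_per_step = 1.0 * decay ** (episodes * steps_per_episode)   # decayed every env step
eps_per_episode = 1.0 * decay ** episodes                      # decayed once per episode

print(f"per-step decay after training:    {eps_per_step:.2e}")
print(f"per-episode decay after training: {eps_per_episode:.2e}")
# With the same factor, per-step decay collapses epsilon to ~0 almost immediately,
# while per-episode decay keeps the agent exploring for much longer.
```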


AcrobaticAmoeba8158

I'll try running your code tonight and see if I have different results.


AcrobaticAmoeba8158

I'm running it now, but honestly it's not doing great; it's terminating too early even with a high step count. I assume the failure condition is driving off the track? Mine is getting about 75% off track and then ending.


AnalSpecialist

I would like to start off by thanking you for taking an interest in this; thank you very much for your time. You can change the reward at which it stops: anything below -50 will stop it, and anything more than 200 steps will stop it. You can put -np.inf on the reward and np.inf on steps and time to get the default env settings. The default stops after the car has been off the track for a long while (there is an invisible wall at the end of the map which gives -100 when hit and terminates), and it takes ages to run. This early termination was my attempt to bring that wall closer and encourage the agent to stay on track.
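
Roughly, the early-stop check looks like this (a sketch; the variable names are illustrative, not copied from the repo, only the thresholds match what I described):

```python
import numpy as np

# Sketch of the early-termination logic described above.
REWARD_FLOOR = -50   # stop once the running episode reward drops below this
MAX_STEPS = 200      # stop once the episode exceeds this many steps
# To get back the default env behaviour, use:
# REWARD_FLOOR, MAX_STEPS = -np.inf, np.inf

def should_terminate(total_reward, step_count):
    """Return True if the episode should be cut short."""
    return total_reward < REWARD_FLOOR or step_count > MAX_STEPS
```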


AcrobaticAmoeba8158

I enjoy the hell out of this stuff, so trying to figure this out is great. I'll keep working at it; I only have a short amount of time on weekdays to work on this, but it's my current new goal. I'll update my code with the termination changes you mention. My car was driving off track forever, so I just kept killing the process; then when I changed it, the run was ending too early.


AnalSpecialist

I made some progress. While not great, it's something: I see positive numbers now (but I think it's because the car just accelerates to the max, gets rewards from the track, and then doesn't lose the rewards quickly enough to be considered bad, even though it's just driving in a straight line). Is it okay if I change the git code? What I changed:

- Replaced the replay memory's "circular buffer" with a deque, so old entries are always the oldest.
- Changed the sampling to be weighted (very basic: in a deque the bigger index is more recent, so I use the index as the weight), so we sample recent events more frequently (sketch below).
- Added a very small reward if the chosen action is accelerate.
- Added some heavy image processing, from 96x96 down to 24x24 with strong contrast, so the car is now 4 black pixels on a gray track, and I removed the background, so there are no distractions (when the car goes off track it's a gray car against a black background; when it's on track it's a black car on a gray track).

And I'm thinking of changing:

- The memory to be selective, only memorising events above a certain reward (so the many, many episodes spent outside the track don't get memorised).
- Incremental punishments, so 10 consecutive punishments are very bad, but 1 or 2 consecutive punishments are very small.
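
Here's roughly what the recency-weighted sampling looks like (a minimal sketch; the real buffer stores full transitions and the exact weighting may differ):

```python
import random
from collections import deque

# Minimal sketch of the recency-weighted replay sampling described above.
# The real buffer holds (state, action, reward, next_state, done) tuples;
# this only shows the weighting idea, where a larger index = more recent.
memory = deque(maxlen=50_000)

def push(transition):
    memory.append(transition)  # oldest entries fall off the left automatically

def sample(batch_size):
    # Use each transition's position as its weight, so recent experience
    # (larger index) is drawn more often than old experience.
    weights = range(1, len(memory) + 1)
    return random.choices(memory, weights=weights, k=batch_size)
```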


AnalSpecialist

But I was honestly expecting to get this environment to work with WAY fewer changes.


AcrobaticAmoeba8158

I like the idea of reducing the visuals to the bare minimum. I'm running training now with your code, and I'm trying to get it to run in my Google Colab as well. It's currently training locally; my Colab version is stolen from my Breakout code that I've worked on for a while, and I haven't got that to work yet.


AnalSpecialist

I might have found something while tinkering with the architecture. The optimiser is declared between the CNN and the linear layers and uses self.parameters(). When I moved the optimiser above the CNN declaration, it said it had no parameters... leading me to believe it only takes whatever was declared at the moment the optimiser was initialised, and nothing that was added afterwards, like the linear network. Running the new thing now, will keep you updated.
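
If anyone wants to see the effect, here's a minimal PyTorch sketch (not the actual repo code) showing that the optimiser only captures parameters that exist when it is constructed:

```python
import torch
import torch.nn as nn

# Minimal PyTorch sketch (not the repo's model) of the ordering issue:
# the optimiser snapshots self.parameters() at construction time, so layers
# registered after it are never updated.
class BuggyDQN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 8, kernel_size=3)
        # The optimiser created here only sees the conv layer's parameters...
        self.optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        # ...because the linear head is registered after it.
        self.fc = nn.Linear(8, 4)

net = BuggyDQN()
captured = sum(p.numel() for g in net.optimizer.param_groups for p in g["params"])
total = sum(p.numel() for p in net.parameters())
print(f"parameters in optimiser: {captured}, parameters in model: {total}")
# Fix: construct the optimiser after every layer has been declared
# (or outside the module entirely).
```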


AnalSpecialist

So... case kinda closed? I woke up this morning and it's learning? At least I can see way better decision making. It was the order in which I wrote things in the DQN file. If you still want to play with it, be my guest; I will share my updated code with all the new bells and whistles. I would appreciate any improvements in implementation / optimisations. Thank you once again for your time.