
LearnAnything.io

5 / 25 / 24

10 min read

TLDR

I entered a Google AI Hackathon and used Gemini to build a website that generates course content and practice questions for any topic you can imagine. Through the process I found some useful techniques for hiding and parallelizing model inference, got a better sense of the differences between GPT-3.5 and Gemini, and set a new benchmark for the quality of UI/UX I can produce myself.

This is my second project that tries to use AI tools for something beyond a basic chatbot or some sort of assistant confined to a text box (the last one is on the previous page if you're interested). As a bit of extra motivation, I entered it into a hackathon (namely the Google AI Hackathon). The hackathon has already ended and I didn't win, but it was still a fun experience and turned out to be the biggest web development project I've completed independently so far.

Duolingo for Anything

The idea was to make some sort of structured content generation site (similar to Epicure AI) that was actually useful for people. I came up with the idea for a quiz-based learning application that was 'smart' enough to reliably teach you meaningful things about literally any topic you approached it with.

This would address a need for more organized online learning, specifically for topics that may not have any existing formal courses or are just difficult to find materials for in general. Additionally, it would be very helpful when studying for things like certifications, which often keep a lot of the quality practice problems (or at least the explanations) behind paywalls.

I spent a good amount of time building this whole idea out and made a nice hosted version that's easy to get started with at learnanything.io.

Generating Content (and Trying to Hide It)

Being part of the Google AI Hackathon, this project had to use Gemini. The actual content generation was pretty easy: some strategic prompting to create topics, subtopics, questions, answers, explanations, etc. in explicit JSON schemas. The two main issues I came across were inconsistencies when asking Gemini for a certain JSON format (which I could work around for the most part with the removeCodeBlock method below) and the general slowness of inference, which made these responses take anywhere from 10 to 20s when just asking for everything at once.

export function removeCodeBlock(str) {
  if (typeof str !== "string") {
    return str;
  }
  // Strip a surrounding ```json ... ``` fence that Gemini sometimes adds,
  // leaving just the raw JSON payload.
  const jsonRegex = /^```json\n([\s\S]*)\n```$/;
  const match = str.match(jsonRegex);
  return match ? match[1] : str;
}

It's a little silly that something like this was necessary. To Google's credit, they did release this functionality, but only with the Gemini 1.5 Pro model, which was released partway through the hackathon.
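For reference, here's a minimal sketch of what that looks like with the @google/generative-ai Node SDK (my own setup went through GCP, so treat the client setup and the helper below as illustrative rather than the exact code from the project):

import { GoogleGenerativeAI } from "@google/generative-ai";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);
const model = genAI.getGenerativeModel({
  model: "gemini-1.5-pro",
  generationConfig: {
    // Ask for raw JSON instead of a ```json fenced block.
    responseMimeType: "application/json",
  },
});

// Hypothetical helper for illustration, not the project's actual prompt.
export async function generateQuizJson(topic) {
  const result = await model.generateContent(
    `Generate 3 quiz questions about ${topic} as a JSON array of ` +
    `objects with "question", "answer", and "explanation" fields.`
  );
  return JSON.parse(result.response.text());
}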

Since the responses were in JSON format, it didn't work well to just stream them and display content as it loaded. The best ways I found to reduce the apparent loading time were:
  • "Hiding inference" by generating content in the background
  • Parallelizing inference wherever possible

Hiding inference meant starting the content generation process well before the user might need to look at it. For example, starting on the first few questions of the next quiz before the user has actually started it, and then generating additional questions after they start the quiz but before they would reach them. The pattern boiled down to generating and caching content as soon as the information needed to prompt became available, but this has a few considerations that go with it.
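As a rough sketch of the pattern (generateQuestions here is a hypothetical helper standing in for whatever wraps the Gemini call), the trick is to cache the in-flight promise rather than the finished result:

// Kick off generation as soon as the prompt info exists, cache the pending
// promise, and await it only when the user actually needs the content.
const pendingQuestions = new Map();

export function prefetchQuestions(quizId, topic) {
  if (!pendingQuestions.has(quizId)) {
    // generateQuestions() is a hypothetical wrapper around the Gemini call.
    pendingQuestions.set(quizId, generateQuestions(topic));
  }
  return pendingQuestions.get(quizId);
}

// Later, when the user opens the quiz, the work is (hopefully) already done.
export async function getQuestions(quizId, topic) {
  return prefetchQuestions(quizId, topic);
}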

The most obvious is that by pre-generating lots of content, the user might never actually reach it and you've 'wasted' some compute, which I was fine with because the cost of inference is fairly low. Another (which is not necessarily bad) is that your UX needs to be designed such that the user has something to do while the content is being generated in the background. This can sometimes be difficult if you expect to use content immediately after you obtain the information necessary to prompt (like with the 'searching for topics' page), which will sometimes just require a loading state. Some other ways I found to nicely accomplish this in my app included:
  • Generating course content in the middle of the knowledge assessment users take just beforehand (this makes the last couple questions of the assessment meaningless, but with the benefit of essentially zero load time after moving to the course page).
  • Adding forced waits in places that made sense (in case users wanted to speed through questions by guessing and continuing, I implemented a 1.5s wait before you could press 'Next question', which gave the background endpoint more time to complete and load the next content in those scenarios; see the sketch after this list).
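The forced wait boils down to racing the background generation against a minimum delay; this sketch assumes nextQuestionPromise is the cached in-flight generation from the prefetch pattern above:

// "Next question" only unlocks once both the minimum delay and the background
// generation have finished, so fast guessers never outrun the content.
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

export async function advanceToNextQuestion(nextQuestionPromise) {
  const [nextQuestion] = await Promise.all([
    nextQuestionPromise,
    delay(1500), // minimum time before "Next question" becomes clickable
  ]);
  return nextQuestion;
}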

Parallelizing inference sounds like what you would expect: having multiple copies of the same prompt running in order to generate several pieces of content of the same type in parallel. The main consideration here is making sure you only parallelize content that doesn't have a chance to conflict with other parallel results. For example, when I tried to generate all of the quiz questions for a single quiz in batches of 3 simultaneously, it was a frequent issue to see very similar questions and answers across batches.

To solve this, I found it best to still generate those batches synchronously, but to parallelize where I could by working on the questions for all of the different quizzes at once (each individually in synchronous batches of size n), and the same with subtopics within a topic. Note that in doing this, it is still necessary to give the model context of all of the content it has already generated so that it doesn't repeat itself.
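A rough sketch of that structure (generateQuestionBatch is a hypothetical wrapper around a single Gemini prompt, not the project's actual code) looks like this:

// Batches within a single quiz stay sequential so each batch can see the
// questions already produced and avoid duplicates.
async function generateQuizQuestions(quiz, totalQuestions, batchSize = 3) {
  const questions = [];
  while (questions.length < totalQuestions) {
    // Pass prior questions as context so the model doesn't repeat itself.
    const batch = await generateQuestionBatch(quiz, batchSize, questions);
    questions.push(...batch);
  }
  return questions;
}

// Different quizzes don't share context, so they can safely run in parallel.
export function generateAllQuizzes(quizzes, totalQuestions) {
  return Promise.all(
    quizzes.map((quiz) => generateQuizQuestions(quiz, totalQuestions))
  );
}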

Also note that if you can avoid the need for synchronous batches (and can just do everything in one batch), that is great; it will just come with a longer run time, which can mess with your attempts to hide content generation and, in my case, hit timeouts on your serverless functions.

How It Turned Out

I was very satisfied with the final product in that, aside from a couple of spots where AI was more obvious or explicitly advertised, the app didn't always feel like it was using it. This was in large part due to the project not being some sort of chatbot, but also because the use of inference hiding and parallelization made most of the content feel like it had already existed before the user clicked into it.

Beyond this, I also felt like this project was a good experience for me to spend time designing a clean UI / UX using all of the product experience I've gotten through my job. Some little niceties that I was excited to include were:
  • Seamless progress syncing between logged out / logged in users
  • The ability to try the app and locally store progress before making an account
  • Smart redirection to the page you were on before logging in
  • Complete responsiveness across all screen sizes
  • Some little CSS animations (button clicks, chat box opening, etc.)
Most of these I might not have included a year or two ago, but they felt much more natural to think of and include in this project. I was happy to see myself actually care about a clean user experience.

Here is a demo video of the website (though I encourage you to visit it yourself):

A Note on Gemini vs GPT-3.5

Having made a project with both OpenAI's GPT-3.5 and Google's Gemini, I have a few comparisons to make:

Response Formatting: As I mentioned above, it was difficult to get Gemini to behave 100% consistently with regard to the JSON format I wanted for my responses. GPT-3.5 has the response_format param which allows for this, and Gemini only just caught up with 1.5 Pro. It would be nice to see arbitrary response formatting from both, allowing API users to define response shapes beyond JSON and a few other defaults.
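For comparison, JSON mode with the OpenAI Node SDK is about as simple as it gets (this is a sketch, not the project's exact code; note that json_object mode requires the prompt itself to mention JSON):

import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Hypothetical helper for illustration.
export async function generateQuizJson(topic) {
  const completion = await openai.chat.completions.create({
    model: "gpt-3.5-turbo",
    // Forces the model to return a valid JSON object.
    response_format: { type: "json_object" },
    messages: [
      {
        role: "user",
        content: `Return a JSON object with a "questions" array of quiz questions about ${topic}.`,
      },
    ],
  });
  return JSON.parse(completion.choices[0].message.content);
}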

Input Multimodality: Gemini offered integrated text, image, and speech comprehension, which gave it an advantage over GPT-3.5. I didn't use other modalities for this project, but it was nice to have that option available. GPT-4 can also handle multimodal inputs, however, and does so in a way that I thought was easier to use.

Inference Speed: Text generation with both GPT-3.5 and Gemini was slow. I noticed slightly slower times with Gemini, but using either for a production application would require some of the practices I mentioned above for hiding and/or parallelizing inference.

API Ease-of-use: I found the GPT-3.5 API to be much easier to use and configure than Gemini's. With Gemini, everything is built into GCP which meant I needed to set up a service account and do some extra configuration in my application. By comparison, GPT-3.5 really only needed an API key for setup. The documentation for GPT-3.5 was also much better than trying to figure out how to use some of the more advanced features of Gemini.

Overall, I think GPT-3.5 was much better dev-experience-wise, though Gemini did have options for multimodal inputs. When using either, however, I think inference hiding / parallelization would improve almost any application.

Resources