The Stack includes 3 TB of code across 30 programming languages, roughly three times larger than the next-largest public code dataset. BigCode includes only code under permissive software licenses (MIT, Apache 2.0, etc.) and provides an opt-out process for developers to remove their code from the dataset.
The paper shows that transformers can improve themselves autonomously through trial and error without ever updating their weights. No prompting, no finetuning. A single transformer simply collects its own data and maximizes rewards on new tasks.
CarperAI, a new research lab within the EleutherAI research collective, is releasing an "instruction-tuned" large language model trained using Reinforcement Learning from Human Feedback (RLHF), in effect an open-source equivalent of GPT-3.
Meta's Universal Speech Translator project makes it possible to train AI models on languages that are primarily oral and do not have a standard or widely used writing system. Meta built and shared an AI translation system for a primarily oral language, Hokkien.
Google releases a model for generating videos from text, with prompts that can change over time and videos up to several minutes long.
The blueprint is intended to "help guide the design, use, and deployment of automated systems to protect the American public." Its provisions are currently non-regulatory, non-binding, and not yet enforceable.
Meta releases a paper on text-to-video generation using an improved model design that 1) accelerates training, 2) removes the need for paired text-video data, and 3) generates more varied and expansive videos than before.
Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. It is effective with accented speech, background noise, and technical language. It works in multiple languages and can translate those languages into English.
Adept Labs announces Action Transformer 1 (ACT-1), a model that can control software from human requests. (For example, search Zillow or add new records to Salesforce.)
GitHub releases Copilot: "an AI pair programmer" to suggest and generate code.
Google releases one of the largest LLMs to date, with breakthrough capabilities on a wide range of tasks such as reasoning, multilingual tasks, and code generation.
DeepMind entered a neural network-based model in the 14th Critical Assessment of protein Structure Prediction (CASP14), demonstrating accuracy of over 80% and greatly outperforming previous methods.
DALL·E is a 12-billion-parameter version of GPT-3 trained to generate images from text descriptions using a dataset of text–image pairs.
BERT revolutionized NLP and paved the way for many LLM developments. It popularized the idea of pre-training on large text corpora to create a general NLP model applicable to many tasks.