Cryptocurrency Forecast Project

Introduction

In the Cryptocurrency Forecast project, we collaborated with researchers at Bending Spoons to investigate the correlation between social media activity and cryptocurrency prices. Cryptocurrency prices are highly volatile, and cryptocurrencies are a frequent topic of online discussion. This led us to believe that data extracted from popular websites could help us predict cryptocurrency prices.

Starting Point

In our first meeting with the researchers at Bending Spoons, we were given minute-level price data for Bitcoin, Ethereum, Solana, Ripple, Doge, and Ape over a two-year period. Instead of building models that predict the actual cryptocurrency prices, we tried to predict price trends (upwards, downwards, or stable). Specifically, we tried to predict by what percentage the price was going to change over 5-, 10-, and 15-minute intervals, so that the models better reflect the return on investment for these digital assets. Our plan was to use a mixture of online trend data (likes, views, and other engagement metrics) and sentiment analysis of textual data (articles, posts, and comments).
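As a rough illustration of this framing, the sketch below (not our exact code) turns a minute-level price series into forward percentage returns and up/down/stable labels; the 0.1% band used to call a move "stable" is an example threshold, not a value from the project.

    import pandas as pd

    def label_returns(prices: pd.Series, horizon_minutes: int = 5,
                      stable_band: float = 0.001) -> pd.DataFrame:
        """Turn minute-level close prices into forward returns and trend labels."""
        future = prices.shift(-horizon_minutes)            # price horizon_minutes later
        pct_return = (future - prices) / prices            # forward return over the horizon
        label = pd.cut(pct_return,
                       bins=[-float("inf"), -stable_band, stable_band, float("inf")],
                       labels=["down", "stable", "up"])
        return pd.DataFrame({"pct_return": pct_return, "label": label})

    # e.g. label_returns(btc_close, horizon_minutes=10) for the 10-minute horizon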

Data Collection

We decided to use AlphaVantage as our news aggregator, and Twitter and Reddit as our social media sources, since both host large cryptocurrency communities.

Twitter 

Collecting data from Twitter went nowhere from the start, mainly because of the API limitations and pricing changes introduced after Elon Musk's acquisition of the company.

Reddit 

Collecting data from Reddit was not easy either: the official API did not return enough search results, so we decided to switch to the Pushshift API, an unofficial archive of Reddit posts and comments that allows more control over search requests.

However, after using the API for a while, it turned out to be unstable, and there were periods in which no posts were returned. So we switched to a more manual approach: downloading historical archives as zip files and filtering through those archives ourselves. This finally allowed us to retrieve posts reliably, though we still needed the official Reddit API to fetch the latest information about posts and comments.
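For illustration, here is a minimal sketch of the archive filtering step, assuming each downloaded archive has been extracted to newline-delimited JSON with one Reddit post per line; the subreddit names, keywords, and file name are examples rather than our exact configuration.

    import json

    SUBREDDITS = {"cryptocurrency", "bitcoin", "ethereum"}   # example subreddits
    KEYWORDS = ("bitcoin", "btc", "ethereum", "eth")          # example search terms

    def filter_archive(path):
        """Yield posts from crypto subreddits whose title mentions a tracked coin."""
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                try:
                    post = json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip occasional malformed lines in the dumps
                if post.get("subreddit", "").lower() not in SUBREDDITS:
                    continue
                title = post.get("title", "").lower()
                if any(kw in title for kw in KEYWORDS):
                    yield post

    # e.g. posts = list(filter_archive("RS_2022-01.ndjson"))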

The data collected from Reddit also needed to be cleaned up, since it contained a large number of bot posts and comments that often had little relevance to our research goals.
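The kind of rule we used can be sketched roughly as below; the bot account names and phrases are placeholders rather than our full filter list.

    KNOWN_BOTS = {"automoderator"}                       # example known bot accounts
    BOT_PHRASES = ("i am a bot", "this action was performed automatically")

    def looks_like_bot(post):
        author = (post.get("author") or "").lower()
        body = (post.get("selftext") or post.get("body") or "").lower()
        return (author in KNOWN_BOTS
                or author.endswith("bot")
                or any(phrase in body for phrase in BOT_PHRASES))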

Alpha Vantage 

Data collection from AlphaVantage proceeded smoothly thanks to their well-established API. AlphaVantage's data also includes pre-calculated sentiment values for the supported cryptocurrencies; however, we do not know how these values are calculated.
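A minimal sketch of how such data can be pulled from the NEWS_SENTIMENT endpoint is shown below; the parameter and field names reflect our reading of the public AlphaVantage documentation and should be checked against the current API reference.

    import requests

    def fetch_crypto_news(symbol="BTC", api_key="YOUR_KEY"):
        """Fetch news articles with pre-computed sentiment for one cryptocurrency."""
        resp = requests.get(
            "https://www.alphavantage.co/query",
            params={"function": "NEWS_SENTIMENT",
                    "tickers": f"CRYPTO:{symbol}",
                    "apikey": api_key},
            timeout=30,
        )
        resp.raise_for_status()
        feed = resp.json().get("feed", [])
        # Each article carries an overall sentiment score plus per-ticker scores.
        return [{"time": item.get("time_published"),
                 "title": item.get("title"),
                 "sentiment": item.get("overall_sentiment_score")}
                for item in feed]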

Calculating Sentiment Values

Calculating sentiment values was the next step, so that we could start integrating the social media data with the cryptocurrency price data. Since we had too many posts to run NLP on all of them, we decided to use the number of comments as a proxy for the sentiment of Reddit posts. For the news data, we used the sentiment values already provided by AlphaVantage; since we did not know how those values were calculated, we also used FinBERT to recalculate sentiment values for the AlphaVantage articles.
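A minimal sketch of the FinBERT step, using the publicly available ProsusAI/finbert checkpoint through Hugging Face transformers; the mapping from labels to signed scores is our own convention, and the exact label names should be verified against the checkpoint's configuration.

    from transformers import pipeline

    finbert = pipeline("sentiment-analysis", model="ProsusAI/finbert")
    SCORE = {"positive": 1.0, "neutral": 0.0, "negative": -1.0}

    def finbert_scores(texts):
        """Map each text to a signed sentiment score in [-1, 1]."""
        results = finbert(list(texts), truncation=True)
        return [SCORE[r["label"]] * r["score"] for r in results]

    # e.g. finbert_scores(article["title"] for article in news_feed)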

Running Models

While part of our team was still collecting news and social media posts, we ran autoregressive models on the returns. We tried different combinations of log transformations and lag orders, but found no evidence that this was a viable approach.
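A sketch of the kind of autoregressive baseline we mean, using statsmodels; the lag order and the log-return transform here are illustrative rather than the exact configurations we tried.

    import numpy as np
    from statsmodels.tsa.ar_model import AutoReg

    def ar_baseline(prices, lags=10):
        """Fit an AR(lags) model on log returns and return the fitted results."""
        log_returns = np.diff(np.log(np.asarray(prices, dtype=float)))
        return AutoReg(log_returns, lags=lags).fit()

    # e.g. print(ar_baseline(btc_close, lags=15).summary())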

We then turned to neural networks, which were the core of our project. We started with dense networks and temporal convolutions to predict the next prices, using as input the past returns and some indirect indicators of social media activity (such as the number of posts in the time window). We got somewhat promising results with this method and decided to automatically label the sentiment of the social media posts. We used FinBERT, a version of BERT pre-trained on financial text, but unfortunately we could not fine-tune it ourselves or even evaluate all Reddit posts due to the sheer amount of computational power needed. We nevertheless managed to label the news articles, and with this additional input the model improved significantly. We reached a balanced accuracy of 34.5% when predicting whether a cryptocurrency would go up, go down, or stagnate. While this is only slightly above the 33.3% chance level for three classes, it still suggests that a correlation exists, which was our aim.
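For concreteness, here is a minimal Keras sketch in the spirit of these models: a causal 1-D convolution over a window of past returns and simple social-media indicators, followed by dense layers and a three-way softmax (down / stable / up). The window length, feature count, and layer sizes are illustrative, not our tuned architecture.

    import tensorflow as tf

    WINDOW = 60      # minutes of history per sample
    N_FEATURES = 3   # e.g. return, post count, news sentiment

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(WINDOW, N_FEATURES)),
        tf.keras.layers.Conv1D(32, kernel_size=5, padding="causal", activation="relu"),
        tf.keras.layers.Conv1D(32, kernel_size=5, padding="causal",
                               dilation_rate=2, activation="relu"),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(3, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # e.g. model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=20)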

Literature Review

    • Literature highlights important specifics of the Bitcoin market, including its susceptibility to large price fluctuations. Unlike stock markets, the Bitcoin market does not close, which makes it more welcoming to noise traders. 
    • Using the news is a viable choice for inferring sentiment because news is professionally and precisely written, reaches a broad audience, focuses on short-term sentiment, and provides a market-level view at a given date. However, the news does not always capture insider information the way corporate documents do, and it focuses on past events. Internet-expressed sentiment is different: because the internet is unregulated, all sorts of traders can openly display their views and opinions, and it is therefore not likely to contain any added information. 
    • Literature also describes the so-called “herd behavior,” where noise traders imitate the lucky noise traders who earned high returns. Past research shows that herd behavior increases market volatility. 
    • One possible method for sentiment analysis is the dictionary-based approach, where words and their occurrences in the text are analyzed (a toy sketch follows this list). However, this approach does not work well if the text contains sarcasm, jokes, or other indirect language. Nevertheless, via regression and VAR-Granger analysis, no evidence was found that the sentiment of news from leading international news providers affects Bitcoin returns, either positively or negatively. 
    • There are machine-learning approaches that use neural networks. However, they require the data to be labelled. 
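As a toy illustration of the dictionary-based approach from the list above: count occurrences of words from positive and negative word lists and combine them into a score. The word lists here are placeholders; real studies use finance-specific lexicons.

    POSITIVE = {"gain", "bullish", "surge", "rally"}
    NEGATIVE = {"loss", "bearish", "crash", "plunge"}

    def dictionary_sentiment(text):
        words = text.lower().split()
        pos = sum(w in POSITIVE for w in words)
        neg = sum(w in NEGATIVE for w in words)
        total = pos + neg
        return 0.0 if total == 0 else (pos - neg) / total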

Possible Future Development

    1. Clean up the Reddit data even further. 
    2. With more time, sentiment scores for Reddit posts could be calculated and weighted by comment counts (see the sketch below). 
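A sketch of what point 2 could look like, assuming each post carries a FinBERT sentiment score and a comment count; this is a proposal, not something we implemented.

    def weighted_window_sentiment(posts):
        """posts: dicts with 'sentiment' in [-1, 1] and 'num_comments'."""
        total_comments = sum(p["num_comments"] for p in posts)
        if total_comments == 0:
            return 0.0
        return sum(p["sentiment"] * p["num_comments"] for p in posts) / total_comments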
